1 Introduction

Modern computational environment is versatile and distributed in nature with the growth of number of users and their expectations (Robertson et al. 2009). These systems interact more with outside world. They are not limited to a closed room or group of expert users. Increasing complexity is one of the challenges in large-scale distributed systems. Although numerous approaches and methods have been developed by researchers to deal with this ever-increasing complexity in software systems, but a challenge arises in such a system is how to make them workable in a changing environment. Such issues emerge at an operational time (runtime) hence cannot be addressed at the design phase. Self-adaptiveness is a response to address all the issues like high cost incurred in human supervision, managing complexity, managing automation, robustness, and attaining all intended quality requirements within a reasonable cost and time range during operation (Laddaga et al. 2001).

Any internal or external change forces self-adaptive systems to change themselves or to adapt alternative behavior. Some unforeseeable situations, not examine at the time of testing generally called faults and failure evolve internal changes. External changes comprise changing user expectations, changing system context, attackers’ activity, and change in input resources (Andersson et al. 2009). A complete self-adaptive system should acquire all *-self properties, a suggestive solution by IBM known as autonomic computing (AC). These characteristics include self-optimization, self-configuration, self-healing, self-protecting, self-awareness, and context-awareness (Kephart and Chess 2003; Babaoglu et al. 2005).

Self-healing is one of the important aspects of self-adaptive or autonomic computing. It contributes towards adaptiveness dealing with internal changes i.e. system failure or faults. Software systems, capable of healing themselves from system faults and surviving against harmful attacks, improve reliability and maintainability. Self-healing is a notion to gain all these benefits. Self-healing systems reinforce maintainability by supporting active decision-making. These systems detect the faults without human intervention as well as recover from detected faults, as they are recovery-oriented in nature. Moreover, these systems can prioritize multiple faulty components and select the most appropriate recovery plan for healing (Ghosh et al. 2007). A broken and unhealthy state of a system is any unacceptable behavior depending upon user’s preferences and environment changes. No discrete distinction is defined between normal and broken state hence there is always a degraded state separating these two as shown in Fig. 1. However, a system definition uses various parameters for discriminating between healthy and unhealthy state. Figure 1 presents a comprehensive view of the self-healing system. It shows that during systems’ execution, it may jump into an unacceptable state that is called degraded or unhealthy. A module, which is continuously monitoring the system, senses and sends an alarm to the healing component. After a successful recovery action, the system comes back in its normal execution process. The property of healing mechanism is maintained through not affecting other parts of the system. Figure 1 shows that how a self-healing system can mitigate itself by detecting a degraded state (state 3) and then healing it by roll-back recovery mechanism and restarts execution of critical part of dependent states (state 2 and state 3). At this level, decision making is required to initiate the process of mitigation based on the rules and policies defined for the system, and thereby maintain system health. Runtime management of a system takes a large volume of time and is often costly. Therefore, a self-healing approach is anticipated to continuously monitor the system and take suitable and cost-effective action promptly. Systems having self-healing property must be capable of reducing all breakdowns and assuring the availability of applications all the time.

Fig. 1
figure 1

Self-healing system

The objective of self-healing systems is to provide survivable systems with maximum reliability design specification towards preserving availability (Ganek and Corbi 2003). Survivable systems operate correctly without any abrupt change in behavior, despite the existence of a harmful and inconsistent fault. However, to mitigate from isolated fault after detection several approaches have been developed. But if fault influences another part of a system, possibly in distributed systems that require a large number of interactions among components before detection. This can further propagate fault and can lead the system to a non-workable state (Merideth and Narasimhan 2003). Merideth (2003) proposed that survivability of a software system is enhanced with the use of proactive fault-containment.

Hence, in this article authors address the challenges of fault recovery from architectural aspect so as to acquire maximum efficiency. Using multi-agent architecture specific to functionalities, the methodology attempts to address the importance of planning, functioning and enacting as key roles for the success. However, quantifiable metrics to measure survivability over incurred overheads with the selection of suitable failure detection mechanisms and decisions towards effort initiation are few challenges. Self-healing approach provides a framework that considers these issues and makes a system more survivable (Rajput and Sikka 2019).

The rest of this paper is structured as follows. Section 2 provides an exploration of different recovery mechanisms used in various approaches for auto-repair software systems. Section 3 offers an overview of multi-agent architecture with its importance and advantages over other software architectures. It shows a brief comparison between multi-agent architecture, client–server, and service-oriented architecture. In Sect. 4 a survey of multi-agent architecture models is presented. Section 5 represents a detailed multi-agent architecture for self-healing software systems. The section explains each component of the proposed model in detail. Section 6 addresses and explains the model and implementation of proposed architecture based on an information system. Section 7 evaluates the effectiveness of the implemented model in terms of response time and time to recover. Section 8 summarizes this work with some key points towards contribution and future work.

2 Fault recovery

Software faults may occur due to any incorrect step, process, or data definition which results in system behavior in an intended manner. The source and cause of a fault may vary from designing a software system to effect of a malicious attack. Sometimes these faults lead a system failure in which the system is not capable to perform its intended operations. Researchers developed various approaches to handle software faults. Few of them are fault prevention, fault removal, and fault tolerance. First two methods are highly considerable at design phase and testing phase of a system. Fault tolerance offers assurance towards workability in the presence of fault (Arlat et al. 1993).

Fault-tolerant systems are redundant, having alternatives to its operations, one failed use other. Similarly, another approach survivable systems also work with the intension that not to repair the broken part of the system. While fault recovery, an important aspect of self-healing software aims towards diagnosing and repair. Fault recovery is a step taken after occurring the fault.

In Gray (1999) addressed highly available systems, capable of broadly organize themselves during operation state. The author identified one of the most challenging research areas for future computation. Lampson (1999) has called for systems with the ability to adapt a changing context are highly available, always meet their specification, become mature during execution, and progress in an unlimited practical context. Later on, Hennessy (1999) identified some challenges that are crucial than performance for modern computational: availability, maintainability, and scalability.

Approaches used for fault recovery or Recovery-Oriented Computing (ROC) is an attempt to recover from any type of fault occurred during execution time (Patterson et al. 2002). ROC is more concerned about magnifying dependability by reducing system recovery time, supports maintainability thereby increase availability. Fault recovery focuses on the revival of faulty components instead of becoming resilient to not to fail. At the design stage, developers aim to reduce mean time to fail (MTTF) while at the operational stage recovery mechanism attempts to reduce mean time to recover (MTTR). Automated recovery lessens the time needed in human intervention. Modern distributed systems comprise multiple application processes and their interaction over a network. To achieve failure-free computation, rollback recovery provides fault tolerance by storing the state of the processes periodically. As the failure encountered, system restarts a failed process from one of its saved states. One approach in regards to this system recovery says making checkpoints, specific states of the system, stored during execution. Checkpoints are selected in a way so that in case of failure system is capable of regenerating itself (Baker and Sullivan 1992). Log-based recovery, another approach that comes under rollback recovery broadens checkpoint-based recovery with determinants. In a distributed environment where processes are communicating through message passing, rollback recovery becomes complicated due to interdependencies. Rollback propagation occurs due to such dependencies. There is a trade-off between simplicity, robustness, scope of recovery, and overheads required for fault recovery. Pessimistic logging, a simple and robust method induced more overheads whereas optimistic logging reduces overheads incurred but makes recovery complicated and expands extent of rollback (Elnozahy et al. 2002).

Fault recovery has been proven a principal aspect of self-healing. Researchers developed numerous approaches and frameworks. PANACEA (Breitgand et al. 2007) utilizes annotations as a monitoring element and haler agents are appointed for recovery. Healer agents have different procedures to recover from failure. A similar approach is used with a garbage collector to utilize memory efficiently and recovering failure due to lack of memory (Goldstein et al. 2007). In service-oriented software, recovery mechanism deals with service failure, not at component fault (Montani and Anglano 2008). Automated patching is another attempt to address a fault recovery. It ensures not to occur the recovered fault in the future (Azim et al. 2014). SHõWA (Magalhães and Silva 2015) employ a recovery planner based on a defined procedure. Architecture based approaches nurture reconfiguration of the system as a response of recovery from failure (Dashofy et al. 2002; Garlan et al. 2004). Wenbin et al. proposed a prediction based decision support system to analyse system’s behaviour and to take a corrective action. It is an attempt to make the cloud services available. It reduces downtime and hence increases productivity of the system. Parallel simulation on real input helped for predicting health of the system (Dai et al. 2018).

3 Multi-agent architecture

From various viewpoints, three different technologies have been identified that are considered in the context of distributed software systems development. These include client–server architecture, service-oriented architecture, and multi-agent system architecture.

Service-oriented architecture comprises loosely coupled services. Services are offered in the form of simple interfaces that provide abstractions to the user. Only describes service and interaction patterns, all information not relevant to the user is hidden. The business domain adopted it as it provides an inter enterprises relationship. In client–server architecture, with the increase of users, congestion occurs. These systems are expensive in terms of maintainability. Client–server architecture offers a centralized and more secure environment. Multi-agent systems comprise autonomous agents, coordinate each other. These systems are reactive as well as proactive with adaptable nature. SOA provides flexibility by calling service with its name while the agent calls another based on its capability of achieving the goal (Ribeiro et al. 2008). Table 1 contrasts multi-agent systems with client–server and service-oriented approaches.

Table 1 Differentiation between client–server, SOA and MAS

Development of high-quality software for real-world applications is marked as a difficult task. It is one of the outcomes of “essential complexity of software” (Brooks 1995). Further, such complexity incurred due to large-scale and distributed systems having a large number of components and interaction among them (Simon 1996). However, various methods and approaches have been developed to design distributed systems, the study says that those methods rigidly define the interaction between multiple computational parts of a system and have an insufficient mechanism to represent system’s inherent organization (Jennings 2000). For solving large-scale, complex, and decentralized problems, multi-agent systems are very efficient and manageable (Jennings and Wooldridge 2000; Wooldridge 1997; Kamdar et al. 2018). Multi-agent systems are comprised of multiple components called agents. A software agent’s behavior is autonomous and flexible. An agent can take any action without human intervention to achieve its objective for which it is designed. Each agent works as a problem solver with its defined boundaries and interfaces. Comprises of sensors and effectors to receive input and act on the environment in which it is embedded. Moreover, in a multi-agent environment, agents interact with one another. The purpose of the interaction is either to acquire its individual goal or to manage dependencies among all.

4 Related work

To overcome the risk of resource fault in cloud infrastructure, autonomous resource protection and healing can play an important role. Keeping it in mind Meriem and Walid proposed an intelligent cloud infrastructure that uses different recovery tactics to recover from failure of resources. The approach utilizes checkpoints, replication, and migration fault tolerance strategies. The architecture defines each node as a multi-agent system as depicted in Fig. 2.

Fig. 2
figure 2

Multi-agent architecture for intelligent self-healing cloud

However, the infrastructure is not able to address the analysis of failure events in terms of their implications and performance. The authors did not consider any fault tolerance parameter to define algorithm optimally. Despite having chances of failure at the application level, only machine level failures have been taken into consideration. It shows the incompleteness of the proposed framework. Replication approach for machine-level recovery incurred a lot amount of cost as many passive replicas are using hardware resources. Hence, the approach is not cost-effective (Azaiez and Chainbi 2016).

Lee et al. (2005) adopted a multi-agent mechanism and developed an intelligent adaptive system. Two different agents; inference agent and decision agent communicate and perform the best suitable action. The system grows automatically by adding new rules to the self-growing engine. The client module of the proposed system is shown in Fig. 3.

Fig. 3
figure 3

Multi-agent based architecture for self-adaptive system

In earlier architectures, the developers define all checkpoints. Situations that occurred in a dynamic environment are not considered at the development phase due to incapable recovery of the system. The architecture provided by Seunghwa handle such unforeseen situation with generating adaptive rules by a self-growing engine. It achieves adaptiveness in terms of optimization of resources but on a system, which is having a limited number of resources. Each resource is included as a single node in the decision tree.

Essa et al. (2017) presented Enhanced Map Reduce Agent Mobility framework for Healthcare Data Centre. It reduces redundancy and increases reliability of the distributed computing system. The approach utilizes migration recovery mechanism to recover from any machine failure. Each task is defined with a mobile agent. A mobile agent can move from one machine to another machine with code, data, and current state. Agents communicate through interfaces and identifiers as shown in Fig. 4.

Fig. 4
figure 4

EMRAM framework

Many fault-tolerant approaches use redundancy mechanisms and the same as an adaptive system for recovery phase. Redundancy is very expensive as multiple copies of the same resource are maintained. Work showed that EMRAM performed better and provides more reliability than other compared frameworks. Migration task to other machine demands more hardware resources. However, hardware resource utilization is more; the framework tries to overcome with rebalancing the load.

In multi-agent systems, agent collaboration is always an important aspect. Agent collaboration is all about how agents are coordinating, negotiating, and cooperating to achieve their shared goals (Sinha et al. 2019). Due to a dynamic environment and heterogeneity between agents, they face goals and policy conflicts. At agent community-formation level to avoid agents’ conflicts at the early stages, Fatemeh provides a taxonomy (Golpayegani 2015). At the planning level, Wang and Qingshan propose search based optimization in the software adaptive system. The framework utilizes different planning approaches. The framework provides an opportunity to use different planning approaches, as not all changes require the same amount of planning effort. It is achieved with the help of a multi-agent system (Wang and Li 2016).

Context awareness is an important aspect of achieving self-healing criteria that offers detection of contextual changes in the run-time environment (Strang and Linnhoff-Popien 2004). To detect abnormality within the system architecture numerous context observation and detection approaches have been proposed. These approaches and engineering self-adaptive systems are addressed in Strang and Linnhoff-Popien (2004), Cheng et al. (2008), Salehie and Tahvildari (2009) and De Lemos et al. (2013). The fusion of model driven and agent-oriented architecture bridged the gap between high-level and low-level models (Feyzi 2020; Stipancic et al. 2016). Meriem and Walid utilized the autonomous behaviour of agent technology and proposed an intelligent cloud infrastructure. Proposed infrastructure uses checkpoint/replication and migration fault recovery approaches for making services available (Azaiez and Chainbi 2016).

Though fault recovery has been prominent at various stages during different applications and systems build on the principle of software adaptation, effective adaptation and thereby guarantee of very quick recovery is still a challenge. Most of these challenges, as indicated in the literature, are due to the inefficiency in recovery methods, which pose huge problem for successful depiction of software adaptation to fault recovery. Hence, to address this research problem, incorporation of adaptation philosophy as one of the key principle at the time of software design phase is considered as a potential solution criteria.

5 Proposed architecture for self-healing mechanism

The observation made in the existing architectures appears to be justified due to the challenges discovered. Though some of these challenges are application-specific, at times it is necessary to address them for a more suitable and stable adaptive architecture. In this work, we are proposing a framework that provides self-healing and reliability to a software system. To reduce the cost of redundant component and hardware resource utilization, the proposed architecture is making use of rollback mechanism and migration strategy. Based on fault analysis, the system can decide to adopt the most appropriate method for recovery. Hence, to achieve the goal of self-healing, multi-agent architecture is going to be considered as a solution aspect to be adapted in the design phase of the system architecture. The strength of multi-agent architecture towards achieving autonomous functioning stimulates us to make use of it (Chainbi 2005). The framework is not only providing self-healing autonomic property but also considering resource utilization and low computational overheads.

In case of failure, the framework will categorize fault occurred and take appropriate action. The aim is to reduce time in recovery so that system can provide service with minimal interrupt. The proposed framework is depicted in Fig. 5.

Fig. 5
figure 5

Proposed multi-agent framework for self-healing

The proposed architecture has huge significance in various applications similar to the areas of ambient intelligence, humanoid/autonomous robots, autonomous software systems and smart city planners etc. where the complexity of tasks selection takeover the decisions made for providing most viable outcome (Ravulakollu et al. 2016). Applications in the context of ambient intelligence such as smart homes, where adaptation of domestic ambience against multiple functionalities with variable tasks for specific users can be provided. An inspection into case like setup of domestic ambience specific to time (home time, sleep time, family time, office time and personal time), user (child, youngster and adult) and activities (fun, study, fitness, relax, cooking etc.) can be modelling with the help of proposed architecture. The fitment of the architecture to such complex application yet limited in size can be made feasible.

5.1 Architecture description

In the proposed framework, each module comprises multiple nodes. Each node may contain one or more functionality. Functionality can be thought of service in a service-oriented architecture. Each module is having one module manager. Module manager contains a diagnosis agent and planning agent. Diagnosis agent is finding the root cause of the fault that occurred whenever it is reported by detecting the agent of any service node. Task of a planning agent is to search for the best suitable solution from a solution pool. It is utilizing a recovery mechanism to complete the task without interrupting other services. If the infected process is sharing services from different nodes, the module manager is taking care of collaboration from each node to take the system back in a healthy state. Apart from within the module, module manager is also responsible for communicating with module manager of another module if the process is utilizing services from different modules. Fault recovery is achieved at a service node by enacting the solution. Enacting agent is executing a selected solution with a set of defined action. The framework is using multi-agent system at the application level as well as at a healing level. However, in this work, we are exploiting multi-agent architecture at the healing aspect. Entire application system can be thought of as a set of agents furthermore each module defines an agent.

The proposed architecture comprises of following constituents:

Agent, \(A\): Finite non-empty set of agents

Action, α: Finite set of actions to be performed; each action belongs to at least one agent, \(\alpha \to \{ A,c,o\}\) that follow some constraints, c, and have specified outcome, o.

Task, T: Set of tasks to be completed by the system; T utilizes a set of actions to achieve a specified objective; \(\to (\alpha^{\prime},O),\) where \(\alpha^{\prime} \subseteq \alpha\). O is the objective to be achieved and can be defined as Eq. 1:

$$O = \mathop \sum \limits_{i = 1}^{{\alpha^{\prime}}} o_{i} .$$
(1)

5.1.1 Type of agents

Monitoring and detecting agent, \(M_{D}\): A set of agents called monitoring agents \(M_{D} \subseteq A\) that performs; \(M_{D} :\left| {T - T^{\prime}} \right| \to \Delta\), where \(T^{\prime}\)-current state of task T being executed. Function of \(M_{D}\) can be defined as Eq. 2:

$$\Delta = \mathop \sum \limits_{i = 1}^{{\alpha^{\prime}}} \left( {\left| {c - c^{\prime}} \right| + \left| {o - o^{\prime}} \right|} \right)$$
(2)

where \(c^{\prime}\) and \(o^{\prime}\) are the current state for corresponding action.

Diagnostic agent, \(D\): A set of agents called diagnostic agents \(D \subseteq A\) that performs; \(D:\partial (\Delta ,T^{\prime}) \leftarrow \left( {\left\{ {A|\alpha } \right\},\varepsilon } \right)\). The function will return a tuple that comprises a pair of {agent: action} and type of error generated, ε. It performs some diagnosis \(\partial\) on \(\Delta\) and \(T^{\prime}\) as shown in Eq. 3:

$$\left| {\begin{array}{*{20}c} {\left\{ {A|\alpha } \right\}} \\ \varepsilon \\ \end{array} } \right| = \partial \left( { \Delta ,T^{\prime}} \right).$$
(3)

Planning agent, \(P\): A set of agents called planning agents \(P \subseteq A\) that performs; \(:T \to \left( {\alpha^{\prime\prime},O} \right)\), where \(\alpha^{\prime\prime} \subseteq \alpha\) and \(\alpha^{\prime\prime} \ne \alpha^{\prime}\). Equation 4 shows that summative objective of \(\alpha^{\prime}\) and \(\alpha^{\prime\prime}\) are the objective of task, T:

$$O = \mathop \sum \limits_{i = 1}^{{\alpha^{\prime\prime}}} o_{i} = \mathop \sum \limits_{i = 1}^{{\alpha^{\prime}}} o_{i} .$$
(4)

Enacting agent, E: Set of agents called enacting agents \(E \subseteq A\) that performs; \(E:T^{\prime} \times \alpha^{\prime\prime} \to T\).

5.2 Module description

5.2.1 Monitoring and detecting agent (\(M_{D} : \left| {T - T^{\prime}} \right| \to \Delta\))

The monitoring module of the system collects all data relevant to define process’ current state. State of the system is defined with some pattern. Detecting agent is applying some methodology to identify the presence of a fault. Any change in the pattern is an assurance of failure. Any situation that is interrupting to achieve its task may be considered as a failure and cause of it as a fault. If any non-acceptable situation occurs, the detecting agent sends an alarm to the diagnosing agent with collected data. For failure detection, use of multiple parameters in terms of system performance and policies is made. Abnormality, which is considered, includes:

Crash: The cause of failure is due to an inaccessible or invalid operation, resource, and arguments that lead towards non-availability of the resource. It may be at operating system level if trying to perform not allowed operation. Many a time trying to access an unauthorized resource like segmentation fault which occurs due to attempting a memory block, not allowed for a specific process.

Run-time error: Run time errors are traceable with error messages. Making a call to a function with bad parameters leads to exceptions. NULL values for an operand stops a process and guide to failure. Out of memory is another example of run time error. Apart from error messages a threshold check for each parameter of the process like load time, response time, etc. is used to monitor such run time errors.

Communication connection: In a distributed system, different components are communicating and sharing information to complete a process. Any issue generated in communication can stop completing a process and crash it. Monitoring system considers a threshold on response time, acknowledgment approach to detect such problems.

Resource over allocation: Using a resource for a very long time without responding neither in the form of a result nor in the form of an error message. A threshold-based approach is here to identify those anomalies.

Throughput: If the number of tasks per unit of times is going to be zero, there is a chance of occurring a fault. To identify these faults monitor system is using component level throughput.

Functioning of the monitoring agent is depicted in Algorithm 1.

figure a

5.2.2 Diagnosing \((D:\partial \left( {\Delta ,T^{\prime}} \right) \to (\{ A|\alpha \} ,\varepsilon ))\) and planning (\((P:T (\alpha^{\prime\prime},O))\)) agent

Diagnosing module of the framework is responsible for receiving all detection information from detecting agents and perform analysis to measure out type and actual cause of failure. The module interprets low-level monitored information. The module ensures that a component requires healing after discovering the source of a fault. Different type of anomaly may have non-identical solution hence it also determines the type of fault occurred. Based on available information from detecting agent, diagnosis agent returned component, associated action, and type of anomaly. Algorithm 2 is showing the behaviour of the diagnosing agent.

figure b

The planning agent of the framework receives all diagnosed information from the diagnosing agent and sends the request to the enacting agent after finding a suitable solution to the problem. The module is responsible for answering how we recover from a faulty state. It is also selecting the best recovery action from a set of possible alternatives. A single recovery mechanism may not be feasible for different types of failures at different components. However, we have considered various recovery approaches for each category of fault. In case of crash anomaly, the system will utilize an alternative resource mechanism in which the same task will be carried out on another resource. In the presence of run-time error, the agent will prefer to provide a solution with user-defined exceptions. Communication error recovery is designed with a retry/checkpoint mechanism. The system will take advantage of the migration strategy in case of a resource over allocation situation or if the throughput of the system is zero. The working of a planning agent is shown in Fig. 6.

Fig. 6
figure 6

Multi-agent approach for self-healing

Whenever the planning agent is recommending a solution, it is also requesting feedback from the enacting agent. The feedback is providing the suitability of solution based on different evaluation criteria. Each solution is assigned a rank based on received feedback. The rank of a solution will decide the probability of choosing it in the future if same problem occurs. The planning agent evaluates each action of a faulty agent with a function \(isFaulty()\) and returns true if the current action is influenced by an error occurred. The algorithm also utilizes another function called \(substitute()\) that returns alternative or solution to the anomaly. If multiple solutions exist \(maxRank()\) will return the solution having maximum rank provided by feedback of enacting agent. Algorithm 3 depicts the working of a planning agent.

figure c

Enacting module executes the solution through the execution of actions instructed by the planning agent. The solution is provided with the required information to be implemented. It needs a set of action to be performed with respective agent information and resources available. The enacting agent will send the feedback message to the planner. If the solution can complete the task with the intended objective it will send a positive message otherwise a negative message with a negative rank will be sent. The rank of a solution depends on how efficiently the task has been achieved. Different factors like resource consumption, time, delay, etc. will decide efficiency of a solution. Working of the enacting agent is depicted in Algorithm 4 as shown below.

figure d

6 Experimentation

6.1 Experimental design

To validate proposed architecture a multi-tier agent-based application is developed. An Application is designed with an agent-oriented design approach where each module is implemented as an agent entity. In this experiment, detecting and enacting functionalities are encapsulated as an agent where with the help of various services the functionality of the system can be performed as shown in Fig. 6. The control flow mechanism within the agent is similar to the mechanism referred to in Fig. 5. Beyond the traditional task listed for the agent, the planning agent is enabled for multiple tasks such as alternate resource, retrying, checkpoint recovery, and migrating tasks to other agents based on the type of fault detected. One assumption of developed application is that it is not supporting multithreading criteria just to remove the resource utilization due to redundancy. Hence, a single thread of each agent is running at a time.

The experiment is implemented with the JADE 4.5.0 platform. Communication between agents is performed with FIPA specification by sending ACL messages. A control agent is initiating communication with other agents to perform a task and all other agents are responding with the requested information within given constraints.

6.2 Self-healing goals

The system is designed in a way to process each request and provide requested data within constraint. We identify the following objectives for self-healing:

  • Availability: It ensures the availability of information when requested from any agent. If the information is not available through normal route of resources, alternative routes or resources will help in the healing process. The goal of this quality attribute is to maximize its potential without considering the operational cost. It makes the system reliable as guaranteeing that the system is working properly.

  • Maintainability: It ensures a non-repeating fault situation. The system should be repaired in a way that the probability of reoccurring the same failure is minimized. Maintainability offers the system to increase the utilization of available resources.

  • Survivability: Assurance of making system workable during fault recovery mechanism without interrupting other processes. This quality goal increases the throughput of the system. Throughput is the process of request per unit of time.

6.3 Self-healing strategies

Self-healing strategy is a set of actions to be taken to repair the system post occurring the fault. The set of actions changes the functionality of the system to perform the intended task. Each strategy is chosen based on the fault category. The planning agent is responsible to choose which is the most suitable strategy to adopt based on diagnostic information.

The developed system is focusing on a set of faults under the category of crash, run-time error, communication failure and resource over allocation. Based on the above type of fault, the planning agent may select one of the following strategies:

  1. 1.

    Alternative resource/retry: If the requested resource is not available then the system will retry the same operation with an alternative resource. This strategy is applicable when any system-related crash is detected.

  2. 2.

    Checkpoint recovery: Some checkpoints are defined at the time of design and is utilized during run-time error occurs. The system will roll back the process from a nearby checkpoint.

  3. 3.

    Migration to other agents: If the agents are not able to interact due to communication failure or any agent is not responding in the given period, the task will be migrated to other agents that offers similar services to complete the task.

An instance of agents’ communication control flow is depicted in Fig. 7. It shows that five agents are communicating to accomplish the task without any fault occurrence. A control agent, AD looks for some services and gets the agents; AE, AF, and AA, which are providing intended services in the message0 to message6. The control agent is sending proposal messages from message7 to message9 to AE, AF, and AA to get the information. All the coordinating agents are requesting information from supporting agent, AFR through message12 to message14 that fetches the data from persistence storage. As the complete process does not face any faulty situation the information is sent back to the coordinating agents through message15, message19, and message23 and then to the control agent, AD through message16, message20, and message24.

Fig. 7
figure 7

Normal agent communication control flow without fault

Another instance of the system is shown in Fig. 8. It demonstrates the scenario where the system is dealing with crash and run time error.

Fig. 8
figure 8

Agent communication control flow for fault type; crash and run time error

The system is recovering from resource unavailability by rolling back the system with an alternative resource location. Supporting agent AFR is looking for persistence resource but not getting accessibility therefore responding in the form of REFUSE to coordinating agents AE, AF, and AA through messgae13, message15, and message17. To get the task done coordinating agents are sending the proposal message with different arguments as an alternative resource through message14, message16, and message18. AFR is getting access and providing the details through message19, messae23, and message27.

To demonstrate the migration strategy, another scenario of the same task is executed by injecting the fault; resource over-allocation, and communication failure. The two faults are injected by making the agent unavailable to serve the request. Figure 9 shows that the control agent sent a proposal message to AA in message7. However, AA could not respond within the set period hence it migrated the task of AA to AE through message12.

Fig. 9
figure 9

Agent communication control flow changes for fault type; resource over allocation and communication connection

7 Evaluation

Validation of the developed architecture is carried out using an information processing system that uses the exchange of various customer records between multiple modules used within the system. The system is customized to run this case study with limited agents and restricted functionality. The success of a transaction is determined based on the success of accurate information delivery at the final module in the system. To evaluate the architecture records of 1000 in volume are used with 5 different agents connected with sequential and uni-thread based functionality. After successful execution evaluation of the system is carried out using the following metrics:

Response time Response time is the total time taken to execute the transaction successfully if the first metric used to evaluate the performance of architecture. To calculate the response time for a request two different timestamps \((T_{A} )\) arrival time and \((T_{C} )\) completion time has been recorded and the difference is calculated as given in Eq. 5:

$$Response \, Time = T_{C} - T_{A}$$
(5)

Figure 10 demonstrates the functionality of the system in terms of response time if there is no fault occurs during the completion of task. The control agent is continuously sending a request to fetch data with the help of coordinating agents. It shows that response time, in the beginning, is very high against the response times generated later. This is due to overheads incurred to get the services registered with the platform. Once it is registered then in subsequent requests, overheads are reduced that can be observed from Fig. 10 along with constant maintenance of a similar range of response time.

Fig. 10
figure 10

Response time of the system without fault

To validate the behaviour of system, faults are injected at various stages during execution, as a supporting agent is not able to find the resources. After the successful running of a series of inputs, the agents successfully represent communication among themselves and maintain a similar level of communication irrespective of any fault occurrence. Representation of alternate resource strategy shown in earlier Fig. 8 is successfully validated and the results obtained are reflected in the following Fig. 11a. It demonstrates the scenario where the system is dealing with crash and run time error. The system is recovering from resource unavailability by rolling back the system with an alternative resource location. If the requested resource is not available then the system will retry the same operation with an alternative resource. To demonstrate the migration of task strategy, the system is run for the same number of iterations, and during iterations, faults are injected in the form of communication failure or the intended agent is not available. In all these faults, detection is performed based on the criteria of time-stamp. If any agent is not responding, in the defined period, this state will be considered as unhealthy and a recovery action is taken by migrating the task to any available agent that can perform the task. Figure 11b indicates the resource maintainability aspect in terms of availability. The graph indicates that the response time for identifying the availability of the resource is the maximum consumption part, upon the availability of resources the response time comes to its minimum. Similarly, with the number of iteration, the behaviour of response time trying to move towards the lower side of response time is visible with an association of resource allocation algorithm for greater performance.

Fig. 11
figure 11

Response time a for crash and run time error, b resource over allocation and communication failure

When it comes to the system performance, a cumulative response time that can project the strand of response time with respect to the system’s behavior needs to be measured, it can be observed from Fig. 12 that shows the mean response time with or without fault. To comprehend the system performance mean response time is measured for fault and non-fault cases. As shown in Fig. 12a for the category of resource unavailability, the variation in mean response times is in the range of 100–150 ms indicating the scope of improvement from external factor is limited. On the other hand in case of the over-allocation category as shown in Fig. 12b, the mean time varies in the range of 200–1200 ms.

Fig. 12
figure 12

Mean response time a with resource unavailability, b with resource over allocation

The variation indicates system performance in terms of fault recovery is mainly associated with identifying the resource rather than distributing the resource. Hence roll of resource allocation algorithm in this context is very essential for optimizing response time for leveraging towards non-faulty cases.

Time to recover the second metric used to evaluate the performance of architecture is time to recovery. Quantifiable performance of self-healing property is the amount of time consumed to take the system back from any degraded state to a healthy state. The time required to perform a recovery mechanism is considered as time to recover \((T_{tr} )\).

However, the self-healing mechanism considers the amount of time used to detect, plan and recover system back to normalcy as time to heal \((T_{th} )\). Time to recover is the amount that is consumed during recovery mechanism and can be defined as follows:

$$T_{tr} = T_{th} - \mu$$
(6)

where, \(T_{th}\) is the total time consumed by healing mechanism and \(\mu\) are overheads incurred due to the interaction of different components. Figure 13 shows the time consumed by the recovery step in the healing process if any fault occurs during execution. As shown in Fig. 13a which considers resource unavailability as the cause of fault, mean recovery time is recorded as 1000 ms. While in case of a resource over allocation category, 110 ms are found to be required for recovery step as depicted in Fig. 13b.

Fig. 13
figure 13

Time to recover a with resource unavailability, b with resource over allocation

Performance of the architecture recorded is a self-evaluated outcome that is derived from the application aspect. However, to benchmark the methodology it is also necessary to validate against similar architectures. Due to the difficulty in non-availability of existing architecture for evaluation, a continuation of this research is carried out in the next paper. Future work indicates not only the evaluation of proposed model against the existing but also, on insights into strategies validation and selection for agent to identify the effective path for fault recovery.

8 Conclusion

The paper attempts to address the challenges of self-healing phenomena in self-adaptive systems. The authors address the challenge by proposing a non-traditional solution that can be associated with the development of the system. A novel architecture can be incorporated during the design phase that adopts a multi-agent communication mechanism to deploy self-healing criteria. The proposed solution is detailed from the implementation perspective using three-different algorithms for planning, functioning, and enacting as critical responsibilities of the agent. With the help of agent communication control flow, the functionality of agents’ communication and collaboration in terms of addressing the faults is effectively represented and their validation is carried out to quantify the extent of achievement. To justify the success of architecture, evaluation metrics such as response time and time to recovery is measured over a sample population of 1000 entities for two different categories of faults. Based on the results obtained, it is concluded that the architecture is effective especially for crash and resource unavailability faults. On the other hand, in case of faults associated with resource over allocation and communication failure, there is scope for improvement.

Testing and validations conducted through this paper are primarily associated with the limited system that has fewer agents with fewer communications processing sequentially with single-thread approach. Through the validations are successful at this stage, there is still scope for replicating the model to larger systems with multi-level communication that has multiple threads. In addition, to make the agent perform better in terms of optimizing the healing times, usage of machine learning approaches is recommended. On a learning platform especially using reinforcement learning, there is scope for the architecture to be more effective and flexible. Another scope for future work is applicability of proposed architecture in commercial applications. For which, through an exhaustive study on existing architectural strategies is necessary followed by fusion of proposed architecture into the existing may result into a more efficient outcome.