1 Introduction

In the past several years, notable advances in mobile robotics have occurred, and roboticists have increasingly turned to Artificial Intelligence (AI) techniques to endow robots with perception, reasoning, planning and learning capabilities. Even though single agent solutions are still one step ahead of general multi-agent solutions, with the robustness inherited from distributed AI, multi-agent systems have been increasingly proposed, providing tools for the development of complex systems and mechanisms for coordination of the behavior of independent agents (Stone and Veloso 2000).

Considerable scientific work presented recently spans the boundaries of Robotics and AI, and in the particular case of multi-robot systems (MRS), these works are usually verified through multi-agent simulations or controlled test scenarios. However, recent attempts to employ robot teams in real world environments have become evident, e.g. in the works of Iocchi et al. (2011), Pippin et al. (2013), Agmon et al. (2012), and others. Successful examples of solutions that have proliferated in public places include automated guided vehicles (AGVs)Footnote 1 and the Santander Interactive Guest Assistants (SIGA),Footnote 2 and teams of multiple robots have been increasingly used in military and security applications, taking advantage of space distribution, parallelism, task decomposition and redundancy. In this context, there have been several advances in multi-robot patrolling and coverage, map learning, graph exploration and networked robots (Parker 2008).

Security is a fundamental task with unquestionable impact on society. Combining the progress witnessed at the behavior level with the technological evolution observed in the last decades, it becomes clear that robot assistance can be a valuable resource in surveillance missions. In this article, we propose a distributed multi-robot solution for indoor patrol. We believe that employing teams of robots for active surveillance tasks has several advantages over, for instance, a camera-based passive surveillance system. Robots are mobile and have the ability to travel in the field, collect environmental samples, act or trigger remote alarm systems, and inspect places that can be hard for static cameras to capture. These capabilities free human operators from executing tedious, repetitive or monotonous tasks, e.g. mine clearing or patrolling in hazardous environments, enabling them to be occupied in nobler tasks like monitoring the system from a safer location (Murphy 2004) and taking advantage of the robots’ expendability.

The proposed multi-robot patrolling solution is based on Bayesian reasoning and learning agents, which can adapt to the dynamics of the system. We first conduct simulation experiments and then validate our results in a large real world environment. We show that the approach can be applied in different types of environment, independently of their topology; withstands failures of robotic patrol units; can be performed with heterogeneous robots; and achieves superior performance. Its potential as a solution for real world MRS is therefore demonstrated.

In the next section, a literature review is conducted and the contributions of the article are described. Afterwards, the multi-robot patrolling problem is defined and the performance metric is presented. Section 4 describes the distributed multi-robot patrolling strategy proposed in this article. Section 5 presents the results obtained both through simulations and hardware experiments, as well as a discussion of the results obtained. Finally, the article ends with conclusions and open issues for future research.

2 Related work

To patrol is literally defined as “the activity of going around or through an area at regular intervals for security purposes” (Vocabulary.com 2015). The multi-robot patrolling problem (MRPP) requires coordination of the agents’ decision-making with the ultimate goal of achieving optimal group performance, while visiting every position in the environment (or at least those that need surveillance). Additionally, it aims at monitoring infrastructures, obtaining information, searching for objects, detecting anomalies and clearing areas in order to guard the grounds from intrusion. Consequently, performing a patrolling mission with a team of any given number of autonomous and cooperative robots distributed in space represents a complex challenge. We propose to solve the area patrolling problem with a team of arbitrarily many robots, focusing on intelligent strategies for coordinating the team so as to visit all the points that need surveillance inside a target area, as effectively as possible. Thus, in this context, we consider the problems of border/perimeter patrol (Elmaliach et al. 2008; Marino et al. 2013) and adversarial patrol (Basilico et al. 2009; Agmon et al. 2011a) out of the scope of this work.

Despite its high potential utility in security applications, only recently has the MRPP been rigorously addressed. The environment to patrol is commonly abstracted through a navigation graph (cf. Fabrizi and Saffiotti 2000), and numerous works explore graph theory tools like spanning trees (Fazli et al. 2013; Gabriely and Rimon 2001) or graph partitioning (Sak et al. 2008; Stranders et al. 2012) to compute minimal-cost cycles that assign efficient routes to each robot in the patrolling mission. Auctions and market-based coordination are also popular in the multi-robot patrolling literature, as in the case of Pippin et al. (2013) and Poulet et al. (2012), where agents bid to exchange vertices of the patrol graph to increase overall patrol performance. Numerous other concepts have already been explored, such as task allocation (Sempé and Drogoul 2003), artificial forces (Sampaio et al. 2010; Jansen and Sturtevant 2008), Gaussian processes theory (Marino et al. 2012), evolutionary algorithms (Aguirre and Taboada 2012), linear programming modeling (Keskin et al. 2012), and others.

Most contributions to the literature focus on presenting strategies to patrol any given environment with a team of mobile agents, the majority verified through simulations. Nevertheless, several important theoretical contributions to the MRPP have been presented and, using the idleness criterion (cf. Sect. 3), it has been shown that the problem is NP-hard (Chevaleyre 2004; Ruan et al. 2005). Based on a topological representation of the environment and using global/centralized information, it has been proved that optimal patrolling can be obtained for the single robot case by solving the Traveling Salesman Problem (TSP) (Chevaleyre 2004), a classical computationally complex problem. As for the multi-agent situation, the problem is even more complex and optimal performance depends on the environment topology and the number of robots. However, the literature commonly reports that theoretically superior performance can be obtained either by optimal k-way graph partitioning, especially for a high number of agents or graphs with unbalanced edges (Pasqualetti et al. 2012; Agmon et al. 2011b); or by having all robots follow the same TSP cycle, equally distributed in time and space, especially for a low number of agents or balanced graphs (Chevaleyre 2004; Smith and Rus 2010). Both the TSP and the graph partitioning problem are NP-hard. Therefore, solving these problems is non-trivial, particularly in sparse topologies, which is the case of most real world environments. For instance, TSP cycles may not even exist in such situations (recurrent in non-complete graphs) and the tour must be reduced to a minimum-cost closed walk in the graph (Portugal et al. 2014). Additionally, these solutions lead to predefined trajectories for all robots, which commonly cannot adapt to non-standard situations (e.g. robot failures or heterogeneous agents with different speed profiles) without specific recovery behaviors. Furthermore, such recovery behaviors need to be triggered by a centralized entity with global knowledge about the system. Another disadvantage is that deterministic patrols allow an external observer to easily apprehend the trajectory pattern. Thus, we turn to probabilistic strategies with distributed intelligence and autonomous decision-making robots. Besides being non-deterministic and much less predictable, these strategies do not have a central point of failure and can potentially scale to larger teams.

The creation of adaptive behaviors that allow agents to learn how to effectively patrol a given scenario has shown to be extremely promising. Moreover, such adaptability allows the agents to cope with changes in the system or in the scenario. Nevertheless, the use of such techniques is far from straightforward. Certain works in this field have adopted machine learning methods. For instance, the pioneering approach of Santana et al. proposed to model the MRPP as a Q-learning problem in an attempt to enable automatic adaptation of the agents’ strategies to the environment (Santana et al. 2004). In brief, agents have a probability of choosing an action from a finite set of actions, with the goal of maximizing a long-term performance criterion, in this case, node idleness. Two reinforcement learning techniques, using different communication schemes, were implemented and compared to non-adaptive architectures. Although not always scoring the best results, adaptive strategies were generally superior to the other compared strategies in most of the simulation experiments reported. The main attractive characteristics of this work are distribution (no centralized coordination is assumed) and the adaptive behavior of agents, which is usually highly desirable in this domain. Lauri and Koukam (2014) built upon the work of Santana et al. by redefining the state space and reward function associated with each agent, and were able to improve the method in specific situations, depending on graph topology and number of agents. Similarly, Ishiwaka et al. (2003) proposed reinforcement learning to predict the location of teammates as well as the movement direction towards a common target. In another surveillance application scenario, Ahmadi and Stone (2006) proposed an adaptive solution for area sweeping tasks based on decentralized partitioning and learning behaviors. This was verified in a small experimental arena with two AIBO robots endowed with cameras. However, as the robots had a wide field of view, the focus was not primarily on planning exhaustive surveillance routes, but rather on continuously moving the agents to positions from which they could observe at a distance.

As an alternative to reinforcement learning and adaptive behaviors, some strategies have followed stochastic approaches that benefit from probabilistic decision-making to overcome the deterministic nature of classical patrolling applications. For instance, in Marier et al. (2010), the patrolling problem is cast as a multi-agent Markov decision process, where reactive and planning-based techniques are compared. The authors concluded that both perform similarly, with the latter being slightly superior in general, since it looks further ahead than the former, which is purely local. However, the reactive technique runs much faster, suggesting that a simple and computationally cheaper approach can be used in many applications, instead of more complex strategies which only perform slightly better. Similarly to reactive agents, swarm-based solutions to the patrolling problem have been studied in Yanovski et al. (2003), Baglietto et al. (2009) and Cannata and Sgorbissa (2011), where it is assumed that agents have limited computational power, communication abilities and memory storage. Interesting collective patrol behavior emerges using these types of approaches.

In this article, a new distributed and adaptive approach for multi-robot patrol is presented. Each robot decides its local patrolling moves online, without requiring any central planner. Decision-making is based upon Bayesian reasoning on the state of the system, considering the history of visits and teammates’ actions, so as to promote effective coordination between patrolling agents. Experimental results illustrate the advantages of using the proposed distributed technique, when compared to several state of the art strategies in simulation, and the well-known TSP solution in real world results.

2.1 Contributions

To summarize, the contributions of the work to the state of the art are as follows:

  • We propose a new probabilistic, distributed and scalable approach to solve the MRPP, whose effectiveness is attested in the experiments conducted;

  • We employ Bayesian decision-making together with adaptive learning in the context of the MRPP for the first time, to the best of our knowledge, yielding intelligent patrolling behavior within the robot team;

  • We show the advantages of using the proposed algorithm by assessing its performance and effectiveness using realistic simulations in Stage/ROS (Quigley et al. 2009), and comparing these results with those reported in Portugal and Rocha (2013a) where several patrolling approaches were tested in the same conditions;

  • We implement the approach in a large indoor scenario with up to six physical robots, and compare it with the classical multi-agent TSP approach, which is theoretically optimal using one robot and nearly-optimal for generic teams of robots;

  • Fault-tolerance and scalable behaviors are verified in real world experiments, as well as the possibility to use the approach in heterogeneous teams of robots.

In the past, we have shown that naive Bayesian decision models have the potential to tackle the patrolling problem, by presenting methods based on conditional probability distributions and without agent learning behaviors (Portugal and Rocha 2013b). In this work, we intend to go much further by presenting a new strategy that rectifies weaknesses previously identified. We describe a probabilistic multi-robot patrolling strategy, where a team of concurrent learning agents adapt their moves to the state of the system at the time, using Bayesian decision rules based on the robot’s accumulated experience and distributed intelligence. When patrolling a given site, each agent evaluates the context and adopts a reward-based learning technique that influences future moves. Learning of the likelihood distributions involved in the robots’ decisions is done online, along the mission, and robots are constantly updating their prior information, i.e. their past experience.

This article is a culmination of two preliminary works (Portugal et al. 2013a, b), yet it clearly distinguishes itself from the previous publications by presenting an extended literature survey and a more complete Bayesian model, which includes vertex lookahead (cf. Sect. 4.4) in order to solve the MRPP using multi-agent concurrent reward-based learning. In addition, the implementation of a fully distributed system is described, carrying out completely new experiments both in simulation and in a real world indoor infrastructure and discussing in detail the implementation aspects. Also, unlike the two previous works, we assess through real world experiments the adaptability, scalability and fault-tolerant nature of the proposed approach.

3 Problem definition

In this article, the problem of efficiently patrolling inside a given area with an arbitrary number of robots is studied. Agents are assumed to have an a priori representation of the environment and, through a graph extraction algorithm (Portugal and Rocha 2013c), they are able to acquire an undirected, connected and metric navigation graph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\), which enables them to assess the topology of their surroundings. \(\mathcal {G}\) is composed of vertices \(v_{i}\in \mathcal {V}\) and edges \(e_{i,j}\in \mathcal {E}\). Each vertex represents a specific location that must be visited regularly and each edge represents the connectivity between these locations, having a weight \(|e_{i,j}|\) defined by the metric distance between \(v_i\) and \(v_j\). \(|\mathcal {V}|\) and \(|\mathcal {E}|\) represent the cardinalities of the sets \(\mathcal {V}\) and \(\mathcal {E}\), respectively.
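
For concreteness, the following minimal Python sketch shows one possible in-memory representation of the navigation graph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) described above; the class and method names are illustrative and are not part of the original implementation.

from collections import defaultdict

class NavGraph:
    """Undirected metric navigation graph G = (V, E); names are illustrative."""
    def __init__(self):
        self.adj = defaultdict(dict)        # adj[v_i][v_j] = |e_{i,j}| (metric distance)

    def add_edge(self, vi, vj, weight):
        # G is undirected, so the weight is stored in both directions
        self.adj[vi][vj] = weight
        self.adj[vj][vi] = weight

    def neighbors(self, v):
        return list(self.adj[v].keys())     # open neighborhood N_G(v)

    def degree(self, v):
        return len(self.adj[v])             # deg(v)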

Informally, a good strategy is one that minimizes the time lag between two passages to the same place, for all places. Thus, the MRPP can be reduced to coordinating robots so that they frequently visit all \(v_{i} \in \mathcal {G}\), ensuring the absence of atypical situations with regard to a predefined optimization criterion.

In order to address and compare the performance of different patrolling algorithms, it is important to establish an evaluation metric. Diverse criteria have been previously proposed to assess the effectiveness of multi-robot patrolling strategies. Typically, these are based on the idleness of the vertices (Almeida et al. 2004), the frequency of visits (Agmon et al. 2011b) or the distance traveled by agents (Iocchi et al. 2011). In this work, the first one has been considered, given that it measures the elapsed time since the last visit from any agent in the team to a specific location. The idleness metric uses time units, which is particularly intuitive, e.g. in the analysis of how long vertices have not been visited for, or comparing different patrolling strategies. In the following equations, we define important variables used in the remaining sections of the article.

The instantaneous idleness of a vertex \(v_{i}\in \mathcal {V}\) in time step t is defined as:

$$\begin{aligned} \mathcal {I}_{v_i}(t) = t - t_l, \end{aligned}$$
(1)

wherein \(t_l\) corresponds to the last time instant when the vertex \(v_{i}\) was visited by any robot of the team. Consequently, the average idleness of a vertex \(v_{i}\in \mathcal {V}\) in time step t is defined as:

$$\begin{aligned} \overline{\mathcal {I}_{v_i}} (t) = \frac{ \overline{\mathcal {I}_{v_i}} (t_l)\cdot C_i + \mathcal {I}_{v_i}(t)}{C_i+1}, \end{aligned}$$
(2)

where \(C_i\) represents the number of visits to \(v_i\). Considering now \(\overline{\mathcal {I}_{\mathcal {V}}}\) as the set of the average idlenesses of all \(v_i \in \mathcal {V}\), given by:

$$\begin{aligned} \overline{\mathcal {I}_{\mathcal {V}}}(t) = \{ \overline{\mathcal {I}_{v_1}}(t), \ldots , \overline{\mathcal {I}_{v_i}}(t), \ldots , \overline{\mathcal {I}_{v_{|\mathcal {V}|}}}(t) \}, \end{aligned}$$
(3)

the maximum average idleness of all vertices \(\max (\overline{\mathcal {I}_{\mathcal {V}}})\) in time step t is defined as:

$$\begin{aligned} \max (\overline{\mathcal {I}_{\mathcal {V}}})(t) = \max \{\overline{\mathcal {I}_{v_1}} (t),\ldots , \overline{\mathcal {I}_{v_i}} (t),\ldots ,\overline{\mathcal {I}_{v_{|\mathcal {V}|}}} (t)\}. \end{aligned}$$
(4)

For simplicity of notation, let us omit (t) whenever timing is not relevant. Finally, in order to obtain a generalized measure, the average idleness of the graph \(\mathcal {G}\) (\(\overline{\mathcal {I}_{\mathcal {G}}}\)) is defined as:

$$\begin{aligned} \overline{\mathcal {I}_{\mathcal {G}}} = \frac{1}{|\mathcal {V}|}~ \sum \limits _{i=1}^{|\mathcal {V}|}\overline{\mathcal {I}_{v_i}}. \end{aligned}$$
(5)

A similar assumption to other works in the literature (cf. Chevaleyre 2004; Almeida et al. 2004) is taken in the beginning of the experiments, where for all \(v_{i} \in \mathcal {V}\), \(\mathcal {I}_{v_i}(0) = 0\), as if every vertex had just been visited at the beginning of the mission. Consequently, there is a transitory phase in which the \(\overline{\mathcal {I}_{\mathcal {G}}}\) values tend to be low, not corresponding to the reality in steady-state, as will be seen in the results section. For this reason, the final \(\overline{\mathcal {I}_{\mathcal {G}}}\) value is measured only after convergence in the stable phase.
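
A hedged sketch of the idleness bookkeeping defined in Eqs. (1), (2), (4) and (5) is given below; it assumes the averages are updated at visit instants, and all names are illustrative rather than taken from the article.

class IdlenessMonitor:
    """Tracks instantaneous and average idleness per vertex (Eqs. 1-5)."""
    def __init__(self, vertices):
        self.last_visit = {v: 0.0 for v in vertices}    # t_l; I_v(0) = 0, as assumed above
        self.avg = {v: 0.0 for v in vertices}           # running average idleness (Eq. 2)
        self.visits = {v: 0 for v in vertices}          # C_i

    def instantaneous(self, v, t):
        return t - self.last_visit[v]                   # Eq. (1)

    def register_visit(self, v, t):
        idl = self.instantaneous(v, t)
        c = self.visits[v]
        self.avg[v] = (self.avg[v] * c + idl) / (c + 1) # Eq. (2)
        self.visits[v] = c + 1
        self.last_visit[v] = t

    def max_avg(self):
        return max(self.avg.values())                   # Eq. (4)

    def graph_avg(self):
        return sum(self.avg.values()) / len(self.avg)   # Eq. (5)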

Considering a patrol path as an array of vertices of \(\mathcal {G}\), the multi-robot patrolling problem with an arbitrary team of R robots can be described as the problem of finding a set of R paths \({\mathbf x}\) that visit all vertices \(v_{i}\in \mathcal {V}\) of \(\mathcal {G}\), with the overall team goal of minimizing \(\overline{\mathcal {I}_{\mathcal {G}}}\):

$$\begin{aligned} f =\arg \!\min _{{\mathbf x}}(\overline{\mathcal {I}_{\mathcal {G}}}), \end{aligned}$$
(6)

By finding:

$$\begin{aligned} \mathbf {x} = \{x_1,\ldots , x_r,\ldots , x_R \}, \end{aligned}$$
(7)

Such that:

$$\begin{aligned}&x_{r} = \{v_{a}, v_{b}, \ldots \},\\&v_a, v_b, \ldots \in \mathcal {V},\nonumber \\&1 \le r \le R, R \in \mathbb {N},\nonumber \end{aligned}$$
(8)

Subject to:

$$\begin{aligned} \forall v_{i} \in \mathcal {V}, \exists x_{r} \in {\mathbf x}: v_{i} \in x_{r}. \end{aligned}$$
(9)

Note that \(x_{r}\) represents the patrolling path of robot r, which has an arbitrary dimension that depends on each robot’s decisions, and \(v_a, v_b, \ldots \) are vertices that are successively connected in \(\mathcal {V}\). The union of all R paths should be equal to the full set of vertices \(\mathcal {V}\), and different paths \(x_r, x_s\) may have common vertices. In this work, the route \(x_{r}\) of each robot is computed online by each autonomous agent in order to adapt to the system’s needs.

4 Bayesian model for multi-robot patrolling

In a previous work of the authors (Portugal and Rocha 2013b), simple preliminary Bayesian-based techniques to tackle the MRPP were studied. Even though the results obtained were satisfactory, two main drawbacks were identified: a uniform prior distribution was adopted, assuming that all decisions were equiprobable; and the likelihood distributions were immutable, representing a fixed function of random variables.

In this work, robots are endowed with increased intelligence, since the previous Bayesian-inspired models are extended with likelihood reward-based learning and continued prior updating. More specifically, the model represents the decision of moving from one vertex of the graph to another. For \(\beta \) neighbors of the current vertex \(v_0\), where \(\beta =deg(v_0)\),Footnote 3 the model is applied \(\beta \) times. Each decision is considered independent and the agents have the ability to choose the action with the greatest expectation of utility, weighted by the effects of all possible actions. Consequently, each robot’s patrol route is built progressively, at each decision step, adapting to the system’s needs, i.e. aiming at minimizing \(\overline{\mathcal {I}_{\mathcal {G}}}\). In the following subsections, the Concurrent Bayesian Learning Strategy (CBLS) to solve the MRPP is detailed, and the pseudo-code of the approach, running on each individual robot of the team, is presented in Algorithm 1.

Algorithm 1 Pseudo-code of the CBLS approach, running on each individual robot of the team
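
As a complement to Algorithm 1, the sketch below outlines the main loop that each robot could run; the callables passed as arguments stand for the components detailed in Sects. 4.1 to 4.4 and are placeholders for illustration, not the authors’ actual interfaces.

def cbls_loop(robot, monitor, choose_next_vertex, broadcast, update_rewards):
    # One instance of this loop runs on each robot; there is no central planner.
    v0 = robot.current_vertex()
    while robot.mission_active():
        vi, posterior = choose_next_vertex(v0)       # Bayesian decision over the neighbors of v0
        broadcast(robot.id, v0, vi)                  # announce current and intended vertex (Sect. 4.3)
        robot.goto(vi)                               # travel along edge e_{0,i}
        monitor.register_visit(vi, robot.now())      # update idleness statistics (Sect. 3)
        update_rewards(v0, vi, posterior)            # reward-based likelihood update (Sect. 4.2)
        v0 = vi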

4.1 Distribution modeling

As stated before, when reaching a vertex \(v_0\) of the navigation graph \(\mathcal {G}\), each robot is faced with a decision stage, where it must decide the direction it should travel next (cf. Fig. 1). To that end, two fundamental random variables are defined. The first one is boolean and simply represents the act of moving (or not) to a neighbor vertex:

$$\begin{aligned} move_i = \lbrace true, false \rbrace , \end{aligned}$$
(10)

while the second one is called arc strength \(\theta _{0,i}\), which represents the suitability of traveling to a neighbor \(v_i\) using the edge that connects \(v_0\) to \(v_i\):

$$\begin{aligned} \theta _{0,i} \in \mathbb {R}; \quad 0,i \in \mathbb {N}_0. \end{aligned}$$
(11)

Fig. 1 Illustration of a patrolling decision instance. In this example, the robot may choose to visit one out of 4 neighbor vertices. Considering a generic neighbor \(v_x\), \(\mathcal {I}_{v_x}\) is the instantaneous idleness of \(v_x\); \(\overline{\mathcal {I}_{v_x}}\) is the average idleness of \(v_x\); \(e_{0,x}\) is the edge that connects the current vertex \(v_0\) and \(v_x\), with an edge weight \(|e_{0,x}|\); \(C_x\) gives the number of visits to \(v_x\) and \(\theta _{0,x}\) represents the arc strength of traveling from \(v_0\) to \(v_x\)

Note that in \(\mathcal {G}\), an edge \(e_{j,k}\) represents a connection from \(v_j\) to \(v_k\) and vice versa. An edge \(e_{j,k}\) has an edge cost or weight \(|e_{j,k}|=|e_{k,j}|\) given by the distance between the two vertices. Thus, by definition \(\mathcal {G}\) is undirected, since \(e_{j,k}=e_{k,j}\). Nevertheless, the term “arc” is used intentionally for this new variable, since it is affected by the direction of travel. In a situation where an agent is at \(v_j\), it will look at the suitability of traveling to \(v_k\), given by \(\theta _{j,k}\). Under those circumstances, the suitability of traveling in the opposite direction is not relevant, and \(\theta _{j,k} \ne \theta _{k,j}\). As a consequence, the set of all \(\theta _{i,j}\) variables, denoted as \(\varvec{\theta }\), has \(2|\mathcal {E}|\) elements, where \(|\mathcal {E}|\) is the cardinality of the set of edges \(\mathcal {E}\) of \(\mathcal {G}\); informally, higher values of arc strength lead to the edge being traversed more often in the specified direction. Note that this variable is chosen such that the collective patrolling strategy can adapt to any environment topology.

In this work, agents compute the degree of belief (i.e. a probability) of moving to a vertex \(v_i\), given the arc strengths, by applying Bayes rule:

$$\begin{aligned} P(move_i|\theta _{0,i}) = \frac{ P(move_i) P(\theta _{0,i}|move_i)}{ P(\theta _{0,i}) }. \end{aligned}$$
(12)

The posterior probability \(P(move_i|\theta _{0,i})\) is estimated via Bayesian inference from the prior \(P(move_i)\) and likelihood \(P(\theta _{0,i}|move_i)\) distributions. The denominator term is regarded as a normalization factor (Jansen and Nielsen 2007), being often omitted for the sake of simplicity.

The prior represents the belief obtained from analyzing past data. Naturally, in the MRPP, prior information about each vertex is encoded in the average idleness \(\overline{\mathcal {I}_{v_i}}\) of a vertex \(v_i\) given by (2). Therefore, \(P(move_i)\) is defined as:

$$\begin{aligned} P(move_i) = \frac{\overline{\mathcal {I}_{v_i}}}{ \sum \limits _{k=1}^{\beta }\overline{\mathcal {I}_{v_k}} }, \end{aligned}$$
(13)

thus decisions of moving to vertices with higher average idleness intuitively have higher probability. During the patrol mission, robots are continuously visiting new places and the \(\overline{\mathcal {I}_{\mathcal {V}}}\) values change over time. Each agent computes these values internally by tracking its own visits to \(\mathcal {V}\) and receiving messages from other teammates when they arrive at a new vertex. In order to make an informed decision, at each decision step, the agent updates the prior information through (13), just before applying (12) to obtain a degree of belief of moving to a neighbor vertex \(v_i\).
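
A possible implementation of the prior of Eq. (13) is sketched below; the uniform fallback for the degenerate case in which all average idlenesses are zero (e.g. at the very start of the mission) is an assumption, and all names are illustrative.

def prior_move(neighbors, avg_idleness):
    # neighbors: the beta adjacent vertices of v_0; avg_idleness[v]: current average idleness of v
    total = sum(avg_idleness[v] for v in neighbors)
    if total == 0.0:
        return {v: 1.0 / len(neighbors) for v in neighbors}   # uniform fallback (assumption)
    return {v: avg_idleness[v] / total for v in neighbors}    # Eq. (13)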

In addition to the prior distribution, it is also necessary to define the likelihood through a statistical distribution that models the arc strength \(\theta _{0,i}\). In the patrolling problem, agents must visit all \(v_i \in \mathcal {G}\); thus, theoretically, assigning a uniform value to every arc would not be unreasonable. However, in such a dynamic system, where the number of visits to different locations in the environment is permanently evolving, it is usually advantageous to avoid traversing certain edges at a given time and to favor the use of others, in order to improve performance. For example, it is highly undesirable that robots move along the same edges, as this is inefficient from the collective patrolling point of view. Also, moving in the same regions causes inter-robot interference (e.g. robots take longer to reach their goals if they have to avoid collisions with each other). Therefore, robots should coordinate themselves properly. Furthermore, task effectiveness is strongly related to the environment topology.

Hence, in the next subsection, a reward-based learning strategy to model and continually update the likelihood distribution is proposed in order to adapt to the system’s state according to previous decisions, having a high impact on the behavior of robots and aiming at optimizing the collective performance.

4.2 Multi-agent reward-based learning

In general, reward-based learning methods are quite attractive since agents are programmed through rewards and punishments without explicitly specifying how the task is to be achieved (Panait and Luke 2005). In this work, Bayesian learning is employed to estimate the likelihood functions. Since this is a cooperative multi-robot task with no centralized control, with decentralized and distributed information and asynchronous computation, multiple simultaneous learners (one per patrolling agent) are involved.

The concept of delayed reward with a 1-step horizon model is explored. Each agent chooses an action of moving from \(v_0\) to a neighbor \(v_i\), based on (12). After reaching \(v_i\), the information on its neighborhood has changed, namely the instantaneous idlenesses have been updated, i.e. \(\mathcal {I}_{v_i}(t) = 0\) and \(\mathcal {I}_{v_0}(t) > 0\). Using the information observed after making the move, a reward-based mechanism is used to penalize or reward the arcs involved in the decision to move from \(v_0\) to \(v_i\). This influences upcoming moves starting in \(v_0\), by introducing a bias towards arcs which ought to be visited in the future.

The reward-based learning method is explained next. When the robot decides which one of the \(\beta \) neighbor vertices of \(v_0\) is going to be visited next, each neighbor \(v_i\) will have an associated degree of belief given by the posterior probability. Therefore, it is possible to calculate the entropyFootnote 4:

$$\begin{aligned} H (move|\varvec{\theta }) = - \sum \limits _{i=1}^{\beta } P(move_i|\theta _{0,i})\,\log _2 (P(move_i|\theta _{0,i})), \end{aligned}$$
(14)

which measures the degree of uncertainty involved in the decision taken, being chosen for this reason as the basis for the punish/reward mechanism. The confidence on the decision taken is inversely proportional to the entropy H. Therefore, we use the entropy to quantify the rewards and penalties to assign. In this way, larger rewards and penalties are awarded to decisions with higher confidence (lower entropy). Note, however, that distinct \(v_i\) have different \(deg(v_i)\) and, as a result, \(\beta \) varies for each decision instant. Therefore, the entropy is normalized to assume values in [0, 1]:

$$\begin{aligned} \mathcal {H} (move|\varvec{\theta }) = \frac{H (move|\varvec{\theta })}{\log _2 (\beta ) } . \end{aligned}$$
(15)
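
The normalized entropy of Eqs. (14) and (15) can be computed as in the sketch below; returning zero for a forced move (\(\beta = 1\)) is an assumption made to avoid a division by zero, and the names are illustrative.

import math

def normalized_entropy(posterior):
    # posterior: dict mapping each of the beta neighbors to P(move_i | theta_{0,i})
    beta = len(posterior)
    if beta <= 1:
        return 0.0                                            # forced move: no uncertainty (assumption)
    h = -sum(p * math.log2(p) for p in posterior.values() if p > 0.0)   # Eq. (14)
    return h / math.log2(beta)                                # Eq. (15), values in [0, 1]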

After deciding and moving to a given \(v_k\), the robot computes rewards for each arc between \(v_0\) and its neighbor vertices \(v_i\) (including \(v_k\)) involved in the previous decision, using:

$$\begin{aligned} \gamma _{0,i} = \mathcal {S}_{i,0}(C_{i},\mathcal {I}_{\mathcal {V}}(t)) \cdot (1-\mathcal {H}(move|\varvec{\theta })), \end{aligned}$$
(16)

with:

$$\begin{aligned} \mathcal {S}_{i,0} \in \{-1,0,1\}. \end{aligned}$$
(17)

\(\mathcal {S}_{i,0}\) gives the reward sign, providing a quality assessment which determines whether a penalty (\(\mathcal {S}=-1\)), a reward (\(\mathcal {S}=1\)) or a neutral reward (\(\mathcal {S}=0\)) should be given. As can be seen, this piecewise function uses up-to-date information, namely the number of visits to \(v_i\), given by \(C_i\), and the set of current instantaneous idlenesses \(\mathcal {I}_{\mathcal {V}}(t)\). The sign of \(\mathcal {S}\) is obtained using the set of heuristic rules defined below, which are checked as soon as the agent reaches \(v_k\). For that purpose, it is first necessary to define the normalized number of visits to any neighbor vertex \(v_i\):

$$\begin{aligned} \zeta _i = \frac{C_i}{\deg (v_i)}. \end{aligned}$$
(18)

This is used in the punish/reward procedure given that vertices with higher degree are naturally more visited than vertices with lower degree, being often traversed to reach isolated vertices that tend to have a lower number of visits. The rules for assigning the sign \(\mathcal {S}_{i,0}\) of the rewards are given by the following piecewise function definition:

$$\begin{aligned}&\mathcal {S}_{i,0}(C_{i},\mathcal {I}_{\mathcal {V}}(t))\nonumber \\&\quad =\left\{ \begin{array}{ll} -1, &{} \mathbf{if }~ (\beta > 1) \wedge (\displaystyle \mathtt{argmax}_{j\in N_\mathcal {G}(v_0)} \zeta _j = i) \, \wedge \\ &{} (|\displaystyle \mathtt{argmax}_{j\in N_\mathcal {G}(v_0)} \zeta _j| = 1) \\ \\ -1, &{} \mathbf{if }~ (\beta > 1) \wedge (\displaystyle \mathtt{argmax}_{j\in N_\mathcal {G}(v_0)} \zeta _j = i) \, \wedge \\ &{} ( \forall k,\, |\mathtt{argmax}_{j \in N_{\mathcal {G}}(v_0)} \zeta _j| = k \rightarrow \mathcal {I}_{v_i}(t) \le \mathcal {I}_{v_k}(t))\\ \\ 1, &{} \mathbf{if }~ (\beta > 1) \wedge (\displaystyle \mathtt{argmin}_{j\in N_ \mathcal {G}(v_0)} \zeta _j = i) \, \wedge \\ &{} (|\displaystyle \mathtt{argmin}_{j\in N_\mathcal {G}(v_0)} \zeta _j| = 1)\\ \\ 1, &{} \mathbf{if }~ (\beta > 1) \wedge (\displaystyle \mathtt{argmin}_{j\in N_\mathcal {G}(v_0)} \zeta _j = i) \, \wedge \\ &{} ( \forall k,\, |\mathtt{argmin}_{j \in N_{\mathcal {G}}(v_0)} \zeta _j| = k \rightarrow \mathcal {I}_{v_i}(t) \ge \mathcal {I}_{v_k}(t)) \\ \\ 0, &{} \mathbf{otherwise }. \\ \end{array}\right. \end{aligned}$$
(19)

with:

$$\begin{aligned} \beta = \deg (v_0) = |N_\mathcal {G}(v_0)|, \end{aligned}$$
(20)

\(N_\mathcal {G}(v_0)\) represents the open neighborhood of \(v_0\), i.e. the set of adjacent vertices of \(v_0\). As such, the assignment of the sign \(\mathcal {S}_{i,0}\) respects the following criteria:

  • \(\mathcal {S}_{i,0} = -1\), when the degree of \(v_0\) is higher than one (\(\beta >1\)) and the normalized number of visits to \(v_i\) (\(\zeta _i\)) is maximal in the neighborhood of \(v_0\). In case there is more than one vertex with maximal \(\zeta \), a negative reward is given to the one with lower instantaneous idleness \(\mathcal {I}_{v_j}(t)\) between those.

  • \(\mathcal {S}_{i,0} = 1\), when the degree of \(v_0\) is higher than one (\(\beta >1\)) and the normalized number of visits to \(v_i\) (\(\zeta _i\)) is minimal in the neighborhood of \(v_0\). In case there is more than one vertex with minimal \(\zeta \), a positive reward is given to the one with higher instantaneous idleness \(\mathcal {I}_{v_j}(t)\) between those.

  • \(\mathcal {S}_{i,0} = 0\), in every other situation that differs from the above.

These rules guarantee that, when there is more than one vertex involved in the decision, exactly one reward and one penalty are assigned.
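
The sketch below illustrates the reward-sign rules of Eq. (19), evaluated for each neighbor \(v_i\) of \(v_0\) and following the three criteria listed above; the function and argument names are illustrative only.

def reward_sign(vi, neighbors, zeta, inst_idleness):
    # vi: neighbor being evaluated; neighbors: N_G(v_0);
    # zeta[v] = C_v / deg(v) (Eq. 18); inst_idleness[v] = I_v(t)
    if len(neighbors) <= 1:
        return 0                                              # beta = 1: neutral reward
    max_z = max(zeta[v] for v in neighbors)
    min_z = min(zeta[v] for v in neighbors)
    argmax = [v for v in neighbors if zeta[v] == max_z]
    argmin = [v for v in neighbors if zeta[v] == min_z]
    if vi in argmax and inst_idleness[vi] == min(inst_idleness[v] for v in argmax):
        return -1                                             # penalty: most visited neighbor
    if vi in argmin and inst_idleness[vi] == max(inst_idleness[v] for v in argmin):
        return 1                                              # reward: least visited neighbor
    return 0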

At the beginning of the mission, when \(t=t_0\), all arc strengths \(\theta _{0,i}\) are equal to a real positive number \(\kappa \):

$$\begin{aligned} \forall \theta _{0,i} \in \varvec{\theta }, \, \theta _{0,i}(t_0) = \kappa . \end{aligned}$$
(21)

As the mission evolves, the agent updates \(\theta _{0,i}\) every time it visits a new vertex through:

$$\begin{aligned} \theta _{0,i}(t) = \theta _{0,i} (t-1) + \gamma _{0,i} (t). \end{aligned}$$
(22)

Note that the larger the value of \(\kappa \) is set in (21), the less immediate influence the rewards received will have on \(\theta _{0,i}\). In all experimental tests conducted in this work, \(\kappa =1.0\) was used. This reward-based procedure is expected to make the values of \(\theta _{0,i}\) fluctuate as time goes by, informing robots of moves which are potentially more effective, but keeping in mind that robots must visit all vertices \(v_i\) in the patrolling mission.

Finally, the learnt likelihood distribution is obtained through normalization of \(\theta _{0,i}\):

$$\begin{aligned} P(\theta _{0,i}|move_i) = \frac{\theta _{0,i}}{\sum \limits _{j}^{|\mathcal {E}|}\sum \limits _{k}^{|\mathcal {E}|} \theta _{j,k}}, \end{aligned}$$
(23)

being updated at each decision step and making use of experience acquired in the past for future decisions.
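
The arc-strength update of Eqs. (16), (21) and (22) and the normalization of Eq. (23) can be sketched as follows; it assumes the dictionary theta is pre-populated with all \(2|\mathcal {E}|\) directed arcs initialized to \(\kappa =1.0\), and the names are illustrative.

KAPPA = 1.0                                                   # kappa in Eq. (21)

def init_arc_strengths(adjacency):
    # adjacency[v] lists the neighbors of v; every directed arc starts at kappa (Eq. 21)
    return {(vj, vk): KAPPA for vj in adjacency for vk in adjacency[vj]}

def update_arc(theta, v0, vi, sign, norm_entropy):
    gamma = sign * (1.0 - norm_entropy)                       # Eq. (16)
    theta[(v0, vi)] += gamma                                  # Eq. (22)

def likelihood(theta, v0, vi):
    return theta[(v0, vi)] / sum(theta.values())              # Eq. (23): normalization over all arcs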

4.3 Decision-making and multi-agent coordination

Having described how agents learn their likelihood distribution, it is necessary to address agent coordination to completely characterize the CBLS solution for patrolling tasks presented in this article. Being a concurrent learning approach, each agent adapts its behavior via its own learning process and has no control over or knowledge of how other agents behave, nor of their internal state, i.e. agents do not know their teammates’ likelihood distribution \(P(\theta _{0,i}|move_i)\) and do not predict their teammates’ moves. This reduces the complexity of the problem; however, it is necessary to guarantee the coordination of robots.

In collective missions with a common goal, multi-agent coordination plays a fundamental role in the success of the mission. Particularly in this context, it is highly undesirable that agents move to the same positions. Thus, the asynchronous and distributed communication system used to inform teammates of the current vertex \(v_0\) is augmented with the information of the vertex \(v_i\) chosen for the next move.

Therefore, by exchanging simple messages with teammates that only contain the robot ID r, the current vertex \(v_0\), and the next chosen vertex \(v_i\), each robot can update its information about the state of the system, namely the idleness values, and decide its moves taking that information into account, as well as its progressively acquired experience. When agents are close by, they can coordinate by checking whether a teammate has already expressed the intention to move to a given vertex \(v_i\) in the local neighborhood and, if so, removing it from the decision.

Finally, the decision-making process of the agent consists of choosing the move from \(v_0\) to the neighbor vertex \(v_j\) with the maximum probability among all possible decisions:

$$\begin{aligned} move_j = true : j= \underset{i \in N_{\mathcal {G}}(v_0)}{\displaystyle \mathtt{argmax}}\,P(move_i|\theta _{0,i}) \end{aligned}$$
(24)
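
Combining the posterior with the teammate intentions received through the messages described above, the decision step of Eq. (24) can be sketched as below; falling back to the full neighborhood when every neighbor has been claimed by a teammate is an assumption, and all names are illustrative.

def decide_next(neighbors, posterior, teammate_intentions):
    # neighbors: N_G(v_0); posterior[v] = P(move_v | theta_{0,v});
    # teammate_intentions: vertices announced as "next vertex" by nearby teammates
    candidates = [v for v in neighbors if v not in teammate_intentions]
    if not candidates:
        candidates = list(neighbors)                          # fallback (assumption)
    return max(candidates, key=lambda v: posterior[v])        # Eq. (24)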

4.4 Incorporating vertex look-ahead

Up to now, we have only considered agents that decide their moves based on knowledge of the local 1-step neighborhood. While it is true that expanding the search horizon increases the complexity of the problem, it is also true that it may lead to a higher number of correct decisions, since agents deliberate with additional information.

Therefore, we also propose to study a variation of the model presented before, by incorporating vertex look-ahead. Note however that, in this context, each agent is an independent concurrent learner and does not exchange its local beliefs about the system, only its current vertex and the intended vertex to visit in its immediate neighborhood. As a consequence, if agents plan too far ahead, their intentions may come into conflict, i.e. they may be planning their moves to inadvertently reach the same sites, which would be highly inefficient from the standpoint of the patrolling mission. The probability of such conflicts increases with the number of steps in the planning horizon. For this reason, and to manage the complexity of the system, we explore the concept of 2-step look-ahead, aiming to further improve the performance of the CBLS approach.

Fig. 2 Illustration of the 2-step look-ahead in the navigation graph

Prior to making a decision, agents will now look not only at the average idleness of the vertices \(v_i\) in their immediate neighborhood, but also at the neighbors of these (\(v_j, v_k,\ldots , v_{|N_\mathcal {G}(v_i)|}\)), in order to find the maximum average idleness among them, as shown in Fig. 2.

Accordingly, vertex look-ahead is incorporated in the prior distribution, by modifying (13):

$$\begin{aligned} P(move_i) = \frac{\omega \cdot \overline{\mathcal {I}_{v_i}}+(1-\omega )\cdot \max (\overline{\mathcal {I}_{N_\mathcal {G}(v_i)}})}{ \sum \limits _{k=1}^{|\mathcal {V}|}\omega \cdot \overline{\mathcal {I}_{v_k}} + (1-\omega )\cdot \max (\overline{\mathcal {I}_{N_\mathcal {G}(v_k)}})}, \end{aligned}$$
(25)

where:

$$\begin{aligned}&\omega \in [0,1], \end{aligned}$$
(26)
$$\begin{aligned}&\max (\overline{\mathcal {I}_{N_\mathcal {G}(v_i)}}) = \max \{ \overline{\mathcal {I}_{v_j}}, \overline{\mathcal {I}_{v_k}},\ldots , \overline{\mathcal {I}_{v|N_\mathcal {G}(v_i)|}} \}, \end{aligned}$$
(27)
$$\begin{aligned}&v_j, v_k,\ldots , v_{|N_\mathcal {G}(v_i)|} \in N_\mathcal {G}(v_i). \end{aligned}$$
(28)

The factor \(\omega \) assigns weights to the observations on the immediate neighborhood and on the 2-step neighborhood. When \(\omega =1.0\), no look-ahead is considered and (25) is equivalent to (13). The lower the \(\omega \) value, the higher the weight of observations beyond the 1-step neighborhood. However, intuitively, \(\omega \) should be set to values above 0.5 (and below 1.0), because the information on the immediate 1-step neighborhood is always reliable and up-to-date when the decision is made, while the information on the 2-step neighborhood may change by the time the robot reaches \(v_i\). Note also that, when reaching \(v_i\), the robot faces another decision instance and, in rare cases, may decide contrary to the initial plan, in case the settings in its surroundings have changed in the meantime. The influence of the \(\omega \) factor is analyzed in the preliminary simulation results presented in the next section, so as to establish an appropriate value to use in the experiments with physical robots.
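
The 2-step look-ahead prior of Eq. (25) could be implemented as in the sketch below; for simplicity it normalizes only over the candidate neighbors, which changes the scale of the prior but not the ranking used in the decision, and the names are illustrative.

def prior_move_lookahead(neighbors, avg_idleness, adjacency, omega=0.75):
    # neighbors: N_G(v_0); adjacency[v]: N_G(v); avg_idleness[v]: current average idleness of v
    def score(v):
        ahead = max(avg_idleness[u] for u in adjacency[v])    # max over N_G(v), Eq. (27)
        return omega * avg_idleness[v] + (1.0 - omega) * ahead
    total = sum(score(v) for v in neighbors)
    if total == 0.0:
        return {v: 1.0 / len(neighbors) for v in neighbors}   # uniform fallback (assumption)
    return {v: score(v) / total for v in neighbors}           # Eq. (25), up to normalization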

5 Results and discussion

In this section, experimental results to assess the performance of CBLS are presented and discussed. Section 5.1 reveals the outcome of simulation experiments in three environments with distinct graph connectivity, enabling an analysis of the effect of \(\omega \) on the mission and a comparison of our approach with other state of the art multi-robot patrolling strategies.

In the following subsection, experimental results with teams of physical robots in a large indoor scenario are discussed. CBLS is compared to the TSP cyclic algorithm, which finds the shortest closed walk on the graph by solving the TSP (Fazli et al. 2013) in non-complete graphs such as the one presented in the experiments. This benchmarking algorithm is optimal for the single robot case and near-optimal for multi-robot scenarios, being well suited for a comparative analysis.

5.1 Simulation results

In the preliminary simulation experiments, the main goal is to study the effect of the \(\omega \) parameter in the vertex look-ahead method and to enable comparisons with other strategies in the literature. To that end, the three environments illustrated in Fig. 3 have been used to test the approach with different team sizes of \(R = \{1, 2, 4, 6, 8, 12\}\) robots. The three illustrated topologies present different algebraic connectivity or Fiedler value \(\lambda \) (Fiedler 1973), a well-known metric of the connectivity of a graph given by the smallest non-zero eigenvalue of the graph’s Normalized Laplacian matrix. These topologies were used in Portugal and Rocha (2013a), where they were classified as: lowly (A), mildly (B) and highly (C) connected, having a Fiedler value of \(\lambda _A=0.0080\), \(\lambda _B=0.0317\) and \(\lambda _C=0.1313\), respectively. In this work, these are again adopted to enable comparative analysis against other MRPP strategies. Note, for example, that the graph of environment A (\(\mathcal {G}_A\)) has several dead-ends, i.e. vertices with degree 1. On the other hand \(\mathcal {G}_C\) is the most connected of the three, with several vertices with degree 4.

Fig. 3 Environments used in the experiments, with the respective topological maps

While collecting results in different scenarios, the same simulation setup was used for all strategies.

A recognized simulator with realistic modeling was chosen: the Stage 2D multi-robot simulator (Vaughan 2008). Stage models the robots’ dynamics and, together with ROS (Quigley et al. 2009), which was adopted to program the robots, was used to implement the experiments.

The graph information of the environment is loaded by every robot at the beginning of each simulation, which then runs the described algorithm. Robots navigate safely in the environment by heading towards their goals while avoiding collisions, through the use of the ROS navigation stack and an adaptive Monte Carlo localization approach. Additionally, all robots have non-holonomic constraints and move at the same nominal speed, traveling at a maximum velocity of 0.2 m/s.

All the experiments conducted in this article were repeated three times for each setting, and they all respect the same stopping condition, determined by 4 complete patrolling cycles, i.e. after every \(v_i \in \mathcal {G}\) has been visited at least four times. This stopping condition proved to be adequate since \(\overline{\mathcal {I}_\mathcal {G}}\) always converged, i.e. all results were obtained in the steady-state phase of the patrolling mission, as shown by the marginal mean difference in \(\overline{\mathcal {I}_\mathcal {G}}\) values between the third and fourth cycles of all experiments, which was around 2.5 %.

Tables 1, 2 and 3 present the mean performance results of the distributed patrolling strategy CBLS, described in this article, given by \(\overline{\mathcal {I}_\mathcal {G}}\) in seconds, using the environments of Fig. 3 with \(\omega =1.0\), \(\omega ={3}/{4}\) and \(\omega ={2}/{3}\). The same results are presented in more detail for each individual case in Tables 13, 14, 15, 16, 17, 18, 19, 20, 21, in the Appendix. The results confirm the intuition that superior performance can be obtained with \(0.5 < \omega < 1.0\), since, in general, the best results were obtained for \(\omega ={3}/{4}\). It is also interesting to verify that the results without look-ahead (\(\omega =1.0\)) are always inferior to those obtained with vertex look-ahead. This confirms that looking beyond the local neighborhood has the potential to increase the number of correct decisions of each agent and improve team performance. Being a complex problem, variables such as graph connectivity and team size have an important impact on the performance of every patrolling strategy. Thus, it was anticipated that no single fixed value of \(\omega \) would be the best for all configurations. Nevertheless, in the experiments conducted, \(\omega ={3}/{4}\) was superior in 14 configurations out of 18. In view of this, we consider \(\omega ={3}/{4}\) in the experiments with physical robots.

Table 1 Final \(\overline{\mathcal {I}_{\mathcal {G}}}\) values (in seconds) using CBLS in environment A with different \(\omega \)
Table 2 Final \(\overline{\mathcal {I}_{\mathcal {G}}}\) values (in seconds) using CBLS in Environment B with different \(\omega \)
Table 3 Final \(\overline{\mathcal {I}_{\mathcal {G}}}\) values (in seconds) using CBLS in Environment C with different \(\omega \)
Table 4 Final \(\overline{\mathcal {I}_{\mathcal {G}}}\) values (in seconds) using different state of the art strategies on Environment A
Table 5 Final \(\overline{\mathcal {I}_{\mathcal {G}}}\) values (in seconds) using different state of the art strategies on Environment B
Table 6 Final \(\overline{\mathcal {I}_{\mathcal {G}}}\) values (in seconds) using different state of the art strategies on Environment C

Using findings from our previous works, Tables 4, 5 and 6 were built. Therein, tests in the same three environments and benchmarking with several state of the art patrolling approaches were conducted, and we were able to compare the performance of CBLS against 7 other state of the art approaches,Footnote 5 using the \(\overline{\mathcal {I}_\mathcal {G}}\) metric. For each different strategy, all the robots started in the same positions when executing a patrolling mission in a given environment and with a given team size. This was done to promote a fair comparison between all strategies. For further details on the various strategies and tests previously conducted, the interested reader should refer to Portugal and Rocha (2013a, b).

Table 7 Cases where the \(\mathcal {I}_\mathcal {G}\) values of the CBLS algorithm were statistically significantly different from those of the other algorithms (bold), and not statistically significantly different (italic)
Fig. 4 Evolution of the likelihood distribution in a mission with 2 robots in Environment A, with \(\omega =0.75\). a robot 1 after 200 decisions; b robot 1 at the end of the mission; c robot 2 after 200 decisions; d robot 2 at the end of the mission. The red line represents the initial distribution when the mission started (Color figure online)

In general, CBLS clearly outperforms the other 7 state of the art approaches. This is the case even when no look-ahead is considered. The difference is apparent especially for the lowly (environment A) and mildly (environment B) connected environments. In the highly connected grid environment (map C), vertices generally have greater degree and there are usually many alternative routes to reach a given goal. This makes the other state of the art strategies perform better in this case than in less connected (more typical) environments. Nevertheless, CBLS presents a strong performance for the smaller team sizes of 1, 2 and 4 robots, and outperforms the other strategies for the larger team sizes of 6, 8 and 12 robots, which suggests that CBLS scales better than the remaining strategies. In fact, CBLS is superior to all other approaches in 83.33 % of the configurations tested. Additionally, these results show that CBLS is able to adapt to all kinds of environment topologies, independently of team size.

A factorial Analysis of Variance (ANOVA) was conducted aiming to examine the main effects and interactions between environments (environment A, B and C), team size (1, 2, 4, 6, 8 and 12 robots) and algorithm (CR, HCR, HPCC, CGG, MSP, GBS, SEBS and CBLS) on the performance value given by \(\mathcal {I}_\mathcal {G}\). A confidence level of 95 % was used to identify significant differences between factor levels. Prior analysis revealed that there were no significant outliers, i.e. no major data stood below the first quartile or above the third quartile in the dataset, and the data were normally distributed across all groups, as revealed by the Shapiro-Wilk’s test (\(p<0.05\)).

The analysis revealed that there is a main effect of the environment on the idleness values (\(F(2, 426)=12308.678\), \(p<0.001\), partial \(\eta ^2=0.989\)), given that these values are significantly higher in environment A when compared to environments B and C. Furthermore, the team size has a main effect on the idleness value (\(F(5,426)=27051.872\), \(p<0.001\), partial \(\eta ^2=0.998\)), since the idleness values decreased as more robots were used. More importantly, there was also a clear effect of the algorithm on the idleness values (\(F(7,426)=244.655\), \(p<0.001\), partial \(\eta ^2=0.858\)), which explains the differences obtained with each of the patrolling strategies tested. Pairwise comparisons between all algorithms across all combinations of team size and environment revealed that, in the case of the CBLS algorithm, the idleness values were significantly lower (\(p<0.001\)) than for all the other algorithms, except in specific cases in Environments B and C, as shown in Table 7. In these specific cases, especially in Environment C, the differences were not significant. However, descriptive statistics still show that the idleness values of CBLS were lower in most of these cases.

Moreover, there was a statistically significant interaction effect between team size and algorithm towards the idleness values (\(F(14,426)=43151.575\), \(p<0.001\), partial \(\eta ^2=0.827\)). As expected, pairwise comparisons between team size groups revealed that in all algorithms the idleness value significantly decreased as the team size increased. There was also a statistically significant interaction effect between the environment and algorithm towards the idleness values (\(F(35,426)=13.665\), \(p<0.001\), partial \(\eta ^2=0.627\)). Pairwise comparisons between environments revealed that in all algorithms the idleness value significantly decreased in environment C compared to environment A.
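
For reproducibility, a factorial ANOVA of this kind could be run, for instance, with Python and statsmodels as sketched below; the file name and column names are illustrative, since the underlying dataset is not distributed with the article.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.read_csv("patrolling_results.csv")                  # one row per experimental run (illustrative)
model = ols("idleness ~ C(env) * C(team_size) * C(algorithm)", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)                 # main effects and interaction terms
print(anova_table)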

Fig. 5 Overall results running CBLS with \(\omega ={3}/{4}\) and different team sizes

Looking now at the CBLS results in more detail, the evolution of the likelihood function in a patrolling mission with two robots in environment A and \(\omega ={3}/{4}\) is illustrated in Fig. 4. Note that each robot learns a different distribution depending on its acquired experience, and has no control over or knowledge of the internal state of its teammates. As expected, peaks in the histograms emerge with the increasing number of decisions. Despite that, it is also clear that values fluctuate around the initial uniform value (represented by the red line in each chart), which comes as a consequence of robots having to visit every vertex \(v_i \in \mathcal {G}\).

Fig. 6 Topological map of the “ISR-Floor0” environment

Moving on to the performance of the algorithm, the boxplot charts in Fig. 5 represent the \(\overline{\mathcal {I}_\mathcal {V}}\) values, in seconds, for each tested team size on all three maps. The average value is represented by a black cross, providing a generalized measure: the average graph idleness, \(\overline{\mathcal {I}_\mathcal {G}}\) (cf. Eq. 5). The ends of the blue boxes and the horizontal red line in between correspond to the first and third quartiles and the median values of \(\overline{\mathcal {I}_\mathcal {V}}\), respectively.

As expected, the idleness values decrease as the number of robots grows. Despite the increasing performance displayed by the CBLS approach, the individual contribution of adding more robots gradually diminishes with team size. Group productivity will eventually converge for a large R. In theory, productivity should grow as the team size scales up; however, spatial limitations increase the number of times the robots meet and, beyond a given R, it is argued that they will spend more time avoiding each other than effectively patrolling on their own.

Another interesting aspect illustrated in the boxplots of Fig. 5 is the fact that the median \(\widetilde{\mathcal {I}_\mathcal {G}}\) is close to the mean \(\overline{\mathcal {I}_\mathcal {G}}\) in all configurations, being usually lower. This means that the \(\overline{\mathcal {I}_\mathcal {V}}\) values are positively skewed, i.e. most of the values are below the average, \(\overline{\mathcal {I}_\mathcal {G}}\), and as a consequence, most outliers are above the third quartile.

On a more general note, visual inspection of the trajectories of robots using CBLS showed that the patrolling routes change over time so as to adapt to the system’s needs, making it difficult to predict a robot’s behavior, as opposed to most strategies presented in Portugal and Rocha (2013a, b). This is in fact one of the key advantages of using a distributed online approach, as the system continuously adapts even to simple events that occur during the course of the mission, e.g. a robot taking longer to reach a vertex due to an unexpected obstacle, occasional navigation recovery behaviors that delay visits to the following vertices, robots traveling at different speeds (cf. Sect. 5.2.2), robot failures (cf. Sect. 5.2.3), etc. The paths of each patrolling unit will continuously change, as robots have no physical restriction to move within the whole navigation graph, in order to maintain a minimum graph idleness according to the model presented.

This phenomenon, together with the promising results obtained, demonstrates the effectiveness of the approach and its potential to be applied in actual security systems with physical teams of robots.

5.2 Experiments with physical robots

Multi-robot patrol is mainly a practical problem and, in order for distributed intelligence systems to be useful in the real world, it is essential to go beyond simulation experiments and validate solutions that demonstrate the reliability of the proposed strategy in more demanding scenarios. In this section, the implementation of a system for multi-robot patrol in a large indoor environment is presented. Fully autonomous agents decide their patrol routes locally and sequentially according to the state of the system, as previously described, validating the CBLS distributed approach. Beyond the coordination which arises from the distributed communication of agents, it is also shown that the approach is scalable, robust to robot failures, i.e. fault-tolerant, and supports heterogeneous agents with different speed profiles.

Fig. 7 Robots used in the experiments with the CBLS algorithm at ISR

Experiments were performed in a large indoor scenario, namely floor 0 of the Institute of Systems and Robotics (ISR) at the University of Coimbra. Figure 6 shows the topological map extracted on top of the 67.85 \(\times \) 26.15 m environment, which was obtained using the algorithm in Portugal and Rocha (2013c). The resulting topology is a non-complete, connected and sparse graph, like most real world environments.

When conducting experiments in the real world, one must overcome noisy sensor readings, localization issues and even robot failures, which are usually ignored or not precisely modeled in simulation experiments. Therefore, a team of six Pioneer-3DX robots (Robotics 2006) was used, each equipped with a Hokuyo URG-04LX-UG01 laser with 5.6 m of maximum scanning range and a laptop on top, as seen in Fig. 7. Each laptop runs the ROS navigation stack using the Adaptive Monte Carlo Localization (AMCL) algorithm, as done previously in the Stage simulations (cf. Sect. 5.1), and is responsible for controlling the robot’s motion, which reaches speeds of up to 1 m/s. As for communication, a distributed publish/subscribe mechanism has been used, due to its built-in integration in ROS. Moreover, each robot runs its own ROS master node (roscore), and multimaster communication is provided by the wifi_commFootnote 6 package. This means that there is no central point of failure in the system. While communication over a mobile ad hoc network deployed by the robots themselves would be possible, in these experiments we leveraged the existing WiFi infrastructure in the building.
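
To illustrate the kind of message exchange enabled by this setup, the rospy sketch below broadcasts a robot’s goal decision and listens for teammates’ announcements. The topic name, message type and field layout are hypothetical and given only for illustration; the actual interface belongs to the patrolling software, with wifi_comm relaying topics between the individual ROS masters.

    #!/usr/bin/env python
    # Minimal sketch: each robot announces the vertex it is heading to and
    # listens to its teammates. Topic and message layout are illustrative only.
    import rospy
    from std_msgs.msg import Int16MultiArray

    def teammate_callback(msg):
        robot_id, current_goal, next_goal = msg.data
        rospy.loginfo("Robot %d intends to visit vertex %d, then %d",
                      robot_id, current_goal, next_goal)
        # here the local idleness estimates / teammate intentions would be updated

    if __name__ == "__main__":
        rospy.init_node("patrol_comm_sketch")
        pub = rospy.Publisher("/patrol/intentions", Int16MultiArray, queue_size=10)
        rospy.Subscriber("/patrol/intentions", Int16MultiArray, teammate_callback)
        rospy.sleep(1.0)                            # wait for connections
        pub.publish(Int16MultiArray(data=[1, 12, 25]))  # robot 1 -> vertex 12, then 25
        rospy.spin()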

We examine not only the average graph idleness along time, \(\overline{\mathcal {I}_{\mathcal {G}}}\), but also the median \(\widetilde{\mathcal {I}_{\mathcal {G}}}\), the standard deviation \(\sigma \), and the maximum average idleness of a vertex along time, \(\max (\overline{\mathcal {I}_{\mathcal {V}}})\). At the beginning of each test, the graph of the environment is loaded by every robot. A ROS node (i.e. a ROS application) is responsible for advertising the start of the mission and collecting results during the experiments. These results are examined in the next sections. Note that this “monitor” node neither centralizes the approach nor gives any feedback to the robots; it is merely used in this work to allow the extraction of experimental results.

Table 8 Experiments with 1 to 6 robots using CBLS (all values in seconds)
Table 9 Experiments with 1 to 6 robots using TSP cycle (all values in seconds)
Fig. 8 Robot positions during an experiment with 6 robots using CBLS. Implicit formation of dynamic regions

Fig. 9 Evolution of the absolute reward values along three experiments with different team sizes

Firstly, we conducted experiments with teams of one to six robots using the proposed strategy and compared it with the TSP Cyclic Strategy. After this comparative analysis, CBLS was tested by adding one, and then two, agents with a different speed profile to the patrolling mission, in order to show that the approach adapts well even when heterogeneous teams of robots are used. In addition, to demonstrate its robustness, experiments that included robot failures at different time instants were analyzed. Finally, we carried out additional simulation experiments under the same conditions as the experiments with physical robots, in order to evaluate how realistic the simulation framework used throughout this work is. Whenever CBLS is adopted in these experiments, \(\omega = {3}/{4}\) is considered. The total estimated distance traveled by the robots during the course of all experiments was 50 km (\(\simeq \) 31 miles).

5.2.1 Scalability and benchmarking

In this subsection, we focus on the scalability of the approach and analyze its performance when compared to the TSP Cyclic Strategy. As mentioned before, the latter is a recognized classical approach for multi-robot patrolling which finds the minimal tour that visits every vertex of the graph. This tour can be obtained using heuristic methods like the Chained Lin-Kernighan algorithm (Applegate et al. 2003). The approach is theoretically optimal for the single robot case; in the multi-robot case, robots are equally spaced along the tour and the approach provides a near optimal solution. Note also that the deterministic route is computed a priori and offline, in contrast to our approach, where agents have the autonomy to decide their own moves online, and that it assumes strictly homogeneous robots, without supporting different speed profiles or robot failures, as opposed to CBLS. In the particular case of the environment used in the experiments (cf. Fig. 6), the TSP cycle tour follows the clear rectangular cycle pattern that exists in the map, with short detours to the vertices with \(deg(v_i)=1\), namely \(v_i \in \{0, 1, 13, 16, 25, 29, 30\}\).
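
Since the navigation graph is not complete, such a tour is typically computed on its metric closure (all-pairs shortest path distances) before applying a TSP heuristic. The sketch below illustrates the general procedure with a much simpler nearest-neighbour plus 2-opt heuristic; it is not the Chained Lin-Kernighan solver used in this work, and is given only as an illustration.

    import itertools
    import networkx as nx

    def tsp_cycle(G, weight="weight"):
        """Approximate minimal closed tour visiting every vertex of graph G
        (illustration only; not the solver used in the paper)."""
        # metric closure: shortest path distance between every pair of vertices
        dist = dict(nx.all_pairs_dijkstra_path_length(G, weight=weight))
        nodes = list(G.nodes)
        # nearest-neighbour construction
        tour, remaining = [nodes[0]], set(nodes[1:])
        while remaining:
            nxt = min(remaining, key=lambda v: dist[tour[-1]][v])
            tour.append(nxt)
            remaining.remove(nxt)
        # simple 2-opt improvement
        def length(t):
            return sum(dist[t[i]][t[(i + 1) % len(t)]] for i in range(len(t)))
        improved = True
        while improved:
            improved = False
            for i, j in itertools.combinations(range(1, len(tour)), 2):
                cand = tour[:i] + tour[i:j][::-1] + tour[j:]
                if length(cand) < length(tour):
                    tour, improved = cand, True
        return tour

In practice, consecutive tour vertices that are not adjacent in the graph are connected through their shortest path, yielding the closed route that the robots then follow with uniform spacing.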

Using the team of Pioneer-3DX robots, experiments with team sizes from 1 to 6 robots were performed in the “ISR-Floor0” scenario. Each experiment was repeated 3 times and the results are presented in Tables 8 and 9.

An initial analysis shows that the performance values given by \(\overline{\mathcal {I}_{\mathcal {G}}}\) are generally close for both approaches in the configurations tested. As expected, in the single robot case the optimal TSP cycle outperforms CBLS (by \(\simeq 9.5\,\%\)). Nevertheless, considering that CBLS is running with a single learning agent limited to knowledge of its 2-step neighborhood, the result obtained is encouraging. Theoretically, a larger horizon would improve CBLS results in the particular single robot case: there is no other concurrent learner in the system and, consequently, no interference with the agent’s long term plans, enabling it to look further ahead without the risk of regretting its decisions.

Perhaps the most interesting aspect of these results is the performance attained by the proposed strategy with multi-robot teams. In the experiments conducted, the results obtained with CBLS were slightly better than TSP cycle, with differences in performance of up to \(\simeq 3.5\,\%\). However, one cannot claim that CBLS is superior to TSP for MRS without conducting more tests. In general, both strategies perform similarly in the collected results, with TSP even presenting slightly superior results, of around \(\simeq 2.5\,\%\), in the 4-robot case.

The tests with teams of a growing number of robots show that the approach is able to scale well, performing similarly to TSP cycle in a near optimal way. This is remarkable, considering the distributed and non-deterministic nature of the approach, as opposed to TSP cycle. A careful look at the behavior of the robots shows that they tend to create dynamic regions which each agent patrols more often. Robots have no physical boundaries to restrict them; however, the fact that our model takes the teammates’ decisions into consideration results in the robots avoiding the areas of their teammates. Thus, the patrolling strategy tends to implicitly create these regions, especially in sparsely connected environments, as clearly shown in Fig. 8. As a result, there is little interference between agents.Footnote 7 In addition, since robots only share their current and future immediate goals, the bandwidth requirements are negligible even with larger teams.

Table 10 Experiments with a team of 6 physical robots using CBLS extended with one and two virtual agents (all values in seconds)
Fig. 10 Overview of the results with teams of physical robots and mixed teams of real and virtual robots, using CBLS

Table 11 Experiments with a team of 6 physical robots with one and two failures during the mission (all values in seconds)

It is clear in Table 9 that, by making the robots follow the same global route, TSP cycle always presents lower values of standard deviation, \(\sigma \), than CBLS, promoting more uniform visits to vertices. This is also suggested by the maximum average idleness of the vertices, \(\max (\overline{\mathcal {I}_{\mathcal {V}}})\), whose values are smaller than in CBLS.

Another interesting aspect observed in the experiments is that the median value, \(\widetilde{\mathcal {I}_{\mathcal {G}}}\), is usually lower than the mean \(\overline{\mathcal {I}_{\mathcal {G}}}\) for CBLS, especially with greater team sizes, thus confirming the simulation results. In contrast, for TSP cycle, all \(\widetilde{\mathcal {I}_{\mathcal {G}}}\) values are higher than \(\overline{\mathcal {I}_{\mathcal {G}}}\), meaning that the distribution is negatively skewed and most of the values are above the average.

It is also worth mentioning another important disadvantage of TSP in general: the computation of a TSP cycle may become computationally intensive or even prohibitive in large graphs.

Looking more closely at CBLS in the experiments with physical robots, the absolute reward values, given by the \((1-\mathcal {H})\) factor in (16), show a descending trend along the experiments with different configurations. Figure 9 illustrates how these values evolve in missions with three different team sizes. Despite occasional peaks, the values tend to decrease with the number of decisions. This is because, in general, as the system progresses, the \(\overline{\mathcal {I}_\mathcal {V}}\) values of different vertices become more balanced and, as a consequence, the degrees of belief in moving to distinct neighbors come closer together. The closer the posterior probabilities are, the higher the entropy becomes, and therefore the reward values gradually decrease. The peaks observed correspond to situations where agents share nearby areas, temporarily perturbing the \(\overline{\mathcal {I}_\mathcal {V}}\) values in the neighborhood of other agents. For that reason, peaks are more noticeable in larger teams.
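
To make this trend more concrete, the sketch below computes the absolute reward under the assumption that \(\mathcal {H}\) in (16) is the Shannon entropy of the posterior move distribution over a robot’s neighboring vertices, normalized by its maximum; the normalization choice and the example probabilities are illustrative, not taken from the paper.

    import numpy as np

    def absolute_reward(posteriors):
        """Reward (1 - H) from the posterior probabilities of moving to each
        neighbouring vertex. H is assumed here to be the Shannon entropy
        normalized by log2(n), so the reward lies in [0, 1]."""
        p = np.asarray(posteriors, dtype=float)
        p = p / p.sum()
        n = len(p)
        nz = p[p > 0]                       # ignore zero-probability moves
        h = -np.sum(nz * np.log2(nz))       # Shannon entropy (bits)
        h_norm = h / np.log2(n) if n > 1 else 0.0
        return 1.0 - h_norm

    # Early in the mission, idleness values differ widely -> peaked posterior
    print(absolute_reward([0.8, 0.1, 0.1]))    # noticeably higher reward (about 0.42)
    # Later, idleness values become balanced -> near-uniform posterior
    print(absolute_reward([0.36, 0.33, 0.31])) # reward close to 0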

5.2.2 Heterogeneous teams

In the previous section, the number of robots R was limited by the physical robots available. However, the distributed patrolling method used supports an arbitrarily large team size. In this section, the aim is to further explore the scalability of the approach and test it with teams of heterogeneous robots.

Fig. 11 Evolution of the idleness along time using CBLS with teams of six physical robots with failures. a Failure at 300 s. b Failure at 150 s. c Failure at 150 and 250 s. d Without failure (for reference)

Being a distributed strategy composed of concurrent learning agents, the team should adapt not only to the system’s state but also to different robot profiles. Hence, virtual agents running in the Stage simulator were added to the physical team, resulting in a mixed, interacting team of real and simulated robots which communicate seamlessly. It is noteworthy that adding virtual simulated agents to the physical team of robots was only made possible by the hardware abstraction layer of ROS and its modular structure.

Virtual agents have the same properties as those used in the simulation experiments; thus, they travel slower than the physical robots. Three trials were conducted with a total of 7 agents, composed of 6 physical robots and 1 simulated robot, and three more trials were performed with a team size of 8, composed of 6 physical robots and 2 simulated ones. Similarly to Iocchi et al. (2011), the software layer is used unchanged both on the real robots and in simulation.

Results in Table 10 show that the overall values of \(\overline{\mathcal {I}_{\mathcal {G}}}\), \(\max (\overline{\mathcal {I}_{\mathcal {V}}})\), \(\widetilde{\mathcal {I}_{\mathcal {G}}}\) and \(\sigma \) are in line with expectations, following the trend shown in Table 8. The additional virtual agents are integrated into the rest of the team and are able to interact with their teammates. In addition, their contribution to the global performance is minor, as expected. This happens not only because of the progressive decrease of the individual contribution of each robot as team size grows, but also due to the lower speed at which these robots travel. The boxplot of Fig. 10 illustrates this trend.

It is noteworthy that incorporating robots that travel at different speeds into strategies that solve the MRPP with predefined routes, such as TSP cycle, would not be suitable, because maintaining a uniform distance between robots would not be possible unless all robots were limited to the speed of the slowest one.

5.2.3 Robustness/fault-tolerance

One of the main advantages of providing the patrol robots with the means to decide their own moves in the environment is the absence of a centralized coordinator, which would represent a critical point of failure. A distributed autonomous robotic system, such as the one presented herein, enables redundancy, remaining functional even if some of the agents fail.

To demonstrate the robustness of the approach, three experiments using the Pioneer 3-DX robots were planned. In these experiments, a robot is shut down at different instants of time, aiming to study the effect of failures on the overall performance, as well as how the system evolves.

In all three tests, the team starts with six robots. In the first test, a robot is shut down 300 s into the experiment. Similarly, in the second experiment, a robot is shut down after 150 s. Finally, in the third experiment, one robot is shut down after 150 s and a second robot is shut down after 250 s. The other robots assume that a teammate has failed when no message has been received from it within 2 min.
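
A failure detector of this kind can be realized with a simple watchdog over the last time a message was received from each teammate, using the 2-minute threshold mentioned above. The sketch below is an illustration only, with hypothetical names, not the actual implementation.

    import time

    FAILURE_TIMEOUT = 120.0   # seconds of silence before a teammate is presumed failed

    class TeammateWatchdog:
        def __init__(self, teammate_ids):
            now = time.time()
            self.last_heard = {r: now for r in teammate_ids}
            self.failed = set()

        def on_message(self, robot_id):
            """Call whenever any message from 'robot_id' is received."""
            self.last_heard[robot_id] = time.time()
            self.failed.discard(robot_id)   # illustrative: a silent robot may reappear

        def check(self):
            """Call periodically: mark teammates as failed after FAILURE_TIMEOUT."""
            now = time.time()
            for r, t in self.last_heard.items():
                if now - t > FAILURE_TIMEOUT:
                    self.failed.add(r)
                    # from now on, r's intentions are ignored in the decision process
            return self.failed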

Table 11 indicates that the results obtained in the first two experiments resemble those obtained with five robots. In these experiments, the average idleness values converged after the failure occurred. Figure 11 shows the evolution of the average vertex idleness in the three experiments. For reference, the evolution of the average vertex idleness without failure was also added (cf. Fig. 11d). In the first experiment, since approximately half of the mission was spent with six robots, the final performance was slightly better and the overall mission time \(\tau \) was shorter than in the second experiment, i.e. it took longer in the second test to fulfill the mission because the failure occurred earlier.

Table 12 Additional simulation experiments with 1 to 6 robots using CBLS in the “ISR-Floor0” environment (all values in seconds)

The two failures in the third experiment greatly influenced the final performance results in that test. However, it is also clear that extra time would be needed for the results to converge to higher values, as the final average graph idleness \(\overline{\mathcal {I}_{\mathcal {G}}}\) was much lower than in the 4-robot case. Nevertheless, one can verify that, in all three cases, the values of \(\overline{\mathcal {I}_{\mathcal {G}}}\) and \(\widetilde{\mathcal {I}_{\mathcal {G}}}\) tend to increase a while after the failure occurs.

The evidence drawn from these results shows the robustness of the system, demonstrating that it degrades gracefully as long as at least one robot remains operational.

Discussing now the hypothetical event of communication failures: when a message is not received by a robot, it does not update the instantaneous idleness values and, consequently, it maintains incomplete information about the state of the system. This information becomes increasingly incomplete as the number of undelivered messages grows. Additionally, when robots are close to each other and messages are not received, they may decide to patrol nearby locations and interfere with their teammates’ plans. Resolving such situations depends heavily on each robot’s local planner and its ability to avoid dynamic obstacles; in our work, this is handled by the ROS navigation stack. Therefore, we expect CBLS to be robust to communication failures and to only slightly degrade its performance when the communication error rate is moderate.
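
The effect of missed messages can be seen directly in how each robot maintains its local view of the vertex idleness: entries are only refreshed when a visit announcement arrives (or when the robot itself visits a vertex), so undelivered messages simply leave stale estimates. The sketch below illustrates such a local table with hypothetical names; it is not the actual data structure used in the implementation.

    import time

    class LocalIdlenessTable:
        """Each robot's local estimate of when every vertex was last visited.
        Lost messages leave the corresponding entries stale, i.e. the robot's
        information about the state of the system becomes incomplete."""

        def __init__(self, vertices):
            t0 = time.time()
            self.last_visit = {v: t0 for v in vertices}

        def on_visit_message(self, vertex, visit_time):
            # keep the most recent visit time known for this vertex
            self.last_visit[vertex] = max(self.last_visit[vertex], visit_time)

        def instantaneous_idleness(self, vertex):
            return time.time() - self.last_visit[vertex]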

If the robots cannot communicate at all, the performance of the algorithm will degrade severely. This will occur especially in larger teams, which are much more affected by the lack of coordination in the multi-robot system, as robots will constantly interfere with one another. In this case, robots act greedily, do not share their intentions and often compete for the same vertices. Performance is expected to degrade gracefully if only communication with neighbors is allowed. In this situation, robots are able to coordinate by not competing for the same goals and not interfering with teammates; despite that, they have no contact with agents that are further away and, as a consequence, they quite often make uninformed decisions. We have conducted a similar analysis in our past work (Portugal and Rocha 2013b), where we employed an identical distributed communication protocol, despite using a different patrol strategy. Therein, we were able to verify the robustness of the multi-robot patrolling algorithm to communication failures.

Fig. 12 A simulation in the “ISR-Floor0” environment with a team of 6 robots

5.2.4 Additional simulations in the environment used in the real world experiments

In this section, we discuss additional simulations that employ the CBLS approach described in this paper, with varying team sizes, in the same map used for the real world experiments. Our goal is to assess the realism of the simulator when using the same software layer that was used in the multi-robot experiments.

Similarly to Sect. 5.2.1, we have deployed teams of robots with \(R=\{1,2,3,4,5,6\}\) in the “ISR-Floor0” environment, under the same conditions described therein. However, this time we have used the Stage multi-robot simulator instead of the real robot hardware. Once again, we have run three trials for each configuration. The results obtained are depicted in Table 12.

Comparing Table 12 with Table 8, it can be seen that the results drawn from the simulation tests are very close to those obtained with the real robots. This resemblance translates into an average difference of 1.55 % in the \(\overline{\mathcal {I}_{\mathcal {G}}}\) values, 5.49 % in the \(\max (\overline{\mathcal {I}_{\mathcal {V}}})\) values, 2.75 % in the \(\widetilde{\mathcal {I}_{\mathcal {G}}}\) values, and 4.05 % in the \(\sigma \) values. These differences are remarkably low, which suggests that such simulations can be used with a high degree of confidence to estimate the actual results with physical robots.

Nevertheless, the results extracted from the additional simulations tend to be marginally lower than those obtained with the real robots, and the variance between trials within the same configuration is smaller. Thus, it is our belief that multi-robot simulations in Stage provide an accurate yet slightly optimistic approximation of real world results; the lower variance is possibly linked to real phenomena that are not fully modeled in simulation, such as wheel slip, robot assembly properties, or delays in processing sensor data and producing actuator commands.

Figure 12 shows a snapshot of a simulation with 6 virtual Pioneer robots in the environment. The results reported demonstrate how realistic such simulations can be in the realm of multi-robot applications such as patrolling.

6 Conclusions

In this work, cooperative multi-agent learning has been addressed in order to solve the patrolling problem in a distributed way. Each robot decides its local patrolling moves online, without requiring any central planner. Decision-making is based upon Bayesian reasoning on the state of the system, considering the history of visits and teammates’ actions, so as to promote effective coordination of the behavior of the patrolling agents. Concurrent reward-based learning has been adopted given that, in this domain, decomposing the problem reduces the complexity of the general mission by distributing the computational load among the independent learners.

Experimental results have shown that the method is able to effectively tackle the problem, since it can deal with uncertainty and actions are selected not only according to prior knowledge about the problem, but also to the state of the system at the time, resulting in adaptive, effective and distributed cooperative patrolling. Moreover, it was shown that our Bayesian update method can be applied in different types of environments, independently of their topology, withstands failures of robotic patrol units, and achieves superior performance. In fact, when compared with other state of the art approaches through simulations, CBLS generally outperforms them independently of team size, and in the real world experiments conducted, the approach was able to obtain near optimal results, which is particularly remarkable given the limited search space considered and the distributed nature of the approach. This demonstrates the potential of the proposed multi-robot patrolling strategy for real world applications.

In the future, it would be interesting to relax the assumption of perfect communication, testing the performance of CBLS under communication failures or allowing only local interactions between robots within a certain range. We would expect results similar to those reported in Portugal and Rocha (2013b), which showed that distributed patrolling approaches are robust to communication failures, that performance degrades as the failure ratio increases, and that a limited communication range still allows the team to coordinate and avoid competing for the same vertices of the graph, albeit with incomplete information about the state of the system. Finally, addressing the issue of scalability, we intend to work on estimation methods for dimensioning the size of the robot team for a given patrolling task, according to the environment topology and temporal constraints.