1 Introduction

In the past several years, notable advances in mobile robotics have occurred, and roboticists have increasingly turned to Artificial Intelligence (AI) techniques to endow robots with perception, reasoning, planning and learning capabilities. Even though single agent solutions are still one step ahead of general multi-agent solutions, with the robustness inherited from distributed AI, multi-agent systems have been increasingly proposed, providing tools for the development of complex systems and mechanisms for coordination of the behavior of independent agents (Stone and Veloso 2000).

Considerable scientific work presented recently spans the boundaries of Robotics and AI, and in the particular case of multi-robot systems (MRS), these works are usually verified through multi-agent simulations or controlled test scenarios. However, recent attempts to employ robot teams in real world environments have become evident, e.g. in the works of Iocchi et al. (2011), Pippin et al. (2013), Agmon et al. (2012), and others. Successful examples of solutions that have proliferated in public places include automated guided vehicles (AGVs)Footnote 1 and the Santander Interactive Guest Assistants (SIGA),Footnote 2 and teams of multiple robots have been increasingly used in military and security applications, taking advantage of space distribution, parallelism, task decomposition and redundancy. In this context, there have been several advances in multi-robot patrolling and coverage, map learning, graph exploration and networked robots (Parker 2008).

Security is a fundamental task with unquestionable impact on society. Combining the progress witnessed at the behavior level with the technological evolution observed in the last decades, it becomes clear that robot assistance can be a valuable resource in surveillance missions. In this article, we propose a distributed multi-robot solution for indoor patrol. We believe that employing teams of robots for active surveillance tasks has several advantages over, for instance, a camera-based passive surveillance system. Robots are mobile and have the ability to travel in the field, collect environmental samples, act or trigger remote alarm systems, and inspect places that can be hard for static cameras to capture. These capabilities free human operators from executing tedious, repetitive or monotonous tasks, e.g. mine clearing or patrolling in hazardous environments, enabling them to be occupied in nobler tasks like monitoring the system from a safer location (Murphy 2004) and taking advantage of the robots’ expendability.

The proposed multi-robot patrolling solution is based on Bayesian reasoning and learning agents, which can adapt to the dynamics of the system. We first conduct simulation experiments and then validate our results in a large real world environment. We show that the approach can be applied in different types of environment, independently of their topology; withstands failures of robotic patrol units; can be performed with heterogeneous robots; and achieves superior performance. Its potential as a solution for real world MRS is therefore demonstrated.

In the next section, a literature review is conducted and the contributions of the article are described. Afterwards, the multi-robot patrolling problem is defined and the performance metric is presented. Section 4 describes the distributed multi-robot patrolling strategy proposed in this article. Section 5 presents the results obtained both through simulations and hardware experiments, as well as a discussion of the results obtained. Finally, the article ends with conclusions and open issues for future research.

2 Related work

To patrol is literally defined as “the activity of going around or through an area at regular intervals for security purposes” (Vocabulary.com 2015). The multi-robot patrolling problem (MRPP) requires coordination of the agents’ decision-making with the ultimate goal of achieving optimal group performance, while visiting every position in the environment (or at least those that need surveillance). Additionally, it aims at monitoring infrastructures, obtaining information, searching for objects, detecting anomalies and clearing areas in order to guard the grounds from intrusion. Consequently, performing a patrolling mission with a team of any given number of autonomous and cooperative robots distributed in space represents a complex challenge. We propose to solve the area patrolling problem with a team of arbitrarily many robots, focusing on intelligent strategies for coordinating the team so as to visit all the points that need surveillance inside a target area, as effectively as possible. Thus, in this context, we consider the problems of border/perimeter patrol (Elmaliach et al. 2008; Marino et al. 2013) and adversarial patrol (Basilico et al. 2009; Agmon et al. 2011a) out of the scope of this work.

Despite its high potential utility in security applications, only recently has the MRPP been rigorously addressed. The environment to patrol is commonly abstracted through a navigation graph (cf. Fabrizi and Saffiotti 2000), and numerous works explore graph theory tools like spanning trees (Fazli et al. 2013; Gabriely and Rimon 2001) or graph partitioning (Sak et al. 2008; Stranders et al. 2012) to compute minimal-cost cycles that assign efficient routes to each robot in the patrolling mission. Auctions and market-based coordination are also popular in the multi-robot patrolling literature, as in the case of Pippin et al. (2013) and Poulet et al. (2012), where agents bid to exchange vertices of the patrol graph to increase overall patrol performance. Numerous other concepts have already been explored, such as task allocation (Sempé and Drogoul 2003), artificial forces (Sampaio et al. 2010; Jansen and Sturtevant 2008), Gaussian processes theory (Marino et al. 2012), evolutionary algorithms (Aguirre and Taboada 2012), linear programming modeling (Keskin et al. 2012), and others.

Most contributions to the literature focus on presenting strategies to patrol any given environment with a team of mobile agents, the majority verified through simulations. Nevertheless, several important theoretical contributions to the MRPP have been presented and, using the idleness criterion (cf. Sect. 3), it has been shown that the problem is NP-hard (Chevaleyre 2004; Ruan et al. 2005). Based on a topological representation of the environment and using global/centralized information, it has been proved that optimal patrolling can be obtained for the single robot case by solving the Traveling Salesman Problem (TSP) (Chevaleyre 2004), a classical computationally complex problem. As for the multi-agent situation, the problem is even more complex and optimal performance depends on the environment topology and the number of robots. However, the literature commonly reports that theoretically superior performance can be obtained either by optimal k-way graph partitioning, especially for a high number of agents or graphs with unbalanced edges (Pasqualetti et al. 2012; Agmon et al. 2011b); or by having all robots follow the same TSP cycle, equally distributed in time and space, especially for a low number of agents or balanced graphs (Chevaleyre 2004; Smith and Rus 2010). Both the TSP and the graph partitioning problem are NP-hard. Therefore, solving these problems is non-trivial, particularly in sparse topologies, which is the case of most real world environments. For instance, TSP cycles may not even exist in such situations (recurrent in non-complete graphs) and the tour must be reduced to a minimum-cost closed walk in the graph (Portugal et al. 2014). Additionally, these solutions lead to predefined trajectories for all robots, which commonly cannot adapt to non-standard situations (e.g. robot failures or heterogeneous agents with different speed profiles) without specific recovery behaviors. Furthermore, such recovery behaviors need to be triggered by a centralized entity with global knowledge about the system. Another disadvantage is that deterministic patrols allow an external observer to easily apprehend the trajectory pattern. Thus, we turn to probabilistic strategies with distributed intelligence and autonomous decision-making robots. Besides being non-deterministic and much less predictable, these strategies do not have a central point of failure and can potentially scale to larger teams.

The creation of adaptive behaviors that allow agents to learn how to effectively patrol a given scenario has shown to be extremely promising. Moreover, such adaptability allows the agents to cope with changes in the system or in the scenario. Nevertheless, the use of such techniques is far from straightforward. Certain works in this field have adopted machine learning methods. For instance, the pioneering approach of Santana et al. proposed to model the MRPP as a Q-learning problem in an attempt to enable automatic adaptation of the agents’ strategies to the environment (Santana et al. 2004). In brief, agents have a probability of choosing an action from a finite set of actions, with the goal of maximizing a long-term performance criterion, in this case, node idleness. Two reinforcement learning techniques, using different communication schemes, were implemented and compared to non-adaptive architectures. Although not always scoring the best results, adaptive strategies were generally superior to the other compared strategies in most of the simulation experiments reported. The main attractive characteristics of this work are distribution (no centralized coordination is assumed) and the adaptive behavior of agents, which is usually highly desirable in this domain. Lauri and Koukam (2014) built upon the work of Santana et al. by redefining the state space and reward function associated with each agent, and were able to improve the method in specific situations, depending on graph topology and number of agents. Similarly, Ishiwaka et al. (2003) proposed reinforcement learning to predict the location of teammates as well as the movement direction towards a common target. In another surveillance application scenario, Ahmadi and Stone (2006) proposed an adaptive solution for area sweeping tasks based on decentralized partitioning and learning behaviors. This was verified in a small experimental arena with two AIBO robots endowed with cameras. However, as the robots had a wide field of view, the focus was not primarily on planning exhaustive surveillance routes, but rather on continuously moving the agents to positions from which they could observe at a distance.

As an alternative to reinforcement learning and adaptive behaviors, some strategies have followed stochastic approaches that benefit from probabilistic decision-making to overcome the deterministic nature of classical patrolling applications. For instance, in Marier et al. (2010), the patrolling problem is cast as a multi-agent Markov decision process, where reactive and planning-based techniques are compared. The authors concluded that both perform similarly, with the latter being slightly superior in general, since it looks further ahead than the former, which is purely local. However, the reactive technique runs much faster, suggesting that a simple and computationally cheaper approach can be used in many applications, instead of more complex strategies which only perform slightly better. Similarly to reactive agents, swarm-based solutions to the patrolling problem have been studied in Yanovski et al. (2003), Baglietto et al. (2009) and Cannata and Sgorbissa (2011), where it is assumed that agents have limited computational power, communication abilities and memory storage. Interesting collective patrol behavior emerges using these types of approaches.

In this article, a new distributed and adaptive approach for multi-robot patrol is presented. Each robot decides its local patrolling moves online, without requiring any central planner. Decision-making is based upon Bayesian reasoning on the state of the system, considering the history of visits and teammates’ actions, so as to promote effective coordination between patrolling agents. Experimental results illustrate the advantages of using the proposed distributed technique, when compared to several state of the art strategies in simulation, and the well-known TSP solution in real world results.

2.1 Contributions

To summarize, the contributions of the work to the state of the art are as follows:

  • We propose a new probabilistic, distributed and scalable approach to solve the MRPP, whose effectiveness is attested in the experiments conducted;

  • We employ Bayesian decision-making together with adaptive learning in the context of the MRPP for the first time, to the best of our knowledge, yielding intelligent patrolling behavior within the robot team;

  • We show the advantages of using the proposed algorithm by assessing its performance and effectiveness using realistic simulations in Stage/ROS (Quigley et al. 2009), and comparing these results with those reported in Portugal and Rocha (2013a) where several patrolling approaches were tested in the same conditions;

  • We implement the approach in a large indoor scenario with up to six physical robots, and compare it with the classical multi-agent TSP approach, which is theoretically optimal using one robot and nearly-optimal for generic teams of robots;

  • Fault-tolerance and scalable behaviors are verified in real world experiments, as well as the possibility to use the approach in heterogeneous teams of robots.

In the past, we have shown that naive Bayesian decision models have the potential to tackle the patrolling problem, by presenting methods based on conditional probability distributions and without agent learning behaviors (Portugal and Rocha 2013b). In this work, we intend to go much further by presenting a new strategy that rectifies weaknesses previously identified. We describe a probabilistic multi-robot patrolling strategy, where a team of concurrent learning agents adapt their moves to the state of the system at the time, using Bayesian decision rules based on the robot’s accumulated experience and distributed intelligence. When patrolling a given site, each agent evaluates the context and adopts a reward-based learning technique that influences future moves. Learning of the likelihood distributions involved in the robots’ decisions is done online, along the mission, and robots are constantly updating their prior information, i.e. their past experience.

This article is a culmination of two preliminary works (Portugal et al. 2013a, b), yet it clearly distinguishes itself from the previous publications by presenting an extended literature survey and a more complete Bayesian model, which includes vertex lookahead (cf. Sect. 4.4) in order to solve the MRPP using multi-agent concurrent reward-based learning. In addition, the implementation of a fully distributed system is described, carrying out completely new experiments both in simulation and in a real world indoor infrastructure and discussing in detail the implementation aspects. Also, unlike the two previous works, we assess through real world experiments the adaptability, scalability and fault-tolerant nature of the proposed approach.

3 Problem definition

In this article, the problem of efficiently patrolling inside a given area with an arbitrary number of robots is studied. Agents are assumed to have an a priori representation of the environment and, through a graph extraction algorithm (Portugal and Rocha 2013c), they are able to acquire an undirected, connected and metric navigation graph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\), which enables them to assess the topology of their surroundings. \(\mathcal {G}\) is composed of vertices \(v_{i}\in \mathcal {V}\) and edges \(e_{i,j}\in \mathcal {E}\). Each vertex represents a specific location that must be visited regularly and each edge represents the connectivity between these locations, having a weight \(|e_{i,j}|\) defined by the metric distance between \(v_i\) and \(v_j\). \(|\mathcal {V}|\) and \(|\mathcal {E}|\) represent the cardinalities of the sets \(\mathcal {V}\) and \(\mathcal {E}\), respectively.
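
For concreteness, the following minimal Python sketch shows one possible in-memory representation of the navigation graph \(\mathcal {G}=(\mathcal {V},\mathcal {E})\) described above; the class and method names are illustrative and are not part of the original implementation.

from collections import defaultdict

class NavGraph:
    """Undirected metric navigation graph G = (V, E); names are illustrative."""
    def __init__(self):
        self.adj = defaultdict(dict)        # adj[v_i][v_j] = |e_{i,j}| (metric distance)

    def add_edge(self, vi, vj, weight):
        # G is undirected, so the weight is stored in both directions
        self.adj[vi][vj] = weight
        self.adj[vj][vi] = weight

    def neighbors(self, v):
        return list(self.adj[v].keys())     # open neighborhood N_G(v)

    def degree(self, v):
        return len(self.adj[v])             # deg(v)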

Informally, a good strategy is one that minimizes the time lag between two passages to the same place, for all places. Thus, the MRPP can be reduced to coordinating robots so that they frequently visit all \(v_{i} \in \mathcal {G}\), ensuring the absence of atypical situations with regard to a predefined optimization criterion.

In order to address and compare the performance of different patrolling algorithms, it is important to establish an evaluation metric. Diverse criteria have been previously proposed to assess the effectiveness of multi-robot patrolling strategies. Typically, these are based on the idleness of the vertices (Almeida et al. 2004), the frequency of visits (Agmon et al. 2011b) or the distance traveled by agents (Iocchi et al. 2011). In this work, the first one has been considered, given that it measures the elapsed time since the last visit from any agent in the team to a specific location. The idleness metric uses time units, which is particularly intuitive, e.g. in the analysis of how long vertices have not been visited for, or comparing different patrolling strategies. In the following equations, we define important variables used in the remaining sections of the article.

The instantaneous idleness of a vertex \(v_{i}\in \mathcal {V}\) in time step t is defined as:

$$\begin{aligned} \mathcal {I}_{v_i}(t) = t - t_l, \end{aligned}$$
(1)

wherein \(t_l\) corresponds to the last time instant when the vertex \(v_{i}\) was visited by any robot of the team. Consequently, the average idleness of a vertex \(v_{i}\in \mathcal {V}\) in time step t is defined as:

$$\begin{aligned} \overline{\mathcal {I}_{v_i}} (t) = \frac{ \overline{\mathcal {I}_{v_i}} (t_l)\cdot C_i + \mathcal {I}_{v_i}(t)}{C_i+1}, \end{aligned}$$
(2)

where \(C_i\) represents the number of visits to \(v_i\). Considering now \(\overline{\mathcal {I}_{\mathcal {V}}}\) as the set of the average idlenesses of all \(v_i \in \mathcal {V}\), given by:

$$\begin{aligned} \overline{\mathcal {I}_{\mathcal {V}}}(t) = \{ \overline{\mathcal {I}_{v_1}}(t), \ldots , \overline{\mathcal {I}_{v_i}}(t), \ldots , \overline{\mathcal {I}_{v_{|\mathcal {V}|}}}(t) \}, \end{aligned}$$
(3)

the maximum average idleness of all vertices \(\max (\overline{\mathcal {I}_{\mathcal {V}}})\) in time step t is defined as:

$$\begin{aligned} \max (\overline{\mathcal {I}_{\mathcal {V}}})(t) = \max \{\overline{\mathcal {I}_{v_1}} (t),\ldots , \overline{\mathcal {I}_{v_i}} (t),\ldots ,\overline{\mathcal {I}_{v_{|\mathcal {V}|}}} (t)\}. \end{aligned}$$
(4)

For simplicity of notation, let us omit (t) whenever timing is not relevant. Finally, in order to obtain a generalized measure, the average idleness of the graph \(\mathcal {G}\) (\(\overline{\mathcal {I}_{\mathcal {G}}}\)) is defined as:

$$\begin{aligned} \overline{\mathcal {I}_{\mathcal {G}}} = \frac{1}{|\mathcal {V}|}~ \sum \limits _{i=1}^{|\mathcal {V}|}\overline{\mathcal {I}_{v_i}}. \end{aligned}$$
(5)

A similar assumption to other works in the literature (cf. Chevaleyre 2004; Almeida et al. 2004) is taken in the beginning of the experiments, where for all \(v_{i} \in \mathcal {V}\), \(\mathcal {I}_{v_i}(0) = 0\), as if every vertex had just been visited at the beginning of the mission. Consequently, there is a transitory phase in which the \(\overline{\mathcal {I}_{\mathcal {G}}}\) values tend to be low, not corresponding to the reality in steady-state, as will be seen in the results section. For this reason, the final \(\overline{\mathcal {I}_{\mathcal {G}}}\) value is measured only after convergence in the stable phase.
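
A hedged sketch of the idleness bookkeeping defined in Eqs. (1), (2), (4) and (5) is given below; it assumes the averages are updated at visit instants, and all names are illustrative rather than taken from the article.

class IdlenessMonitor:
    """Tracks instantaneous and average idleness per vertex (Eqs. 1-5)."""
    def __init__(self, vertices):
        self.last_visit = {v: 0.0 for v in vertices}    # t_l; I_v(0) = 0, as assumed above
        self.avg = {v: 0.0 for v in vertices}           # running average idleness (Eq. 2)
        self.visits = {v: 0 for v in vertices}          # C_i

    def instantaneous(self, v, t):
        return t - self.last_visit[v]                   # Eq. (1)

    def register_visit(self, v, t):
        idl = self.instantaneous(v, t)
        c = self.visits[v]
        self.avg[v] = (self.avg[v] * c + idl) / (c + 1) # Eq. (2)
        self.visits[v] = c + 1
        self.last_visit[v] = t

    def max_avg(self):
        return max(self.avg.values())                   # Eq. (4)

    def graph_avg(self):
        return sum(self.avg.values()) / len(self.avg)   # Eq. (5)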

Considering a patrol path as an array of vertices of \(\mathcal {G}\), the multi-robot patrolling problem with an arbitrary team of R robots can be described as the problem of finding a set of R paths \({\mathbf x}\) that visit all vertices \(v_{i}\in \mathcal {V}\) of \(\mathcal {G}\), with the overall team goal of minimizing \(\overline{\mathcal {I}_{\mathcal {G}}}\):

$$\begin{aligned} f =\arg \!\min _{{\mathbf x}}(\overline{\mathcal {I}_{\mathcal {G}}}), \end{aligned}$$
(6)

By finding:

$$\begin{aligned} \mathbf {x} = \{x_1,\ldots , x_r,\ldots , x_R \}, \end{aligned}$$
(7)

Such that:

$$\begin{aligned}&x_{r} = \{v_{a}, v_{b}, \ldots \},\\&v_a, v_b, \ldots \in \mathcal {V},\nonumber \\&1 \le r \le R, R \in \mathbb {N},\nonumber \end{aligned}$$
(8)

Subject to:

$$\begin{aligned} \forall v_{i} \in \mathcal {V}, \exists x_{r} \in {\mathbf x}: v_{i} \in x_{r}. \end{aligned}$$
(9)

Note that \(x_{r}\) represents the patrolling path of robot r, which has an arbitrary dimension that depends on each robot’s decisions, and \(v_a, v_b, \ldots \) are vertices that are successively connected in \(\mathcal {V}\). The union of all R paths should be equal to the full set of vertices \(\mathcal {V}\), and different paths \(x_r, x_s\) may have common vertices. In this work, the route \(x_{r}\) of each robot is computed online by each autonomous agent in order to adapt to the system’s needs.

4 Bayesian model for multi-robot patrolling

In a previous work of the authors (Portugal and Rocha 2013b), simple preliminary Bayesian-based techniques to tackle the MRPP were studied. Even though the results obtained were satisfactory, two main drawbacks were identified: a uniform prior distribution was adopted, assuming that all decisions were equiprobable; and the likelihood distributions were immutable, representing a fixed function of random variables.

In this work, robots are endowed with increased intelligence, since the previous Bayesian-inspired models are extended with likelihood reward-based learning and continued prior updating. More specifically, the model represents the decision of moving from one vertex of the graph to another. For \(\beta \) neighbors of the current vertex \(v_0\), where \(\beta =deg(v_0)\),Footnote 3 the model is applied \(\beta \) times. Each decision is considered independent and the agents have the ability to choose the action with the greatest expectation of utility, weighted by the effects of all possible actions. Consequently, each robot’s patrol route is built progressively, at each decision step, adapting to the system’s needs, i.e. aiming at minimizing \(\overline{\mathcal {I}_{\mathcal {G}}}\). In the following subsections, the Concurrent Bayesian Learning Strategy (CBLS) to solve the MRPP is detailed, and the pseudo-code of the approach, running on each individual robot of the team, is presented in Algorithm 1.

Algorithm 1 Pseudo-code of the CBLS approach, running on each individual robot of the team
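
As a complement to Algorithm 1, the sketch below outlines the main loop that each robot could run; the callables passed as arguments stand for the components detailed in Sects. 4.1 to 4.4 and are placeholders for illustration, not the authors’ actual interfaces.

def cbls_loop(robot, monitor, choose_next_vertex, broadcast, update_rewards):
    # One instance of this loop runs on each robot; there is no central planner.
    v0 = robot.current_vertex()
    while robot.mission_active():
        vi, posterior = choose_next_vertex(v0)       # Bayesian decision over the neighbors of v0
        broadcast(robot.id, v0, vi)                  # announce current and intended vertex (Sect. 4.3)
        robot.goto(vi)                               # travel along edge e_{0,i}
        monitor.register_visit(vi, robot.now())      # update idleness statistics (Sect. 3)
        update_rewards(v0, vi, posterior)            # reward-based likelihood update (Sect. 4.2)
        v0 = vi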

4.1 Distribution modeling

As stated before, when reaching a vertex \(v_0\) of the navigation graph \(\mathcal {G}\), each robot is faced with a decision stage, where it must decide the direction it should travel next (cf. Fig. 1). To that end, two fundamental random variables are defined. The first one is boolean and simply represents the act of moving (or not) to a neighbor vertex:

$$\begin{aligned} move_i = \lbrace true, false \rbrace , \end{aligned}$$
(10)

while the second one is called arc strength \(\theta _{0,i}\), which represents the suitability of traveling to a neighbor \(v_i\) using the edge that connects \(v_0\) to \(v_i\):

$$\begin{aligned} \theta _{0,i} \in \mathbb {R}; \quad 0,i \in \mathbb {N}_0. \end{aligned}$$
(11)

Fig. 1 Illustration of a patrolling decision instance. In this example, the robot may choose to visit one out of 4 neighbor vertices. Considering a generic neighbor \(v_x\), \(\mathcal {I}_{v_x}\) is the instantaneous idleness of \(v_x\); \(\overline{\mathcal {I}_{v_x}}\) is the average idleness of \(v_x\); \(e_{0,x}\) is the edge that connects the current vertex \(v_0\) and \(v_x\), with an edge weight \(|e_{0,x}|\); \(C_x\) gives the number of visits to \(v_x\) and \(\theta _{0,x}\) represents the arc strength of traveling from \(v_0\) to \(v_x\)

Note that in \(\mathcal {G}\), an edge \(e_{j,k}\) represents a connection from \(v_j\) to \(v_k\) and vice versa. An edge \(e_{j,k}\) has an edge cost or weight \(|e_{j,k}|=|e_{k,j}|\) given by the distance between the two vertices. Thus, by definition \(\mathcal {G}\) is undirected, since \(e_{j,k}=e_{k,j}\). Nevertheless, the term “arc” is used intentionally for this new variable, since it is affected by the direction of travel. In a situation where an agent is at \(v_j\), it will look at the suitability of traveling to \(v_k\), given by \(\theta _{j,k}\). Under those circumstances, the suitability of traveling in the opposite direction is not relevant, and \(\theta _{j,k} \ne \theta _{k,j}\). As a consequence, the set of all \(\theta _{i,j}\) variables, denoted as \(\varvec{\theta }\), has \(2|\mathcal {E}|\) elements, where \(|\mathcal {E}|\) is the cardinality of the set of edges \(\mathcal {E}\) of \(\mathcal {G}\); informally, higher values of arc strength lead to the edge being traversed more often in the specified direction. Note that this variable is chosen such that the collective patrolling strategy can adapt to any environment topology.

In this work, agents compute the degree of belief (i.e. a probability) of moving to a vertex \(v_i\), given the arc strengths, by applying Bayes rule:

$$\begin{aligned} P(move_i|\theta _{0,i}) = \frac{ P(move_i) P(\theta _{0,i}|move_i)}{ P(\theta _{0,i}) }. \end{aligned}$$
(12)

The posterior probability \(P(move_i|\theta _{0,i})\) is estimated via Bayesian inference from the prior \(P(move_i)\) and likelihood \(P(\theta _{0,i}|move_i)\) distributions. The denominator term is regarded as a normalization factor (Jansen and Nielsen 2007), being often omitted for the sake of simplicity.

The prior represents the belief obtained from analyzing past data. Naturally, in the MRPP, prior information about each vertex is encoded in the average idleness \(\overline{\mathcal {I}_{v_i}}\) of a vertex \(v_i\) given by (2). Therefore, \(P(move_i)\) is defined as:

$$\begin{aligned} P(move_i) = \frac{\overline{\mathcal {I}_{v_i}}}{ \sum \limits _{k=1}^{\beta }\overline{\mathcal {I}_{v_k}} }, \end{aligned}$$
(13)

thus decisions of moving to vertices with higher average idleness intuitively have higher probability. During the patrol mission, robots are continuously visiting new places and the \(\overline{\mathcal {I}_{\mathcal {V}}}\) values change over time. Each agent computes these values internally by tracking its own visits to \(\mathcal {V}\) and receiving messages from other teammates when they arrive at a new vertex. In order to make an informed decision, at each decision step, the agent updates the prior information through (13), just before applying (12) to obtain a degree of belief of moving to a neighbor vertex \(v_i\).
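
A possible implementation of the prior of Eq. (13) is sketched below; the uniform fallback for the degenerate case in which all average idlenesses are zero (e.g. at the very start of the mission) is an assumption, and all names are illustrative.

def prior_move(neighbors, avg_idleness):
    # neighbors: the beta adjacent vertices of v_0; avg_idleness[v]: current average idleness of v
    total = sum(avg_idleness[v] for v in neighbors)
    if total == 0.0:
        return {v: 1.0 / len(neighbors) for v in neighbors}   # uniform fallback (assumption)
    return {v: avg_idleness[v] / total for v in neighbors}    # Eq. (13)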

In addition to the prior distribution, it is also necessary to define the likelihood through a statistical distribution that models the arc strength \(\theta _{0,i}\). In the patrolling problem, agents must visit all \(v_i \in \mathcal {G}\); thus, theoretically, assigning a uniform value to every arc would not be unreasonable. However, in such a dynamic system, where the number of visits to different locations in the environment is permanently evolving, it is usually advantageous to avoid traversing certain edges at a given time and to favor the use of others, in order to improve performance. For example, it is highly undesirable that robots move along the same edges, as this is inefficient from the collective patrolling point of view. Also, moving in the same regions causes inter-robot interference (e.g. robots take longer to reach their goals if they have to avoid collisions with each other). Therefore, robots should coordinate themselves properly. Furthermore, task effectiveness is strongly related to the environment topology.

Hence, in the next subsection, a reward-based learning strategy to model and continually update the likelihood distribution is proposed in order to adapt to the system’s state according to previous decisions, having a high impact on the behavior of robots and aiming at optimizing the collective performance.

4.2 Multi-agent reward-based learning

In general, reward-based learning methods are quite attractive since agents are programmed through rewards and punishments without explicitly specifying how the task is to be achieved (Panait and Luke 2005). In this work, Bayesian learning is employed to estimate the likelihood functions. Since this is a cooperative multi-robot task with no centralized control, with decentralized and distributed information and asynchronous computation, multiple simultaneous learners (one per patrolling agent) are involved.

The concept of delayed reward with a 1-step horizon model is explored. Each agent chooses an action of moving from \(v_0\) to a neighbor \(v_i\), based on (12). After reaching \(v_i\), the information on its neighborhood has changed, namely the instantaneous idlenesses have been updated, i.e. \(\mathcal {I}_{v_i}(t) = 0\) and \(\mathcal {I}_{v_0}(t) > 0\). Using the information observed after making the move, a reward-based mechanism is used to penalize or reward the arcs involved in the decision to move from \(v_0\) to \(v_i\). This influences upcoming moves starting in \(v_0\), by introducing a bias towards arcs which ought to be visited in the future.

The reward-based learning method is explained next. When the robot decides which one of the \(\beta \) neighbor vertices of \(v_0\) is going to be visited next, each neighbor \(v_i\) will have an associated degree of belief given by the posterior probability. Therefore, it is possible to calculate the entropyFootnote 4:

$$\begin{aligned} H (move|\varvec{\theta }) = - \sum \limits _{i=1}^{\beta } P(move_i|\theta _{0,i})\,\log _2 (P(move_i|\theta _{0,i})), \end{aligned}$$
(14)

which measures the degree of uncertainty involved in the decision taken, being chosen for this reason as the basis for the punish/reward mechanism. The confidence on the decision taken is inversely proportional to the entropy H. Therefore, we use the entropy to quantify the rewards and penalties to assign. In this way, larger rewards and penalties are awarded to decisions with higher confidence (lower entropy). Note, however, that distinct \(v_i\) have different \(deg(v_i)\) and, as a result, \(\beta \) varies for each decision instant. Therefore, the entropy is normalized to assume values in [0, 1]:

$$\begin{aligned} \mathcal {H} (move|\varvec{\theta }) = \frac{H (move|\varvec{\theta })}{\log _2 (\beta ) } . \end{aligned}$$
(15)
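
The normalized entropy of Eqs. (14) and (15) can be computed as in the sketch below; returning zero for a forced move (\(\beta = 1\)) is an assumption made to avoid a division by zero, and the names are illustrative.

import math

def normalized_entropy(posterior):
    # posterior: dict mapping each of the beta neighbors to P(move_i | theta_{0,i})
    beta = len(posterior)
    if beta <= 1:
        return 0.0                                            # forced move: no uncertainty (assumption)
    h = -sum(p * math.log2(p) for p in posterior.values() if p > 0.0)   # Eq. (14)
    return h / math.log2(beta)                                # Eq. (15), values in [0, 1]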

After deciding and moving to a given \(v_k\), the robot computes rewards for each arc between \(v_0\) and its neighbor vertices \(v_i\) (including \(v_k\)) involved in the previous decision, using:

$$\begin{aligned} \gamma _{0,i} = \mathcal {S}_{i,0}(C_{i},\mathcal {I}_{\mathcal {V}}(t)) \cdot (1-\mathcal {H}(move|\varvec{\theta })), \end{aligned}$$
(16)

with:

$$\begin{aligned} \mathcal {S}_{i,0} \in \{-1,0,1\}. \end{aligned}$$
(17)

\(\mathcal {S}_{i,0}\) gives the reward sign, providing a quality assessment which determines whether a penalty (\(\mathcal {S}=-1\)), a reward (\(\mathcal {S}=1\)) or a neutral reward (\(\mathcal {S}=0\)) should be given. As can be seen, this piecewise function uses up-to-date information, namely the number of visits to \(v_i\), given by \(C_i\), and the set of current instantaneous idlenesses \(\mathcal {I}_{\mathcal {V}}(t)\). The sign of \(\mathcal {S}\) is obtained using the set of heuristic rules defined below, which are checked as soon as the agent reaches \(v_k\). For that purpose, it is first necessary to define the normalized number of visits to any neighbor vertex \(v_i\):

$$\begin{aligned} \zeta _i = \frac{C_i}{\deg (v_i)}. \end{aligned}$$
(18)

This is used in the punish/reward procedure given that vertices with higher degree are naturally more visited than vertices with lower degree, being often traversed to reach isolated vertices that tend to have a lower number of visits. The rules for assigning the sign \(\mathcal {S}_{i,0}\) of the rewards are given by the following piecewise function definition:

$$\begin{aligned}&\mathcal {S}_{i,0}(C_{i},\mathcal {I}_{\mathcal {V}}(t))\nonumber \\&\quad =\left\{ \begin{array}{ll} -1, &{} \mathbf{if }~ (\beta > 1) \wedge (\displaystyle \mathtt{argmax}_{j\in N_\mathcal {G}(v_0)} \zeta _j = i) \, \wedge \\ &{} (|\displaystyle \mathtt{argmax}_{j\in N_\mathcal {G}(v_0)} \zeta _j| = 1) \\ \\ -1, &{} \mathbf{if }~ (\beta > 1) \wedge (\displaystyle \mathtt{argmax}_{j\in N_\mathcal {G}(v_0)} \zeta _j = i) \, \wedge \\ &{} ( \forall k,\, |\mathtt{argmax}_{j \in N_{\mathcal {G}}(v_0)} \zeta _j| = k \rightarrow \mathcal {I}_{v_i}(t) \le \mathcal {I}_{v_k}(t))\\ \\ 1, &{} \mathbf{if }~ (\beta > 1) \wedge (\displaystyle \mathtt{argmin}_{j\in N_ \mathcal {G}(v_0)} \zeta _j = i) \, \wedge \\ &{} (|\displaystyle \mathtt{argmin}_{j\in N_\mathcal {G}(v_0)} \zeta _j| = 1)\\ \\ 1, &{} \mathbf{if }~ (\beta > 1) \wedge (\displaystyle \mathtt{argmin}_{j\in N_\mathcal {G}(v_0)} \zeta _j = i) \, \wedge \\ &{} ( \forall k,\, |\mathtt{argmin}_{j \in N_{\mathcal {G}}(v_0)} \zeta _j| = k \rightarrow \mathcal {I}_{v_i}(t) \ge \mathcal {I}_{v_k}(t)) \\ \\ 0, &{} \mathbf{otherwise }. \\ \end{array}\right. \end{aligned}$$
(19)

with:

$$\begin{aligned} \beta = \deg (v_0) = |N_\mathcal {G}(v_0)|, \end{aligned}$$
(20)

\(N_\mathcal {G}(v_0)\) represents the open neighborhood of \(v_0\), i.e. the set of adjacent vertices of \(v_0\). As such, the assignment of the sign \(\mathcal {S}_{i,0}\) respects the following criteria:

  • \(\mathcal {S}_{i,0} = -1\), when the degree of \(v_0\) is higher than one (\(\beta >1\)) and the normalized number of visits to \(v_i\) (\(\zeta _i\)) is maximal in the neighborhood of \(v_0\). In case there is more than one vertex with maximal \(\zeta \), a negative reward is given to the one with lower instantaneous idleness \(\mathcal {I}_{v_j}(t)\) between those.

  • \(\mathcal {S}_{i,0} = 1\), when the degree of \(v_0\) is higher than one (\(\beta >1\)) and the normalized number of visits to \(v_i\) (\(\zeta _i\)) is minimal in the neighborhood of \(v_0\). In case there is more than one vertex with minimal \(\zeta \), a positive reward is given to the one with higher instantaneous idleness \(\mathcal {I}_{v_j}(t)\) between those.

  • \(\mathcal {S}_{i,0} = 0\), in every other situation that differs from the above.

These rules guarantee that, when there is more than one vertex involved in the decision, exactly one reward and one penalty are assigned.
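
The sketch below illustrates the reward-sign rules of Eq. (19), evaluated for each neighbor \(v_i\) of \(v_0\) and following the three criteria listed above; the function and argument names are illustrative only.

def reward_sign(vi, neighbors, zeta, inst_idleness):
    # vi: neighbor being evaluated; neighbors: N_G(v_0);
    # zeta[v] = C_v / deg(v) (Eq. 18); inst_idleness[v] = I_v(t)
    if len(neighbors) <= 1:
        return 0                                              # beta = 1: neutral reward
    max_z = max(zeta[v] for v in neighbors)
    min_z = min(zeta[v] for v in neighbors)
    argmax = [v for v in neighbors if zeta[v] == max_z]
    argmin = [v for v in neighbors if zeta[v] == min_z]
    if vi in argmax and inst_idleness[vi] == min(inst_idleness[v] for v in argmax):
        return -1                                             # penalty: most visited neighbor
    if vi in argmin and inst_idleness[vi] == max(inst_idleness[v] for v in argmin):
        return 1                                              # reward: least visited neighbor
    return 0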

At the beginning of the mission, when \(t=t_0\), all arc strengths \(\theta _{0,i}\) are equal to a real positive number \(\kappa \):

$$\begin{aligned} \forall \theta _{0,i} \in \varvec{\theta }, \, \theta _{0,i}(t_0) = \kappa . \end{aligned}$$
(21)

As the mission evolves, the agent updates \(\theta _{0,i}\) every time it visits a new vertex through:

$$\begin{aligned} \theta _{0,i}(t) = \theta _{0,i} (t-1) + \gamma _{0,i} (t). \end{aligned}$$
(22)

Note that the larger the value of \(\kappa \) is set in (21), the less immediate influence the rewards received will have on \(\theta _{0,i}\). In all experimental tests conducted in this work, \(\kappa =1.0\) was used. This reward-based procedure is expected to make the values of \(\theta _{0,i}\) fluctuate as time goes by, informing robots of moves which are potentially more effective, but keeping in mind that robots must visit all vertices \(v_i\) in the patrolling mission.

Finally, the learnt likelihood distribution is obtained through normalization of \(\theta _{0,i}\):

$$\begin{aligned} P(\theta _{0,i}|move_i) = \frac{\theta _{0,i}}{\sum \limits _{j}^{|\mathcal {E}|}\sum \limits _{k}^{|\mathcal {E}|} \theta _{j,k}}, \end{aligned}$$
(23)

being updated at each decision step and making use of experience acquired in the past for future decisions.
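
The arc-strength update of Eqs. (16), (21) and (22) and the normalization of Eq. (23) can be sketched as follows; it assumes the dictionary theta is pre-populated with all \(2|\mathcal {E}|\) directed arcs initialized to \(\kappa =1.0\), and the names are illustrative.

KAPPA = 1.0                                                   # kappa in Eq. (21)

def init_arc_strengths(adjacency):
    # adjacency[v] lists the neighbors of v; every directed arc starts at kappa (Eq. 21)
    return {(vj, vk): KAPPA for vj in adjacency for vk in adjacency[vj]}

def update_arc(theta, v0, vi, sign, norm_entropy):
    gamma = sign * (1.0 - norm_entropy)                       # Eq. (16)
    theta[(v0, vi)] += gamma                                  # Eq. (22)

def likelihood(theta, v0, vi):
    return theta[(v0, vi)] / sum(theta.values())              # Eq. (23): normalization over all arcs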

4.3 Decision-making and multi-agent coordination

Having described how agents learn their likelihood distribution, it is necessary to address agent coordination to completely characterize the CBLS solution for patrolling tasks presented in this article. Being a concurrent learning approach, each agent adapts its behavior via its own learning process and has no control over or knowledge of how other agents behave, nor of their internal state, i.e. agents do not know their teammates’ likelihood distribution \(P(\theta _{0,i}|move_i)\) and do not predict their teammates’ moves. This reduces the complexity of the problem; however, it is necessary to guarantee the coordination of robots.

In collective missions with a common goal, multi-agent coordination plays a fundamental role in the success of the mission. Particularly in this context, it is highly undesirable that agents move to the same positions. Thus, the asynchronous and distributed communication system used to inform teammates of the current vertex \(v_0\) is augmented with the information of the vertex \(v_i\) chosen for the next move.

Therefore, by exchanging simple messages with teammates that only contain the robot ID r, the current vertex \(v_0\), and the next chosen vertex \(v_i\), each robot can update its information about the state of the system, namely the idleness values, and decide its moves taking that information into account, as well as its progressively acquired experience. When agents are close by, they can coordinate by checking whether a teammate has already expressed the intention to move to a given vertex \(v_i\) in the local neighborhood and, if so, removing it from the decision.

Finally, the decision-making process of the agent consists of choosing the move from \(v_0\) to the neighbor vertex \(v_j\) with the maximum probability among all possible decisions:

$$\begin{aligned} move_j = true : j= \underset{i \in N_{\mathcal {G}}(v_0)}{\displaystyle \mathtt{argmax}}\,P(move_i|\theta _{0,i}) \end{aligned}$$
(24)
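
Combining the posterior with the teammate intentions received through the messages described above, the decision step of Eq. (24) can be sketched as below; falling back to the full neighborhood when every neighbor has been claimed by a teammate is an assumption, and all names are illustrative.

def decide_next(neighbors, posterior, teammate_intentions):
    # neighbors: N_G(v_0); posterior[v] = P(move_v | theta_{0,v});
    # teammate_intentions: vertices announced as "next vertex" by nearby teammates
    candidates = [v for v in neighbors if v not in teammate_intentions]
    if not candidates:
        candidates = list(neighbors)                          # fallback (assumption)
    return max(candidates, key=lambda v: posterior[v])        # Eq. (24)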

4.4 Incorporating vertex look-ahead

Up to now, we have only considered agents that decide their moves based on knowledge of the local 1-step neighborhood. While it is true that expanding the search horizon increases the complexity of the problem, it is also true that it may lead to a higher number of correct decisions, since agents deliberate with additional information.

Therefore, we also propose to study a variation of the model presented before, by incorporating vertex look-ahead. Note however that, in this context, each agent is an independent concurrent learner and does not exchange its local beliefs about the system, only its current vertex and the intended vertex to visit in its immediate neighborhood. As a consequence, if agents plan too far ahead, their intentions may come into conflict, i.e. they may be planning their moves to inadvertently reach the same sites, which would be highly inefficient from the standpoint of the patrolling mission. The probability of such conflicts increases with the number of steps in the planning horizon. For this reason, and to manage the complexity of the system, we explore the concept of 2-step look-ahead, aiming to further improve the performance of the CBLS approach.

Fig. 2 Illustration of the 2-step look-ahead in the navigation graph

Prior to making a decision, agents will now look not only at the average idleness of the vertices \(v_i\) in their immediate neighborhood, but also at the neighbors of these (\(v_j, v_k,\ldots , v_{|N_\mathcal {G}(v_i)|}\)), in order to find the maximum average idleness among them, as shown in Fig. 2.

Accordingly, vertex look-ahead is incorporated in the prior distribution, by modifying (13):

$$\begin{aligned} P(move_i) = \frac{\omega \cdot \overline{\mathcal {I}_{v_i}}+(1-\omega )\cdot \max (\overline{\mathcal {I}_{N_\mathcal {G}(v_i)}})}{ \sum \limits _{k=1}^{|\mathcal {V}|}\omega \cdot \overline{\mathcal {I}_{v_k}} + (1-\omega )\cdot \max (\overline{\mathcal {I}_{N_\mathcal {G}(v_k)}})}, \end{aligned}$$
(25)

where:

$$\begin{aligned}&\omega \in [0,1], \end{aligned}$$
(26)
$$\begin{aligned}&\max (\overline{\mathcal {I}_{N_\mathcal {G}(v_i)}}) = \max \{ \overline{\mathcal {I}_{v_j}}, \overline{\mathcal {I}_{v_k}},\ldots , \overline{\mathcal {I}_{v|N_\mathcal {G}(v_i)|}} \}, \end{aligned}$$
(27)
$$\begin{aligned}&v_j, v_k,\ldots , v_{|N_\mathcal {G}(v_i)|} \in N_\mathcal {G}(v_i). \end{aligned}$$
(28)

The factor \(\omega \) assigns weights to the observations on the immediate neighborhood and on the 2-step neighborhood. When \(\omega =1.0\), no look-ahead is considered and (25) is equivalent to (13). The lower the \(\omega \) value, the higher the weight of observations beyond the 1-step neighborhood. However, intuitively, \(\omega \) should be set to values above 0.5 (and below 1.0), because the information on the immediate 1-step neighborhood is always reliable and up-to-date when the decision is made, while the information on the 2-step neighborhood may change by the time the robot reaches \(v_i\). Note also that, when reaching \(v_i\), the robot faces another decision instance and, in rare cases, may decide contrary to the initial plan, in case the settings in its surroundings have changed in the meantime. The influence of the \(\omega \) factor is analyzed in the preliminary simulation results presented in the next section, so as to establish an appropriate value to use in the experiments with physical robots.
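
The 2-step look-ahead prior of Eq. (25) could be implemented as in the sketch below; for simplicity it normalizes only over the candidate neighbors, which changes the scale of the prior but not the ranking used in the decision, and the names are illustrative.

def prior_move_lookahead(neighbors, avg_idleness, adjacency, omega=0.75):
    # neighbors: N_G(v_0); adjacency[v]: N_G(v); avg_idleness[v]: current average idleness of v
    def score(v):
        ahead = max(avg_idleness[u] for u in adjacency[v])    # max over N_G(v), Eq. (27)
        return omega * avg_idleness[v] + (1.0 - omega) * ahead
    total = sum(score(v) for v in neighbors)
    if total == 0.0:
        return {v: 1.0 / len(neighbors) for v in neighbors}   # uniform fallback (assumption)
    return {v: score(v) / total for v in neighbors}           # Eq. (25), up to normalization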

5 Results and discussion

In this section, experimental results to assess the performance of CBLS are presented and discussed. Section 5.1 reveals the outcome of simulation experiments in three environments with distinct graph connectivity, enabling an analysis of the effect of \(\omega \) on the mission and a comparison of our approach with other state of the art multi-robot patrolling strategies.

In the following subsection, experimental results with teams of physical robots in a large indoor scenario are discussed. CBLS is compared to the TSP cyclic algorithm, which finds the shortest closed walk on the graph by solving the TSP (Fazli et al. 2013) in non-complete graphs such as the one presented in the experiments. This benchmarking algorithm is optimal for the single robot case and near-optimal for multi-robot scenarios, being well suited for a comparative analysis.

5.1 Simulation results

In the preliminary simulation experiments, the main goal is to study the effect of the \(\omega \) parameter in the vertex look-ahead method and to enable comparisons with other strategies in the literature. To that end, the three environments illustrated in Fig. 3 have been used to test the approach with different team sizes of \(R = \{1, 2, 4, 6, 8, 12\}\) robots. The three illustrated topologies present different algebraic connectivity or Fiedler value \(\lambda \) (Fiedler 1973), a well-known metric of the connectivity of a graph given by the smallest non-zero eigenvalue of the graph’s Normalized Laplacian matrix. These topologies were used in Portugal and Rocha (2013a), where they were classified as: lowly (A), mildly (B) and highly (C) connected, having a Fiedler value of \(\lambda _A=0.0080\), \(\lambda _B=0.0317\) and \(\lambda _C=0.1313\), respectively. In this work, these are again adopted to enable comparative analysis against other MRPP strategies. Note, for example, that the graph of environment A (\(\mathcal {G}_A\)) has several dead-ends, i.e. vertices with degree 1. On the other hand \(\mathcal {G}_C\) is the most connected of the three, with several vertices with degree 4.

Fig. 3 Environments used in the experiments, with the respective topological maps

While collecting results in different scenarios, the same simulation setup was used for all strategies.

A recognized simulator with realistic modeling was chosen: the Stage 2D multi-robot simulator (Vaughan 2008). Stage models the robots’ dynamics and, together with ROS (Quigley et al. 2009), which was adopted to program the robots, was used to implement the experiments.

The graph information of the environment is loaded by every robot at the beginning of each simulation, which then runs the described algorithm. Robots navigate safely in the environment by heading towards their goals while avoiding collisions, through the use of the ROS navigation stack and an adaptive Monte Carlo localization approach. Additionally, all robots have non-holonomic constraints and move at the same nominal speed, traveling at a maximum velocity of 0.2 m/s.

All the experiments conducted in this article were repeated three times for each setting, and they all respect the same stopping condition, determined by 4 complete patrolling cycles, i.e. after every \(v_i \in \mathcal {G}\) has been visited at least four times. This stopping condition proved to be adequate since \(\overline{\mathcal {I}_\mathcal {G}}\) always converged, i.e. all results were obtained in the steady-state phase of the patrolling mission, as shown by the marginal mean difference in \(\overline{\mathcal {I}_\mathcal {G}}\) values between the third and fourth cycles of all experiments, which was around 2.5 %.

Tables 1, 2 and 3 present the mean performance results of the distributed patrolling strategy CBLS, described in this article, given by \(\overline{\mathcal {I}_\mathcal {G}}\) in seconds, using the environments of Fig. 3 with \(\omega =1.0\), \(\omega ={3}/{4}\) and \(\omega ={2}/{3}\). The same results are presented in more detail for each individual case in Tables 13, 14, 15, 16, 17, 18, 19, 20, 21, in the Appendix. The results confirm the intuition that superior performance can be obtained with \(0.5 < \omega < 1.0\), since, in general, the best results were obtained for \(\omega ={3}/{4}\). It is also interesting to verify that the results without look-ahead (\(\omega =1.0\)) are always inferior to those obtained with vertex look-ahead. This confirms that looking beyond the local neighborhood has the potential to increase the number of correct decisions of each agent and improve team performance. Being a complex problem, variables such as graph connectivity and team size have an important impact on the performance of every patrolling strategy. Thus, it was anticipated that no single fixed value of \(\omega \) would be the best for all configurations. Nevertheless, in the experiments conducted, \(\omega ={3}/{4}\) was superior in 14 configurations out of 18. In view of this, we consider \(\omega ={3}/{4}\) in the experiments with physical robots.

Table 1 Final \(\overline{\mathcal {I}_{\mathcal {G}}}\) values (in seconds) using CBLS in environment A with different \(\omega \)
Table 2 Final \(\overline{\mathcal {I}_{\mathcal {G}}}\) values (in seconds) using CBLS in Environment B with different \(\omega \)
Table 3 Final \(\overline{\mathcal {I}_{\mathcal {G}}}\) values (in seconds) using CBLS in Environment C with different \(\omega \)
Table 4 Final \(\overline{\mathcal {I}_{\mathcal {G}}}\) values (in seconds) using different state of the art strategies on Environment A
Table 5 Final \(\overline{\mathcal {I}_{\mathcal {G}}}\) values (in seconds) using different state of the art strategies on Environment B
Table 6 Final \(\overline{\mathcal {I}_{\mathcal {G}}}\) values (in seconds) using different state of the art strategies on Environment C

Using findings from our previous works, Tables 4, 5 and 6 were built. Therein, tests in the same three environments and benchmarking with several state of the art patrolling approaches were conducted, and we were able to compare the performance of CBLS against 7 other state of the art approaches,Footnote 5 using the \(\overline{\mathcal {I}_\mathcal {G}}\) metric. For each different strategy, all the robots started in the same positions when executing a patrolling mission in a given environment and with a given team size. This was done to promote a fair comparison between all strategies. For further details on the various strategies and tests previously conducted, the interested reader should refer to Portugal and Rocha (2013a, b).

Table 7 Cases where the \(\mathcal {I}_\mathcal {G}\) values of the CBLS algorithm were statistically significantly different from those of the other algorithms (bold), and not statistically significantly different (italic)
Fig. 4 Evolution of the likelihood distribution in a mission with 2 robots in Environment A, with \(\omega =0.75\). a robot 1 after 200 decisions; b robot 1 at the end of the mission; c robot 2 after 200 decisions; d robot 2 at the end of the mission. The red line represents the initial distribution when the mission started (Color figure online)

In general, CBLS clearly outperforms the other 7 state of the art approaches. This is the case even when no look-ahead is considered. The difference is apparent especially for the lowly (environment A) and mildly (environment B) connected environments. In the highly connected grid environment (map C), vertices generally have greater degree and there are usually many alternative routes to reach a given goal. This makes the other state of the art strategies perform better in this case than in less connected (more typical) environments. Nevertheless, CBLS presents a strong performance for the smaller team sizes of 1, 2 and 4 robots, and outperforms the other strategies for the larger team sizes of 6, 8 and 12 robots, which suggests that CBLS scales better than the remaining strategies. In fact, CBLS is superior to all other approaches in 83.33 % of the configurations tested. Additionally, these results show that CBLS is able to adapt to all kinds of environment topologies, independently of team size.

A factorial Analysis of Variance (ANOVA) was conducted aiming to examine the main effects and interactions between environments (environment A, B and C), team size (1, 2, 4, 6, 8 and 12 robots) and algorithm (CR, HCR, HPCC, CGG, MSP, GBS, SEBS and CBLS) on the performance value given by \(\mathcal {I}_\mathcal {G}\). A confidence level of 95 % was used to identify significant differences between factor levels. Prior analysis revealed that there were no significant outliers, i.e. no major data stood below the first quartile or above the third quartile in the dataset, and the data were normally distributed across all groups, as revealed by the Shapiro-Wilk’s test (\(p<0.05\)).

The analysis revealed that there is a main effect of the environment on the idleness values (\(F(2, 426)=12308.678\), \(p<0.001\), partial \(\eta ^2=0.989\)), given that these values are significantly higher in environment A when compared to environments B and C. Furthermore, the team size has a main effect on the idleness value (\(F(5,426)=27051.872\), \(p<0.001\), partial \(\eta ^2=0.998\)), since the idleness values decreased as more robots were used. More importantly, there was also a clear effect of the algorithm on the idleness values (\(F(7,426)=244.655\), \(p<0.001\), partial \(\eta ^2=0.858\)), which explains the differences obtained with each of the patrolling strategies tested. Pairwise comparisons between all algorithms across all combinations of team size and environment revealed that, in the case of the CBLS algorithm, the idleness values were significantly lower (\(p<0.001\)) than for all the other algorithms, except in specific cases in Environments B and C, as shown in Table 7. In these specific cases, especially in Environment C, the differences were not significant. However, descriptive statistics still show that the idleness values of CBLS were lower in most of these cases.

Moreover, there was a statistically significant interaction effect between team size and algorithm towards the idleness values (\(F(14,426)=43151.575\), \(p<0.001\), partial \(\eta ^2=0.827\)). As expected, pairwise comparisons between team size groups revealed that in all algorithms the idleness value significantly decreased as the team size increased. There was also a statistically significant interaction effect between the environment and algorithm towards the idleness values (\(F(35,426)=13.665\), \(p<0.001\), partial \(\eta ^2=0.627\)). Pairwise comparisons between environments revealed that in all algorithms the idleness value significantly decreased in environment C compared to environment A.
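
For reproducibility, a factorial ANOVA of this kind could be run, for instance, with Python and statsmodels as sketched below; the file name and column names are illustrative, since the underlying dataset is not distributed with the article.

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.read_csv("patrolling_results.csv")                  # one row per experimental run (illustrative)
model = ols("idleness ~ C(env) * C(team_size) * C(algorithm)", data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)                 # main effects and interaction terms
print(anova_table)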

Fig. 5 Overall results running CBLS with \(\omega ={3}/{4}\) and different team sizes

Looking now at the CBLS results in more detail, the evolution of the likelihood function in a patrolling mission with two robots in environment A and \(\omega ={3}/{4}\) is illustrated in Fig. 4. Note that each robot learns a different distribution depending on its acquired experience, and has no control over or knowledge of the internal state of its teammates. As expected, peaks in the histograms emerge with the increasing number of decisions. Despite that, it is also clear that values fluctuate around the initial uniform value (represented by the red line in each chart), which comes as a consequence of robots having to visit every vertex \(v_i \in \mathcal {G}\).

Fig. 6 Topological map of the “ISR-Floor0” environment

Moving on to the performance of the algorithm, the boxplot charts in Fig. 5 represent the \(\overline{\mathcal {I}_\mathcal {V}}\) values, in seconds, for each tested team size on all three maps. The average value is represented by a black cross, providing a generalized measure: the average graph idleness, \(\overline{\mathcal {I}_\mathcal {G}}\) (cf. Eq. 5). The ends of the blue boxes and the horizontal red line in between correspond to the first and third quartiles and the median values of \(\overline{\mathcal {I}_\mathcal {V}}\), respectively.

As expected, the idleness values decrease as the number of robots grows. Despite the increasing performance displayed by the CBLS approach, the individual contribution of adding more robots gradually diminishes with team size. Group productivity will eventually converge for a large R. In theory, productivity should grow as the team size scales up; however, spatial limitations increase the number of times the robots meet and, beyond a given R, it is argued that they will spend more time avoiding each other than effectively patrolling on their own.

Another interesting aspect illustrated in the boxplots of Fig. 5 is the fact that the median \(\widetilde{\mathcal {I}_\mathcal {G}}\) is close to the mean \(\overline{\mathcal {I}_\mathcal {G}}\) in all configurations, being usually lower. This means that the \(\overline{\mathcal {I}_\mathcal {V}}\) values are positively skewed, i.e. most of the values are below the average, \(\overline{\mathcal {I}_\mathcal {G}}\), and as a consequence, most outliers are above the third quartile.

On a more general note, visual inspection of the trajectories of robots using CBLS showed that the patrolling routes change over time so as to adapt to the system’s needs, making it difficult to predict a robot’s behavior, as opposed to most strategies presented in Portugal and Rocha (2013a, b). This is in fact one of the key advantages of using a distributed online approach, as the system continuously adapts even to simple events that occur during the course of the mission, e.g. a robot taking longer to reach a vertex due to an unexpected obstacle, occasional navigation recovery behaviors that delay visits to the following vertices, robots traveling at different speeds (cf. Sect. 5.2.2), robot failures (cf. Sect. 5.2.3), etc. The paths of each patrolling unit will continuously change, as robots have no physical restriction to move within the whole navigation graph, in order to maintain a minimum graph idleness according to the model presented.

This phenomenon, together with the promising results obtained, demonstrates the effectiveness of the approach and its potential to be applied in actual security systems with physical teams of robots.

5.2 Experiments with physical robots

Multi-robot patrol is mainly a practical problem and, in order for distributed intelligence systems to be useful in the real world, it is essential to go beyond simulation experiments and validate solutions that demonstrate the reliability of the proposed strategy in more demanding scenarios. In this section, the implementation of a system for multi-robot patrol in a large indoor environment is presented. Fully autonomous agents decide their patrol routes locally and sequentially according to the state of the system, as previously described, validating the CBLS distributed approach. Beyond the coordination which arises from the distributed communication of agents, it is also shown that the approach is scalable, robust to robot failures, i.e. fault-tolerant, and supports heterogeneous agents with different speed profiles.

Fig. 7 Robots used in the experiments with the CBLS algorithm at ISR

Experiments were performed in a large indoor scenario, namely floor 0 of the Institute of Systems and Robotics (ISR) at the University of Coimbra. Figure 6 shows the topological map extracted on top of the 67.85 \(\times \) 26.15 m environment, which was obtained using the algorithm in Portugal and Rocha (2013c). The resulting topology is a non-complete, connected and sparse graph, like most real world environments.

When conducting experiments in the real world, one must overcome noisy sensor readings, localization issues and even robot failures, which are usually ignored or not precisely modeled in simulation experiments. Therefore, a team of six Pioneer-3DX robots (Robotics 2006) was used, each equipped with a Hokuyo URG-04LX-UG01 laser with 5.6 m of maximum scanning range and a laptop on top, as seen in Fig. 7. Each laptop runs the ROS navigation stack using the Adaptive Monte Carlo Localization (AMCL) algorithm, as done previously in the Stage simulations (cf. Sect. 5.1), and is responsible for controlling the robot’s motion, which reaches speeds of up to 1 m/s. As for communication, a distributed publish/subscribe mechanism has been used, due to its built-in integration in ROS. Moreover, each robot runs its own ROS master node (roscore), and multimaster communication is provided by the wifi_commFootnote 6 package. This means that there is no central point of failure in the system. While communication over a mobile ad hoc network deployed by the robots themselves would be possible, in these experiments we leveraged the existing WiFi infrastructure in the building.
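
To illustrate the kind of message exchange enabled by this setup, the rospy sketch below broadcasts a robot’s goal decision and listens for teammates’ announcements. The topic name, message type and field layout are hypothetical and given only for illustration; the actual interface belongs to the patrolling software, with wifi_comm relaying topics between the individual ROS masters.

    #!/usr/bin/env python
    # Minimal sketch: each robot announces the vertex it is heading to and
    # listens to its teammates. Topic and message layout are illustrative only.
    import rospy
    from std_msgs.msg import Int16MultiArray

    def teammate_callback(msg):
        robot_id, current_goal, next_goal = msg.data
        rospy.loginfo("Robot %d intends to visit vertex %d, then %d",
                      robot_id, current_goal, next_goal)
        # here the local idleness estimates / teammate intentions would be updated

    if __name__ == "__main__":
        rospy.init_node("patrol_comm_sketch")
        pub = rospy.Publisher("/patrol/intentions", Int16MultiArray, queue_size=10)
        rospy.Subscriber("/patrol/intentions", Int16MultiArray, teammate_callback)
        rospy.sleep(1.0)                            # wait for connections
        pub.publish(Int16MultiArray(data=[1, 12, 25]))  # robot 1 -> vertex 12, then 25
        rospy.spin()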

We examine not only the average graph idleness along time, \(\overline{\mathcal {I}_{\mathcal {G}}}\), but also the median \(\widetilde{\mathcal {I}_{\mathcal {G}}}\), the standard deviation \(\sigma \), and the maximum average idleness of a vertex along time, \(\max (\overline{\mathcal {I}_{\mathcal {V}}})\). At the beginning of each test, the graph of the environment is loaded by every robot. A ROS node (i.e. a ROS application) is responsible for advertising the start of the mission and collecting results during the experiments. These results are examined in the next sections. Note that this “monitor” node neither centralizes the approach nor gives any feedback to the robots; it is merely used in this work to allow the extraction of experimental results.

Table 8 Experiments with 1 to 6 robots using CBLS (all values in seconds)
Table 9 Experiments with 1 to 6 robots using TSP cycle (all values in seconds)
Fig. 8 Robot positions during an experiment with 6 robots using CBLS. Implicit formation of dynamic regions

Fig. 9 Evolution of the absolute reward values along three experiments with different team sizes

Firstly, we conducted experiments with teams of one to six robots using the proposed strategy and compared it with the TSP Cyclic Strategy. After this comparative analysis, CBLS was tested by adding one, and then two, agents with a different speed profile to the patrolling mission, in order to show that the approach adapts well even when heterogeneous teams of robots are used. In addition, to demonstrate its robustness, experiments that included robot failures at different time instants were analyzed. Finally, we carried out additional simulation experiments under the same conditions as the experiments with physical robots, in order to evaluate how realistic the simulation framework used throughout this work is. Whenever CBLS is adopted in these experiments, \(\omega = {3}/{4}\) is considered. The total estimated distance traveled by the robots during the course of all experiments was 50 km (\(\simeq \) 31 miles).

5.2.1 Scalability and benchmarking

In this subsection, we focus on the scalability of the approach and analyze its performance when compared to the TSP Cyclic Strategy. As mentioned before, the latter is a recognized classical approach for multi-robot patrolling which finds the minimal tour that visits every vertex of the graph. This tour can be obtained using heuristic methods like the Chained Lin-Kernighan algorithm (Applegate et al. 2003). The approach is theoretically optimal for the single robot case; in the multi-robot case, robots are equally spaced along the tour and the approach provides a near optimal solution. Note also that the deterministic route is computed a priori and offline, in contrast to our approach, where agents have the autonomy to decide their own moves online, and that it assumes strictly homogeneous robots, without supporting different speed profiles or robot failures, as opposed to CBLS. In the particular case of the environment used in the experiments (cf. Fig. 6), the TSP cycle tour follows the clear rectangular cycle pattern that exists in the map, with short detours to the vertices with \(deg(v_i)=1\), namely \(v_i \in \{0, 1, 13, 16, 25, 29, 30\}\).
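
Since the navigation graph is not complete, such a tour is typically computed on its metric closure (all-pairs shortest path distances) before applying a TSP heuristic. The sketch below illustrates the general procedure with a much simpler nearest-neighbour plus 2-opt heuristic; it is not the Chained Lin-Kernighan solver used in this work, and is given only as an illustration.

    import itertools
    import networkx as nx

    def tsp_cycle(G, weight="weight"):
        """Approximate minimal closed tour visiting every vertex of graph G
        (illustration only; not the solver used in the paper)."""
        # metric closure: shortest path distance between every pair of vertices
        dist = dict(nx.all_pairs_dijkstra_path_length(G, weight=weight))
        nodes = list(G.nodes)
        # nearest-neighbour construction
        tour, remaining = [nodes[0]], set(nodes[1:])
        while remaining:
            nxt = min(remaining, key=lambda v: dist[tour[-1]][v])
            tour.append(nxt)
            remaining.remove(nxt)
        # simple 2-opt improvement
        def length(t):
            return sum(dist[t[i]][t[(i + 1) % len(t)]] for i in range(len(t)))
        improved = True
        while improved:
            improved = False
            for i, j in itertools.combinations(range(1, len(tour)), 2):
                cand = tour[:i] + tour[i:j][::-1] + tour[j:]
                if length(cand) < length(tour):
                    tour, improved = cand, True
        return tour

In practice, consecutive tour vertices that are not adjacent in the graph are connected through their shortest path, yielding the closed route that the robots then follow with uniform spacing.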

Using the team of Pioneer-3DX robots, experiments with team sizes from 1 to 6 robots were performed in the “ISR-Floor0” scenario. Each experiment was repeated 3 times and the results are presented in Tables 8 and 9.

An initial analysis shows that the performance values given by \(\overline{\mathcal {I}_{\mathcal {G}}}\) are generally close for both approaches in the configurations tested. As expected, in the single robot case the optimal TSP cycle outperforms CBLS (by \(\simeq 9.5\,\%\)). Nevertheless, considering that CBLS is running with a single learning agent limited to knowledge of its 2-step neighborhood, the result obtained is encouraging. Theoretically, a larger horizon would improve CBLS results in the particular single robot case: there is no other concurrent learner in the system and, consequently, no interference with the agent’s long term plans, enabling it to look further ahead without the risk of regretting its decisions.

Perhaps the most interesting aspect of these results is the performance attained by the proposed strategy with multi-robot teams. In the experiments conducted, the results obtained with CBLS were slightly better than TSP cycle, with differences in performance of up to \(\simeq 3.5\,\%\). However, one cannot claim that CBLS is superior to TSP for MRS without conducting more tests. In general, both strategies perform similarly in the collected results, with TSP even presenting slightly superior results, of around \(\simeq 2.5\,\%\), in the 4-robot case.

The tests with teams of a growing number of robots show that the approach is able to scale well, performing similarly to TSP cycle in a near optimal way. This is remarkable, considering the distributed and non-deterministic nature of the approach, as opposed to TSP cycle. A careful look at the behavior of the robots shows that they tend to create dynamic regions which each agent patrols more often. Robots have no physical boundaries to restrict them; however, the fact that our model takes the teammates’ decisions into consideration results in the robots avoiding the areas of their teammates. Thus, the patrolling strategy tends to implicitly create these regions, especially in sparsely connected environments, as clearly shown in Fig. 8. As a result, there is little interference between agents.Footnote 7 In addition, since robots only share their current and future immediate goals, the bandwidth requirements are negligible even with larger teams.

Table 10 Experiments with a team of 6 physical robots using CBLS extended with one and two virtual agents (all values in seconds)
Fig. 10 Overview of the results with teams of physical robots and mixed teams of real and virtual robots, using CBLS

Table 11 Experiments with a team of 6 physical robots with one and two failures during the mission (all values in seconds)

It is clear in Table 9 that, by making the robots follow the same global route, TSP cycle always presents lower values of standard deviation, \(\sigma \), than CBLS, promoting more uniform visits to vertices. This is also suggested by the maximum average idleness of the vertices, \(\max (\overline{\mathcal {I}_{\mathcal {V}}})\), whose values are smaller than in CBLS.

Another interesting aspect observed in the experiments is that the median value, \(\widetilde{\mathcal {I}_{\mathcal {G}}}\), is usually lower than the mean \(\overline{\mathcal {I}_{\mathcal {G}}}\) for CBLS, especially with greater team sizes, thus confirming the simulation results. In contrast, for TSP cycle, all \(\widetilde{\mathcal {I}_{\mathcal {G}}}\) values are higher than \(\overline{\mathcal {I}_{\mathcal {G}}}\), meaning that the distribution is negatively skewed and most of the values are above the average.

It is also worth mentioning another important disadvantage of TSP in general: the computation of a TSP cycle may become computationally intensive or even prohibitive in large graphs.

Looking more closely at CBLS in the experiments with physical robots, the absolute reward values, given by the \((1-\mathcal {H})\) factor in (16), show a descending trend along the experiments with different configurations. Figure 9 illustrates how these values evolve in missions with three different team sizes. Despite occasional peaks, the values tend to decrease with the number of decisions. This is because, in general, as the system progresses, the \(\overline{\mathcal {I}_\mathcal {V}}\) values of different vertices become more balanced and, as a consequence, the degrees of belief in moving to distinct neighbors come closer together. The closer the posterior probabilities are, the higher the entropy becomes, and therefore the reward values gradually decrease. The peaks observed correspond to situations where agents share nearby areas, temporarily perturbing the \(\overline{\mathcal {I}_\mathcal {V}}\) values in the neighborhood of other agents. For that reason, peaks are more noticeable in larger teams.
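
To make this trend more concrete, the sketch below computes the absolute reward under the assumption that \(\mathcal {H}\) in (16) is the Shannon entropy of the posterior move distribution over a robot’s neighboring vertices, normalized by its maximum; the normalization choice and the example probabilities are illustrative, not taken from the paper.

    import numpy as np

    def absolute_reward(posteriors):
        """Reward (1 - H) from the posterior probabilities of moving to each
        neighbouring vertex. H is assumed here to be the Shannon entropy
        normalized by log2(n), so the reward lies in [0, 1]."""
        p = np.asarray(posteriors, dtype=float)
        p = p / p.sum()
        n = len(p)
        nz = p[p > 0]                       # ignore zero-probability moves
        h = -np.sum(nz * np.log2(nz))       # Shannon entropy (bits)
        h_norm = h / np.log2(n) if n > 1 else 0.0
        return 1.0 - h_norm

    # Early in the mission, idleness values differ widely -> peaked posterior
    print(absolute_reward([0.8, 0.1, 0.1]))    # noticeably higher reward (about 0.42)
    # Later, idleness values become balanced -> near-uniform posterior
    print(absolute_reward([0.36, 0.33, 0.31])) # reward close to 0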

5.2.2 Heterogeneous teams

In the previous section, the number of robots R was limited by the physical robots available. However, the distributed patrolling method used supports an arbitrarily large team size. In this section, the aim is to further explore the scalability of the approach and test it with teams of heterogeneous robots.

Fig. 11 Evolution of the idleness along time using CBLS with teams of six physical robots with failures. a Failure at 300 s. b Failure at 150 s. c Failure at 150 and 250 s. d Without failure (for reference)

Being a distributed strategy composed of concurrent learning agents, the team should adapt not only to the system’s state but also to different robot profiles. Hence, virtual agents running in the Stage simulator were added to the physical team, resulting in a mixed, interacting team of real and simulated robots which communicate seamlessly. It is noteworthy that adding virtual simulated agents to the physical team of robots was only made possible by the hardware abstraction layer of ROS and its modular structure.

Virtual agents have the same properties as those used in the simulation experiments; thus, they travel slower than the physical robots. Three trials were conducted with a total of 7 agents, composed of 6 physical robots and 1 simulated robot, and three more trials were performed with a team size of 8, composed of 6 physical robots and 2 simulated ones. Similarly to Iocchi et al. (2011), the software layer is used unchanged both on the real robots and in simulation.

Results in Table 10 show that the overall values of \(\overline{\mathcal {I}_{\mathcal {G}}}\), \(\max (\overline{\mathcal {I}_{\mathcal {V}}})\), \(\widetilde{\mathcal {I}_{\mathcal {G}}}\) and \(\sigma \) are in line with expectations, following the trend shown in Table 8. The additional virtual agents are integrated into the rest of the team and are able to interact with their teammates. In addition, their contribution to the global performance is minor, as expected. This happens not only because of the progressive decrease of the individual contribution of each robot as team size grows, but also due to the lower speed at which these robots travel. The boxplot of Fig. 10 illustrates this trend.

It is noteworthy that incorporating robots that travel at different speeds into strategies that solve the MRPP with predefined routes, such as TSP cycle, would not be suitable, because maintaining a uniform distance between robots would not be possible unless all robots were limited to the speed of the slowest one.

5.2.3 Robustness/fault-tolerance

One of the main advantages of providing the patrol robots with the means to decide their own moves in the environment is the absence of a centralized coordinator, which would represent a critical point of failure. A distributed autonomous robotic system, such as the one presented herein, enables redundancy, remaining functional even if some of the agents fail.

To demonstrate the robustness of the approach, three experiments using the Pioneer 3-DX robots were planned. In these experiments, a robot is shut down at different instants of time, aiming to study the effect of failures on the overall performance, as well as how the system evolves.

In all three tests, the team starts with six robots. In the first test, a robot is shut down 300 s into the experiment. Similarly, in the second experiment, a robot is shut down after 150 s. Finally, in the third experiment, one robot is shut down after 150 s and a second robot is shut down after 250 s. The other robots assume that a teammate has failed when no message has been received from it within 2 min.
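
A failure detector of this kind can be realized with a simple watchdog over the last time a message was received from each teammate, using the 2-minute threshold mentioned above. The sketch below is an illustration only, with hypothetical names, not the actual implementation.

    import time

    FAILURE_TIMEOUT = 120.0   # seconds of silence before a teammate is presumed failed

    class TeammateWatchdog:
        def __init__(self, teammate_ids):
            now = time.time()
            self.last_heard = {r: now for r in teammate_ids}
            self.failed = set()

        def on_message(self, robot_id):
            """Call whenever any message from 'robot_id' is received."""
            self.last_heard[robot_id] = time.time()
            self.failed.discard(robot_id)   # illustrative: a silent robot may reappear

        def check(self):
            """Call periodically: mark teammates as failed after FAILURE_TIMEOUT."""
            now = time.time()
            for r, t in self.last_heard.items():
                if now - t > FAILURE_TIMEOUT:
                    self.failed.add(r)
                    # from now on, r's intentions are ignored in the decision process
            return self.failed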

Table 11 indicates that the results obtained in the first two experiments resemble those obtained with five robots. In these experiments, the average idleness values converged after the failure occurred. Figure 11 shows the evolution of the average vertex idleness in the three experiments. For reference, the evolution of the average vertex idleness without failure was also added (cf. Fig. 11d). In the first experiment, since approximately half of the mission was spent with six robots, the final performance was slightly better and the overall mission time \(\tau \) was shorter than in the second experiment, i.e. it took longer in the second test to fulfill the mission because the failure occurred earlier.

Table 12 Additional simulation experiments with 1 to 6 robots using CBLS in the “ISR-Floor0” environment (all values in seconds)

The two failures in the third experiment greatly influenced the final performance results in that test. However, it is also clear that extra time would be needed for the results to converge to higher values, as the final average graph idleness \(\overline{\mathcal {I}_{\mathcal {G}}}\) was much lower than in the 4-robot case. Nevertheless, one can verify that, in all three cases, the values of \(\overline{\mathcal {I}_{\mathcal {G}}}\) and \(\widetilde{\mathcal {I}_{\mathcal {G}}}\) tend to increase a while after the failure occurs.

The evidence drawn from these results shows the robustness of the system, demonstrating that it degrades gracefully as long as at least one robot remains operational.

Discussing now the hypothetical event of communication failures: when a message is not received by a robot, it does not update the instantaneous idleness values and, consequently, it maintains incomplete information about the state of the system. This information becomes increasingly incomplete as the number of undelivered messages grows. Additionally, when robots are close to each other and messages are not received, they may decide to patrol nearby locations and interfere with their teammates’ plans. Resolving such situations depends heavily on each robot’s local planner and its ability to avoid dynamic obstacles; in our work, this is handled by the ROS navigation stack. Therefore, we expect CBLS to be robust to communication failures and to only slightly degrade its performance when the communication error rate is moderate.
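
The effect of missed messages can be seen directly in how each robot maintains its local view of the vertex idleness: entries are only refreshed when a visit announcement arrives (or when the robot itself visits a vertex), so undelivered messages simply leave stale estimates. The sketch below illustrates such a local table with hypothetical names; it is not the actual data structure used in the implementation.

    import time

    class LocalIdlenessTable:
        """Each robot's local estimate of when every vertex was last visited.
        Lost messages leave the corresponding entries stale, i.e. the robot's
        information about the state of the system becomes incomplete."""

        def __init__(self, vertices):
            t0 = time.time()
            self.last_visit = {v: t0 for v in vertices}

        def on_visit_message(self, vertex, visit_time):
            # keep the most recent visit time known for this vertex
            self.last_visit[vertex] = max(self.last_visit[vertex], visit_time)

        def instantaneous_idleness(self, vertex):
            return time.time() - self.last_visit[vertex]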

If the robots cannot communicate at all, the performance of the algorithm will degrade severely. This will occur especially in larger teams, which are much more affected by the lack of coordination in the multi-robot system, as robots will constantly interfere with one another. In this case, robots act greedily, do not share their intentions and often compete for the same vertices. Performance is expected to degrade gracefully if only communication with neighbors is allowed. In this situation, robots are able to coordinate by not competing for the same goals and not interfering with teammates; despite that, they have no contact with agents that are further away and, as a consequence, they quite often make uninformed decisions. We have conducted a similar analysis in our past work (Portugal and Rocha 2013b), where we employed an identical distributed communication protocol, despite using a different patrol strategy. Therein, we were able to verify the robustness of the multi-robot patrolling algorithm to communication failures.

Fig. 12 A simulation in the “ISR-Floor0” environment with a team of 6 robots

5.2.4 Additional simulations in the environment used in the real world experiments

In this section, we discuss additional simulations that employ the CBLS approach described in this paper, with varying team sizes, in the same map used for the real world experiments. Our goal is to assess the realism of the simulator when using the same software layer that was used in the multi-robot experiments.

Similarly to Sect. 5.2.1, we have deployed teams of robots with \(R=\{1,2,3,4,5,6\}\) in the “ISR-Floor0” environment, under the same conditions described therein. However, this time we have used the Stage multi-robot simulator instead of the real robot hardware. Once again, we have run three trials for each configuration. The results obtained are depicted in Table 12.

Comparing Table 12 with Table 8, it can be seen that the results drawn from the simulation tests are very close to those obtained with the real robots. This resemblance translates into an average difference of 1.55 % in the \(\overline{\mathcal {I}_{\mathcal {G}}}\) values, 5.49 % in the \(\max (\overline{\mathcal {I}_{\mathcal {V}}})\) values, 2.75 % in the \(\widetilde{\mathcal {I}_{\mathcal {G}}}\) values, and 4.05 % in the \(\sigma \) values. These differences are remarkably low, which suggests that such simulations can be used with a high degree of confidence to estimate the actual results with physical robots.

Nevertheless, the results extracted from the additional simulations tend to be marginally lower than those obtained with the real robots, and the variance between trials within the same configuration is smaller. Thus, it is our belief that multi-robot simulations in Stage provide an accurate yet slightly optimistic approximation of real world results; the lower variance is possibly linked to real phenomena that are not fully modeled in simulation, such as wheel slip, robot assembly properties, or delays in processing sensor data and producing actuator commands.

Figure 12 shows a snapshot of a simulation with 6 virtual Pioneer robots in the environment. The results reported demonstrate how realistic such simulations can be in the realm of multi-robot applications such as patrolling.

6 Conclusions

In this work, cooperative multi-agent learning has been addressed in order to solve the patrolling problem in a distributed way. Each robot decides its local patrolling moves online, without requiring any central planner. Decision-making is based upon Bayesian reasoning on the state of the system, considering the history of visits and teammates’ actions, so as to promote effective coordination of the behavior of the patrolling agents. Concurrent reward-based learning has been adopted given that, in this domain, decomposing the problem reduces the complexity of the general mission by distributing the computational load among the independent learners.

Experimental results have shown that the method is able to effectively tackle the problem, since it can deal with uncertainty and actions are selected not only according to prior knowledge about the problem, but also to the state of the system at the time, resulting in adaptive, effective and distributed cooperative patrolling. Moreover, it was shown that our Bayesian update method can be applied in different types of environments, independently of their topology, withstands failures of robotic patrol units, and achieves superior performance. In fact, when compared with other state of the art approaches through simulations, CBLS generally outperforms them independently of team size, and in the real world experiments conducted, the approach was able to obtain near optimal results, which is particularly remarkable given the limited search space considered and the distributed nature of the approach. This demonstrates the potential of the proposed multi-robot patrolling strategy for real world applications.

In the future, it would be interesting to relax the assumption of perfect communication, testing the performance of CBLS under communication failures or allowing only local interactions between robots within a certain range. We would expect results similar to those reported in Portugal and Rocha (2013b), which showed that distributed patrolling approaches are robust to communication failures, that performance degrades as the failure ratio increases, and that a limited communication range still allows the team to coordinate and avoid competing for the same vertices of the graph, albeit with incomplete information about the state of the system. Finally, addressing the issue of scalability, we intend to work on estimation methods for dimensioning the size of the robot team for a given patrolling task, according to the environment topology and temporal constraints.