1 Introduction

Supervised learning is one of the most central tasks in machine learning and pattern recognition. However, classification becomes intrinsically challenging whenever the data to be classified are not easily separable in the feature space. A myriad of classification algorithms, with a variety of behaviors and limitations, have been proposed in the literature [1,2,3]; examples include neural networks, SVM and decision trees.

A broad class of classification algorithms, such as SVM and the perceptron, relies upon defining a mathematical function with weights that can efficiently separate two or more classes of data. The weights are unknown and learned from the training data. These functions are either linear, polynomial or, for more complex patterns, kernels, which are equivalent to mapping the data to a higher-dimensional space where the classes are separable by a hyperplane.

However, the main difficulty is choosing the nature of the function or kernel. Often, the “best” hyperplane, or line in two dimensions, that separates the classes does not follow the mathematical properties of a function. The “best” separator can, for example, be a polygon encircling certain data points, which is not a function and therefore cannot straightforwardly be output by SVM or similar classifiers. The accuracy of SVM depends on the right choice of the kernel function, which is not an easy task given the unlimited number of available kernels.

Figure 1 shows an example of labeled data where it is not possible to perfectly separate the data with one function, simply because any line separating the data perfectly will have multiple y-values for some of the x-values, which defies the definition of a mathematical function. SVM deals with this by projecting the data into a high-dimensional space using the “kernel trick”, where the data become easily separable.

This paper introduces PolyLA, a novel classification scheme operating in two dimensionsFootnote 1 that uses LA and does not require a “kernel trick” when the data are not easily separable. As in [4], PolyLA approaches the classification problem in a completely different manner from existing classifiers. Instead of relying upon mathematical functions to separate the classes, PolyLA surrounds the classes with polygons guided by a reinforced random walk and ray casting. Some of the best-known classification techniques, such as the support vector machine (SVM) and perceptron-based classifiers, rely upon constructing mathematical functions with weights that efficiently separate two or more classes of data in the feature space. In two-dimensional spaces, the separation boundary might be nonlinear, and thus the decision boundaries might be complex. SVM deals with this situation by projecting the data onto a higher-dimensional space using a kernel trick, which provides a separator not limited to a linear or polynomial function. The adoption of a kernel is equivalent to transposing the data to many dimensions, but the accuracy depends on the right choice of the kernel function as well as on several other parameters, a choice usually performed through manual trial and error. The presented approach instead deals with classification problems in a two-dimensional Euclidean feature space by building a “separator” with many-sided polygons. The polygons are extrapolated from reinforced random walks with a preference toward encapsulating all items from one class and excluding from the encapsulation any items from the other classes. In this manner, the emerging polygons encapsulate each class in such a way that they can be used as classifiers. The classification takes place by resorting to ray casting of unknown items so as to identify whether an item is contained in the polygon; each item is labeled depending on whether or not it is inside the polygon.

Fig. 1 Example of a simple two-class classification scenario with the classes blue (\(T_1\)) and red (\(T_2\)) (color figure online)

1.1 Outline

The paper is organized as follows. Section 2 introduces the problem that we attempt to solve. Section 3 gives a brief introduction to the theory of LA, which is fundamental for our approach, named PolyLA. Section 4 reviews the relevant state of the art in the area of classifiers as well as related LA-based classifiers. Section 5 introduces our solution, PolyLA, as a method for creating polygons for classification with two classes, together with corresponding results. Section 6 presents empirical results for PolyLA and compares it with a comparable algorithm, namely SVM. Finally, in Sect. 7, we draw final conclusions and give insights into future work.

2 Problem formulation

Classification of unknown items based on labeled data is a supervised learning problem. In line with common practice, the problem is divided into two phases, namely (1) training and (2) classification:

  1.

    Training phase: The aim of this phase is to create polygons that encircle classes of items so that the polygons separate the training classes from each other.

  2.

    Classification phase: In this phase, we use the polygons as a basis to determine which class a new unknown item to be classified belongs to. This is achieved by finding which polygon(s) it is part of.

Further, this paper presents two distinct variants of PolyLA:

  • LA polygon classification for two-class classification problems.

  • LA polygon classification for multi-class classification problems.

2.1 The training phase

The training phase can be formulated as a combinatorial optimization problem. The training data, T, consist of multiple classes. The data are mapped to a two-dimensional Euclidean space as follows. A grid-like bidirectional planar graph G(V, E) is created, with vertices \(i \in V\) and edges \((i,j) \in E\), where \(i,j \in V\). All vertices have x- and y-coordinates and corresponding edges, so that an edge (i, j) represents the possibility of moving from vertex i to j. The vertices in the graph are defined so that the first vertex, 1, always has lower x- and y-values than all the training data. Similarly, the last vertex, N, has x- and y-values larger than those of the training data. Hence, all the training data \(t_i \in T\) lie somewhere between vertices 1 and N, \(1< t_i < N\ \forall t_i \in T\).
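
For concreteness, this grid construction can be sketched as follows in Python (a minimal sketch; the function and parameter names, including the margin, are ours and not part of the paper):

```python
import numpy as np

def build_grid(points, granularity=10, margin=1.0):
    """Build a granularity x granularity grid enclosing all training
    points, with bidirectional edges between 4-neighbors."""
    pts = np.asarray(points, dtype=float)
    lo = pts.min(axis=0) - margin   # vertex 1: below/left of all data
    hi = pts.max(axis=0) + margin   # vertex N: above/right of all data
    xs = np.linspace(lo[0], hi[0], granularity)
    ys = np.linspace(lo[1], hi[1], granularity)
    vertices = [(x, y) for y in ys for x in xs]
    edges = set()
    for r in range(granularity):
        for c in range(granularity):
            i = r * granularity + c
            if c + 1 < granularity:            # horizontal neighbor
                edges |= {(i, i + 1), (i + 1, i)}
            if r + 1 < granularity:            # vertical neighbor
                edges |= {(i, i + granularity), (i + granularity, i)}
    return vertices, edges
```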

2.1.1 Two-class classification problem

An example is shown in Fig. 1. In this example, T consists of 19 items, 9 in the blue class \(T_1\) and 10 in the red class \(T_2\). The grid G(VE) is created so that all items are located in the grid.

To sum up, the main purpose of the training phase is to find a polygon, s, that encircles and separates the training classes. Using the example from Fig. 1, the task is to find an s that encircles the first training class \(T_1\), but not \(T_2\), i.e., a polygon that separates \(T_1\) well from \(T_2\).

A polygon s is therefore a list of vertices and edges such that the first vertex in s equals the last vertex in s, and consecutive vertices are connected by corresponding edges. With two classes, a single polygon suffices to perfectly separate the data.

Whether a training element \(t_i\) is inside a polygon \(s \in \mathbf {S}\) is defined formally as:

$$\begin{aligned} h(t_i,s) = {\left\{ \begin{array}{ll} 1 &{}\quad \hbox {if} \; t_i\; \hbox {is inside of} \; s\\ 0 &{}\quad \hbox {otherwise} \end{array}\right. } \end{aligned}$$
(1)

Ideally, all items in class one (e.g., \(T_1\)) should be within the polygon, while all items in the other classes should fall outside it. Any item \(t_i\) from class \(T_1\) that is correctly within the polygon s will yield \(h(t_i,s)=1\), and, similarly, any item \(t_j\) not part of class \(T_1\) that is correctly outside of the polygon s will yield \(1-h(t_j,s)=1\). For all other items, h(., s) will give 0. Further, let f(s) be a function that combines \(h(t_i,s)\) for all \(t_i \in T\) so that an ideal polygon, which encapsulates all items in \(T_1\) and no other items, yields \(f(s)=1\). A maximally incorrect polygon, which encapsulates all items of \(T_2\) and none of those in \(T_1\), yields \(f(s)=0\).Footnote 2

The overall aim of the training phase can therefore be stated as finding a polygon \(s*\) for each class, consisting of vertices and edges, that maximizes \(f(s*)\). Thus, formally, we aim to find an \(s* \in \mathbf {S}\) so that \(f(s*)\ge f(s)\ \forall s \in \mathbf {S}\), using an LA-based random walk on the grid as explained in Sect. 3.

2.1.2 Multi-class classification problem

In the case of classification with more than two classes, one polygon is not sufficient to separate all classes. As an example, suppose there are three classes: \(T_1\), \(T_2\), and \(T_3\). In simple terms, we need classifiers that identify whether an item belongs to \(T_1\), to \(T_2\), or to \(T_3\). This is done by finding one polygon that separates \(T_1\) from the rest, and so on.

The output from the training phase is therefore a list of classifiers rather than one single \(s*\). Following the same example with three classes, we have one classifier that decides whether an item is part of \(T_1\), \(s*_{T_1}\), and one that decides whether an item is part of \(T_2\), \(s*_{T_2}\). If it is neither part of \(T_1\) nor of \(T_2\), it naturally belongs to \(T_3\). Hence, the number of classifiers is one less than the number of classes.

For N classes, we get the following \(N-1\) classifiers:

$$\begin{aligned} \mathbf {s*_{all}}=\{s*_{T_1},s*_{T_2},\dots ,s*_{T_{N-1}}\} \end{aligned}$$
(2)

2.2 The classification phase

The classification phase resorts to the polygons from the training phase. The classification task is to find which class a new item with unknown label, \(t_k\), belongs to.

Since the training phase produces one polygon, \(s*\), the problem is reduced to simply determining whether a new item lies within or outside \(s*\). The problem can be stated as follows: given the polygon \(s*\) and a new item with unknown label, \(t_k\), which class does \(t_k\) belong to? Using the membership function from Eq. 1, given two classes \(T_1\) and \(T_2\) and the polygon \(s*\), we can define the following decision rules:

$$\begin{aligned} \begin{array}{ll} t_k \; \hbox {is of class} \; T_1 &{}\hbox {if}\; h(t_k,s*)=1 \\ t_k \; \hbox {is of class} \; T_2 &{} \hbox {if} \; h(t_k,s*)=0 \end{array} \end{aligned}$$
(3)

2.2.1 Multi-class classification problem

The classification phase uses the set of polygons, \(\mathbf {s*_{all}}\) (see Eq. 2), from the training phase. The task is to classify an unlabeled item \(t_k\). The following decision rules are used in the case of multi-class classification:

$$\begin{aligned} \begin{array}{ll} t_k \; \hbox {is of class} \; T_1 &{} \quad \hbox {if} \; h(t_k,s*_{T_1})=1\\ t_k \; \hbox {is of class}\; T_2 &{} \quad \hbox {if} \; h(t_k,s*_{T_2})=1\\ \dots &{} \quad .\\ t_k\; \hbox {is of class} \; T_{N-1} &{} \quad \hbox {if}\; h(t_k,s*_{T_{N-1}})=1\\ t_k\; \hbox {is of class}\; T_{N} &{} \quad \hbox {otherwise} \end{array} . \end{aligned}$$
(4)

In simple terms, the above classification rules state that if the item to be classified is part of the first polygon \(s*_{T_1}\), it is assigned the label corresponding to that polygon, \(T_1\). Otherwise, if it is part of \(s*_{T_2}\), it is classified as \(T_2\), and so on. If the item is not part of any of the polygons of the first \(N-1\) classes, it is labeled as the class \(T_N\), as sketched below.
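
These rules translate directly into code. The sketch below assumes a point-in-polygon test such as the ray-casting routine of Sect. 5.3; the function names are ours:

```python
def classify(t_k, polygons):
    """polygons: the N-1 trained class polygons [s_T1, ..., s_T(N-1)],
    each a list of (x, y) vertices. Returns the 0-based class index."""
    for label, polygon in enumerate(polygons):
        if point_in_polygon(t_k, polygon):   # h(t_k, s_Ti) == 1
            return label                     # first enclosing polygon wins
    return len(polygons)                     # inside none: class T_N
```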

2.3 Multi-dimensional classification

It is possible to extend PolyLA to support multiple features by splitting a multi-feature classification problem into several two-dimensional sub-problems which are trained independently. The overall classification is a combination of the results from all sub-problems through a majority voting scheme. More precisely, the overall class prediction is derived by taking the most common class prediction from all the sub-problems, as illustrated in Fig. 2 using the majority vote rule.

Fig. 2 Overview of training and classification for PolyLA with several features

In this sense, PolyLA constructs solutions in all the planes that the data set consists of and handles each plane individually. The number of possible planes depends on the number of features in the data set. For example, a three-dimensional feature space with axes x, y and z has three planes: xy, xz and yz (see Fig. 2). More generally, the number of planes for an n-dimensional feature space is simply the number of dimension pairs, given by \(\left( {\begin{array}{c}n\\ 2\end{array}}\right)\). Inevitably, the number of planes explodes as the number of features increases. However, feature selection and reduction methods could be used to deal with this problem.
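
A sketch of this decomposition and the majority vote follows (the `predict` interface is our assumption, not an API from the paper):

```python
from collections import Counter
from itertools import combinations

def feature_planes(n_features):
    """All C(n, 2) two-dimensional planes of an n-feature space."""
    return list(combinations(range(n_features), 2))

def classify_multifeature(item, plane_classifiers):
    """plane_classifiers maps a feature pair (i, j) to a PolyLA
    classifier trained on that plane; the overall label is the
    majority vote over all per-plane predictions."""
    votes = [clf.predict((item[i], item[j]))
             for (i, j), clf in plane_classifiers.items()]
    return Counter(votes).most_common(1)[0][0]
```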

3 Learning Automata

The fundamental tool which we shall use in most of our research involves Learning Automata (LA). LA have been used in systems that have incomplete knowledge about the Environment in which they operate [5,6,7,8,9,10,11]. The learning mechanism attempts to learn from a stochastic Teacher which models the Environment. In his pioneering work, Tsetlin [12] attempted to use LA to model biological learning. In general, a random action is selected based on a probability vector, and these action probabilities are updated based on the observation of the Environment’s response, after which the procedure is repeated.

The term “Learning Automata” was first publicized and rendered popular in the survey paper by Narendra and Thathachar. The goal of LA is to “determine the optimal action out of a set of allowable actions” [5].

With regard to applications, the entire field of LA and stochastic learning has had a myriad of applications [6,7,8, 10, 11], which (apart from the many applications listed in these books) include solutions for problems in network and communications [13,14,15,16], network call admission, traffic control and quality-of-service routing [17,18,19], distributed scheduling [20], training hidden Markov models [21], neural network adaptation [22], intelligent vehicle control [23] and even fairly theoretical problems such as graph partitioning [24]. Besides these fairly generic applications, with a little insight, LA can be used to assist in solving (by, indeed, learning the associated parameters) the stochastic resonance problem [25], the stochastic sampling problem in computer graphics [26], the problem of determining roads in aerial images by using geometric-stochastic models [27] and various location problems [28]. Similar learning solutions can also be used to analyze the stochastic properties of the random waypoint mobility model in wireless communication networks [29], to achieve spatial point pattern analysis codes for GISs [30], to digitally simulate wind field velocities [31], to interrogate the experimental measurements of global dynamics in magneto-mechanical oscillators [32], and to analyze spatial point patterns [33]. LA-based schemes have already been utilized to learn the best parameters for neural networks [22], optimizing QoS routing [19], and bus arbitration [14]—to mention a few other applications.

In the field of Automata Theory, an automaton [6,7,8, 10, 11] is defined as a quintuple composed of a set of states, a set of outputs or actions, an input, a function that maps the current state and input to the next state, and a function that maps a current state (and input) into the current output.

Definition 1

A LA is defined by a quintuple \(\langle {A, B, Q, F(.,.), G(.)} \rangle\), where:

  1.

    \(A=\{ {\alpha }_1, {\alpha }_2, \ldots , {\alpha }_r\}\) is the set of outputs or actions that the LA must choose from, and \({\alpha }(t)\) is the action chosen by the automaton at any instant t.

  2.

    \(B = \{{\beta }_1, {\beta }_2, \ldots , {\beta }_m\}\) is the set of inputs to the automaton. \({\beta }(t)\) is the input at any instant t. The set B can be finite or infinite. The most common LA input is \(B = \{0, 1\}\), where \(\beta = 0\) represents reward, and \(\beta = 1\) represents penalty.

  3.

    \(Q=\{q_1, q_2, \ldots , q_s\}\) is the set of finite states, where Q(t) denotes the state of the automaton at any instant t.

  4.

    \(F(.,.): Q \times B \mapsto Q\) is a mapping in terms of the state and input at the instant t, such that, \(q(t+1)=F[q(t), {\beta }(t)]\). It is called a transition function, i.e., a function that determines the state of the automaton at any subsequent time instant \(t+1\). This mapping can either be deterministic or stochastic.

  5.

    G(.): is a mapping \(G: Q \mapsto A\), and is called the output function. G determines the action taken by the automaton if it is in a given state as: \({\alpha }(t)=G[q(t)]\). With no loss of generality, G is deterministic.

If the sets Q, B and A are all finite, the automaton is said to be finite.

The Environment, E, typically refers to the medium in which the automaton functions. The Environment comprises all the external factors that affect the actions of the automaton. Mathematically, an Environment can be abstracted by a triple \(\langle A, C, B \rangle\), whose components are defined as follows:

  1.

    \(A=\{{\alpha }_1, {\alpha }_2, \ldots , {\alpha }_r\}\) is the set of actions.

  2.

    \(B = \{{\beta }_1, {\beta }_2, \ldots , {\beta }_m\}\) is the output set of the Environment. Again, we consider the case when \(m = 2\), i.e., with \(\beta = 0\) representing a “Reward”, and \(\beta = 1\) representing a “Penalty”.

  3.

    \(C=\{c_1 , c_2 ,\ldots , c_r\}\) is a set of penalty probabilities, where element \(c_i \in C\) corresponds to an input action \({\alpha }_i\).

The process of learning is based on a learning loop involving two entities: the random environment (RE) and the LA, as described in Fig. 3. In the process of learning, the LA continuously interacts with the Environment and processes the responses to its various actions (i.e., its choices). Finally, through sufficient interactions, the LA attempts to learn the optimal action offered by the RE. The actual process of learning is represented as a set of interactions between the RE and the LA.

Fig. 3 Feedback loop of LA

The automaton is offered a set of actions, and it is constrained to choose one of them. When an action is chosen, the Environment gives out a response \(\beta (t)\) at a time “t”. The automaton is either penalized or rewarded with an unknown probability \(c_i\) or \(1-c_i\), respectively. On the basis of the response \(\beta (t)\), the state of the automaton q(t) is updated and a new action is chosen at \((t+1)\). The penalty probability \(c_i\) satisfies:

$$\begin{aligned} c_i = Pr[\beta (t)=1 \mid \alpha (t)=\alpha _i] \quad (i=1,2,\dots , r). \end{aligned}$$

We now provide a few important definitions used in the field. P(t) is referred to as the action probability vector, where \(P(t)=[p_1(t),p_2(t), \ldots , p_r(t)]^T\), in which each element of the vector satisfies:

$$\begin{aligned} p_i(t) = Pr[{\alpha }(t)={\alpha }_i],\ i=1, \ldots ,r,\ \text{such that}\ \sum \limits _{i=1}^{r} p_i(t)=1\ \ \forall t. \end{aligned}$$
(5)

Given an action probability vector, P(t) at time t, the average penalty is:

$$\begin{aligned} M(t) = E[\beta (t)|P(t)] = Pr[\beta (t)=1|P(t)] = \sum \limits _{i=1}^{r} Pr[\beta (t)=1|\alpha (t)={\alpha }_i]\, Pr[\alpha (t)={\alpha }_i] = \sum \limits _{i=1}^{r} c_i p_i(t). \end{aligned}$$
(6)

The average penalty for the “pure-chance” automaton is given by:

$$\begin{aligned} M_0 = {1 \over r} \sum \limits _{i=1}^{r} c_i. \end{aligned}$$
(7)
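
Equations 6 and 7 are straightforward to compute; a minimal sketch (our code, not from the paper):

```python
def average_penalty(p, c):
    """Eq. 6: M(t) = sum_i c_i * p_i(t)."""
    return sum(ci * pi for ci, pi in zip(c, p))

def pure_chance_penalty(c):
    """Eq. 7: M_0, the average penalty when all r actions are
    chosen with equal probability 1/r."""
    return sum(c) / len(c)
```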

As \(t \rightarrow \infty\), if the average penalty \(M(t) < M_0\), at least asymptotically, the automaton is generally considered to be better than the pure-chance automaton. E[M(t)] is given by:

$$\begin{aligned} E[M(t)] = E\{E[\beta (t)|P(t)]\}=E[\beta (t)]. \end{aligned}$$
(8)

A LA that performs better than pure chance is said to be expedient.

Definition 2

A LA is considered expedient if:

$$\begin{aligned} \lim _{t \rightarrow \infty } E[M(t)] < M_0. \end{aligned}$$

Definition 3

A LA is said to be absolutely expedient if \(E[M(t+1)|P(t)] < M(t),\) implying that \(E[M(t+1)] < E[M(t)]\).

Definition 4

A LA is considered optimal if \(\lim _{t \rightarrow \infty } E[M(t)]=c_l,\) where \(c_l=\min _i\{c_i\}\).

It should be noted that no optimal LA exist. Marginally sub-optimal performance, termed \(\epsilon\)-optimal performance and defined below, is what LA researchers attempt to attain.

Definition 5

A LA is considered \(\epsilon\)-optimal if:

$$\begin{aligned} \lim _{t\rightarrow \infty } E[M(t)] < c_l + \epsilon , \end{aligned}$$
(9)

where \(\epsilon > 0\), and can be arbitrarily small, by a suitable choice of some parameter of the LA.

3.1 Classification of Learning Automata

3.1.1 Deterministic Learning Automata

An automaton is termed as a deterministic automaton, if both the transition function F(., .) and the output function G(.) are deterministic. Thus, in a deterministic automaton, the subsequent state and action can be uniquely specified, provided the present state and input are given.

3.1.2 Stochastic Learning Automata

If, however, either the transition function F(., .) or the output function G(.) is stochastic, the automaton is termed a stochastic automaton. In such an automaton, if the current state and input are specified, the subsequent states and actions cannot be specified uniquely. In such a case, F(., .) only provides the probabilities of reaching the various states from a given state.

In the first LA designs, the transition and the output functions were time invariant, and for this reason these LA were considered “Fixed Structure Stochastic Automata” (FSSA). The Tsetlin, Krylov, and Krinsky automata [12] are notable examples of this type.

Later, Vorontsova and Varshavskii introduced a class of stochastic automata known in the literature as Variable Structure Stochastic Automata (VSSA). In the definition of a VSSA, the LA are completely defined by a set of actions (one of which is the output of the automaton), a set of inputs (which is usually the response of the Environment) and a learning algorithm, T. The learning algorithm [8] operates on a vector (called the Action Probability vector).

Note that the algorithm T : [0,1]\(^R\)\(\times\) A \(\times\) B \(\rightarrow\) [0,1]\(^R\) is an updating scheme where A = {\(\alpha _{1}\), \(\alpha _{2}\), ..., \(\alpha _{R}\)}, 2 \(\le\) R < \(\infty\), is the set of output actions of the automaton, and B is the set of responses from the Environment. Thus, the updating is such that

$$\begin{aligned} P(t+1) = T(P(t), \alpha (t), \beta (t)), \end{aligned}$$

where P(t) is the action probability vector, \(\alpha (t)\) is the action chosen at time t, and \(\beta (t)\) is the response it has obtained.

If the mapping T is chosen in such a manner that the Markov process has absorbing states, the algorithm is referred to as an absorbing algorithm. Many families of VSSA that possess absorbing barriers have been reported [8]. Ergodic VSSA have also been investigated [8, 34]. These VSSA converge in distribution, and thus the asymptotic distribution of the action probability vector has a value that is independent of the corresponding initial vector. While ergodic VSSA are suitable for non-stationary environments, absorbing VSSA are preferred in stationary environments.
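
To make the VSSA updating scheme concrete, here is a minimal sketch of the classical linear reward-inaction (\(L_{RI}\)) rule, an absorbing scheme of the kind discussed above (the code is ours, not from the paper):

```python
import random

def choose_action(p):
    """Sample an action index according to the probability vector p."""
    return random.choices(range(len(p)), weights=p)[0]

def lri_update(p, chosen, beta, lam=0.05):
    """Linear reward-inaction: on reward (beta == 0), move probability
    mass toward the chosen action; on penalty (beta == 1), do nothing.
    The update preserves sum(p) == 1."""
    if beta == 0:
        for j in range(len(p)):
            target = 1.0 if j == chosen else 0.0
            p[j] += lam * (target - p[j])
    return p
```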

4 Related work

4.1 Distributed LA on a graph

Misra and Oommen pioneered the concept of LA on a graph using pursuit LA [13, 35, 36] for solving the stochastic shortest path problem. Li [37] used a type of S-model LA [38] to find the shortest path in a graph. Beigy and Meybodi [39] provided the first proof in the literature of the convergence of distributed LA on a graph for a reward-inaction LA. For applications of distributed LA on a graph in the field of computer communications, we refer the reader to the work of Torkestani and his collaborators [40,41,42].

4.2 LA for classification and function optimization

In order to put our work in the right perspective, we will briefly discuss different classification schemes relevant to this work from the field of LA theory.

In general terms, the distinguishing characteristic of LA-based learning is that the search for the optimizing parameter vector is conducted in the space of probability distributions defined over the parameter space, rather than in the parameter space itself [43]. In machine learning, the most common method for building a classifier is to conduct a search over the parameter space using optimization techniques such as gradient descent, while the common and recurrent theme reported in the literature when building a classifier based on LA is to work in a probability space rather than a parameter space. The main advantage of working in a probability space is better resilience to noise. This resilience was demonstrated in [43], where the true label of each data point in the training data is noisy in the sense that it is revealed by an Oracle according to a faulty model. It was demonstrated that, in some cases, LA perform better than other classical classification algorithms such as feedforward neural networks, even with a discretized parameter space and thus a limited number of possible parameters, which might reduce the accuracy of the scheme [43]. It is worth mentioning that the Continuous Action LA (CALA), in contrast to classical LA, does not discretize the parameter space but rather operates on a continuous parameter space, where the choices of the parameter are drawn from a time-varying sampling distribution that is adjusted based on ideas borrowed from the field of reinforcement learning [44].

In [44], another structure of LA algorithms used for classification is presented, which possesses a multi-layer representation similar to neural networks. The actions of the first-level LA are real-valued parameters of the hyperplanes. The second level of LA makes Boolean decisions regarding which hyperplanes to include in order to create convex sets using an AND operation. The final layer of LA performs an OR operation on the outputs of the second-layer units. Therefore, the discriminant is a Boolean expression consisting of linear inequalities [44]. Similar ideas were applied in order to learn decision tree classifiers using LA teams [45], where an individual LA can be used to learn the best split rule at a given node.

A work closely related to ours is due to Thathachar and Sastry [46], where the authors use a team of LA in order to find the optimal discriminant function in a feature space. The discriminant functions are parametrized, and an LA is attached to each parameter. The LA team is involved in a cooperative game with common payoff. The general theme is to classify the next pattern with the chosen discriminant function and to either reward or penalize the joint action of the team depending on whether the classification agrees with the true label or not. Later, Santharam et al. [47] proposed using continuous LA in order to deal with the disadvantages of discretization, thus allowing an infinite number of actions. For an excellent review of the application of LA to the field of Pattern Recognition, we refer the reader to [44]. In [48], Zahiri devised an LA-based classifier that operates using hypercubes in a recursive manner. We believe that the latter idea can be used to extend our current solution, PolyLA, to handle multi-dimensional classification problems. In [49], the authors proposed LA optimization methods for multimodal functions. Through experimental settings, the performance of these algorithms was shown to outperform genetic algorithms.

Some improvements of the latter algorithm were introduced in [50] to better remove and regenerate the hypercubes and to better update the LA probabilities which yielded better accuracy.

In [51], the authors introduce a combination of LA and genetic algorithms for real-valued function optimization. The latter algorithm, termed GLA, bears similarity to the population-based incremental learning algorithm. The main task in Pattern Recognition is to output a class label from a feature vector given as input. In [52], LA were used where the actions of the LA are the possible classes. An LA gets rewarded or penalized in the training phase depending on the real class of the input. However, according to Barto and Anandan, “an action is optimal only in the context of certain feature vectors” [52]. This problem is known as associative learning, where the aim is to learn to associate different inputs with different actions.

Moreover, LA have also been used to learn the parameters of neural networks as an alternative to the classical gradient descent methods [53].

4.3 Swarm intelligence for classification

Swarm intelligence denotes a set of nature-inspired paradigms that have received a lot of attention in computer science due to their simplicity and adaptability [54]. Ant Colony Optimization (ACO) figures among the most popular swarm intelligence algorithms due to its ability to solve many optimization problems. ACO involves artificial ants operating a reinforced random walk over a graph. The ants release pheromones on favorable paths, which subsequent ants follow, creating a reinforcement-learning-based behavior. The colony of ants thus concentrates its walk on the most favorable paths and, in consequence, iteratively optimizes the solution [55].

Recently, work on ACO for classification, where the ants perform walks to separate classes, has been published [4, 56,57,58].Footnote 3 The approach, named PolyACO, relies upon ants walking in two and many dimensions to circumvent and separate classes from each other, in this way constructing decision boundaries not limited to linear or polynomial functions. Our current work is inspired by PolyACO [4], which pioneered the idea of using a reinforced random walk over a polygon for solving the classification problem. There are two main differences between PolyACO and our approach, PolyLA. First, PolyLA is less computationally intensive than PolyACO, as the latter uses global updates while the former resorts to local updates. In fact, because of the evaporation effect of the trails, the pheromones of all edges in the graph need to be updated at each iteration in PolyACO. In PolyLA, local updates are performed, as only the LA probabilities of the edges of nodes along the chosen path are adjusted. Despite the simplicity of PolyLA, we shall show that it exhibits performance comparable to PolyACO in the experimental Sect. 6. The second difference lies in the fact that PolyLA uses negative feedback updates by virtue of applying the theory of LA. The term negative feedback was discussed by Di Caro and Dorigo in their seminal work [59], where they contrast LA and ACO approaches for distributed routing over a graph. In [59], Di Caro and Dorigo pointed out the difficulty of creating LA systems that perform well on graph problems due to stability problems. According to Di Caro and Dorigo [59], “it would be interesting to investigate the use of negative reinforcements, even if it can potentially lead to stability problems, as observed by people working on older automata systems.” In simple words, negative feedback arises as each node involved in the chosen path performs local updates by reducing the choice probabilities of the non-walked edges while increasing the choice probability of the edge lying along the chosen path at the given iteration. ACO only uses positive feedback, as the edges along the walked path are reinforced via pheromones. In this paper, we provide theoretical results showing that the LA converge to an optimal solution. The theoretical results are novel in the field of LA, as this work is one of the few that presents formal proofs for the convergence of LA on a distributed graph, while related LA works usually conjecture similar theorems [13, 35, 36].

4.4 Support vector machine

Classification problems usually involve finding classification boundaries in feature spaces. Among the earliest and most popular classifiers figures the perceptron algorithm.

The perceptron works based on “error-driven learning”, whereby it iteratively learns a linear separator model by adjusting the weights of the model whenever it misclassifies an item from the training data.

However, the major limitation of the perceptron algorithm is that it only finds a linear decision boundary, which works well for linearly separable data but fails in the case of nonlinearly separable data. In order to deal with this limitation of linear classifiers, nonlinear SVM variants were proposed. SVM tries to circumvent over-fitting by choosing the maximal-margin hyperplane, where the margin is the smallest distance between the decision boundary and any of the data points.

A powerful concept in SVM is the “kernel trick”, equivalent to mapping the data to a higher-dimensional feature space in which the data items can be separable. Despite the well-recognized performance of SVM in the machine learning community, the task of choosing the right type of kernel (for example, linear, polynomial or Gaussian) is considered a black art!

5 PolyLA

This section presents our approach for two-class classification by introducing PolyLA. For the training phase, it maps the classification problem to a combinatorial optimization problem over the set of all different polygons in a grid system, formally specifying an appropriate cost function that rewards encircling one class. Thus, PolyLA trains the classifier by defining a polygon s. Subsequently, it uses s with ray casting to find whether an item is part of s.

Figure 4 presents an overview of the approach in the case of a simple two-class classification problem. The data are separated using a team of distributed LA yielding a polygon. Next, the polygon is used in the classification with ray casting. In this example, the first item to be labeled will be classified as \(T_1\) (“Class 1”), since it is shown to be inside the polygon, while the second item will be classified as \(T_2\) (“Class 2”), since it is outside the polygon.

Fig. 4 Overview of approach applied to a simple classification problem

In order to use a team of distributed LA for encircling points into polygons, we resort to a cost function that measures the quality of PolyLA solution. In order to find whether a point is within a polygon, we use ray casting.

5.1 Distributed LA

At each epoch, a polygon is chosen randomly according to a distribution over a set of possible paths. The polygon represents a self-enclosing path where the source coincides with the destination. The observed performance (classification accuracy) is used to reinforce the polygon by increasing the probability of choosing it again. Since the paths yielding low performance receive weak reinforcement signals, they are chosen less frequently. Thus, the scheme can adaptively focus more resources on paths that yield high performance.

Consider a grid modeled as a graph \(G = (V,E)\), where \(V = \{1,\ldots,m\}\) is the set of nodes in the graph and E is the set of directed links in the graph. We attach a LA to each node in the graph. The action of each LA attached to a node is the choice of the next hop (neighbor node). Let N(i) be the set of the neighbors of a node i.

The automaton’s action probability vector at node i at time t is \(\bar{\pi }_{i}^{D}(t)=[\pi _{i1}^{D}(t)\,\mathbb {1}_{N(i)}(1), \pi _{i2}^{D}(t)\,\mathbb {1}_{N(i)}(2), \ldots , \pi _{im}^{D}(t)\,\mathbb {1}_{N(i)}(m)]\), where \(\mathbb {1}_{N(i)}\) is the indicator function, such that \(\mathbb {1}_{N(i)}(j)\) equals 1 if node \(j \in N(i)\) and \(\mathbb {1}_{N(i)}(j)=0\) otherwise. This simple notation is just to emphasize that the only actions available are the neighbors of node i. Note also that \(\pi _{ii}^{D}(t)=0\). The normalized feedback function (or reward strength) is given by f(s(t)), where s(t) is the path taken at instant t. The function f(.) will be specified in the next section. Loosely speaking, f(.) measures the fitness of the solution, taking values in [0, 1], where 0 is the lowest possible reward and 1 is the highest.

The LA update equations at node S are given by:

$$\begin{aligned} \pi _{Sj}^{D}(t+1)=\pi _{Sj}^{D}(t)+\lambda f(s(t)) (\delta _{ju}-\pi _{Sj}^{D}(t)) \end{aligned}$$
(10)

where u is the next hop chosen by the LA attached to the source S, and

$$\begin{aligned} \delta _{ju} = {\left\{ \begin{array}{ll} 1 &{}\text{ if } j=u \\ 0 &{} \text{ else } \end{array}\right. } \end{aligned}$$
(11)

Note that, initially

\(\pi _{Sj}^{D}(0)= \frac{1}{ \mid N(S) \mid }\), for \(j \in N(S)\).

Similarly, we can define the equation for the update along the path s(t) that starts at the source node S and ends at destination node \(D=S\).
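
Equation 10, applied at every node along the walked path, can be sketched as follows (a minimal sketch; the variable names are ours):

```python
def update_node(pi_node, u, reward, lam):
    """Eq. 10: pi_node maps each neighbor j of this node to its choice
    probability, u is the next hop actually walked, and
    reward = f(s(t)) in [0, 1]. The update preserves normalization."""
    for j in pi_node:
        delta = 1.0 if j == u else 0.0       # Eq. 11
        pi_node[j] += lam * reward * (delta - pi_node[j])
```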

With the updating formula (Eq. 10), we can show that, if the optimal polygon is unique, the probability distribution converges to the distribution satisfying the following property:

$$\begin{aligned} \pi _{Sj}^{D} = {\left\{ \begin{array}{ll} 1 &{}\text{ if } j=j^{*} \\ 0 &{} \text{ else } \end{array}\right. } \end{aligned}$$
(12)

Algorithm 1 summarizes the entire process in a high-level pseudocode algorithm of PolyLA.

Algorithm 1 PolyLA
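
Since the original pseudocode is rendered as an image, the following Python sketch is our reconstruction of the training loop from Sects. 5.1, 5.2 and 5.6, not the authors' verbatim algorithm; `random_walk` and `f_score` stand for the self-avoiding walk and the cost function of Eq. 14, and `update_node` is the Eq. 10 sketch above:

```python
import random

def train_polyla(la_probs, T, T1, epochs=10000, lam=0.01):
    """la_probs[i] maps each neighbor j of node i to its choice
    probability, initialized uniformly to 1/|N(i)|."""
    best_s, best_f = None, -1.0
    source = random.choice(list(la_probs))      # initial source vertex
    for _ in range(epochs):
        s = random_walk(la_probs, source)       # self-enclosing path
        reward = f_score(s, T, T1)              # Eq. 14
        for i, j in zip(s, s[1:]):              # local updates only,
            update_node(la_probs[i], j, reward, lam)   # Eq. 10
        if reward > best_f:                     # bootstrap source (Sect. 5.6)
            best_s, best_f = s, reward
            source = random.choice(best_s)
    return best_s
```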

Example

Suppose, for example, that from node S, node \(j_1\) is visited, subsequently node \(j_2\), then node \(j_3\), then node \(j_4\), and then node S again. Hence, all the probability distributions \(\bar{\pi }_{S}^{D}(t)\), \(\bar{\pi }_{j_1}^{D}(t)\), \(\bar{\pi }_{j_2}^{D}(t)\), \(\bar{\pi }_{j_3}^{D}(t)\), \(\bar{\pi }_{j_4}^{D}(t)\) are updated according to the value of the path \(s(t)=(j_1, j_2, j_3, j_4, S)\).

Theorem 1

Let \(s^*\) be the path yielding the highest f(s), and let \(i^*\) be a node along \(s^*\). When the learning gain \(\lambda\) is sufficiently small, \(\pi _{i^* j}^{D}\) in the cross-correlation learning algorithm converges to the scalar \(\theta _{i^* j}^{D}\) which yields the highest accuracy, i.e., \(\lim _{t \rightarrow \infty } P\{|\pi _{i^* j}^{D} - \theta _{i^* j}^{D} | > \epsilon \} = 0\) for every \(\epsilon > 0\), where \(\theta _{i^* j}^{D}=1\) if both nodes \(i^*\) and j lie along the optimal path, and \(\theta _{i^* j}^{D}=0\) if \(i^*\) lies along the optimal path while j does not.

Proof

We shall prove that the learning algorithm defined converges to the optimal solution defined by the edges of the optimal polygon.

In the stochastic network environment, according to the Kushner’s weak convergence method [60] and following the proof in Vazquez-Abad and Mason’s work [61] as well as the proof by Li et al. [37], we can derive from the cross-correlation algorithm that as the learning gain \(\lambda\) goes to zero, the following equation is satisfied:

$$\begin{aligned} \frac{\mathrm{d}\pi _{Sj}^{D}(t)}{\mathrm{d}t} = -\lambda \pi _{Sj}^{D}(t)\left( \Delta _{Sj}^{D}(t)-\sum _{u}\Delta _{Su}^{D}(t)\,\theta _{Su}^{D}\right) \end{aligned}$$

\(\lambda\) corresponds to an update rate.

\(\Delta _{Sj}^{D}\) corresponds to the average value of f(s(t)) where s(t) includes the nodes S and j; we denote this by \(Sj \in s(t)\), meaning that edge Sj belongs to the path. More formally, \(\Delta _{Sj}^{D}=E\big (f(s(t))\mid \bar{\pi }_{i}^{D}(t), 1\le i \le m, \text { and } Sj \in s(t) \big )\).

To show that the solution is globally stable, let us define \(M_{S}^{D}(t)=\sum _{j}\pi _{Sj}^{D}(t) \Delta _{Sj}^{D}\).

From the cross-correlation learning algorithm [37], we can write:

$$\begin{aligned} M_{S}^{D}(t+1)-M_{S}^{D}(t)=-\sum _{j}\pi _{Sj}^{D}(t)\left( \left( \Delta _{Sj}^{D}\right) ^{2}-\left( \sum _{k}\pi _{Sk}^{D}(t)\Delta _{Sk}^{D}\right) ^{2}\right) \end{aligned}$$
(13)

Let \(s^*\) be the optimal path possessing the best performance. We have \(M_{S}^{D}(t+1)-M_{S}^{D}(t) \le 0\), since \(\sum _{j}\pi _{Sj}^{D}(t)\big ((\Delta _{Sj}^{D})^{2}-(\sum _{k}\pi _{Sk}^{D}(t)\Delta _{Sk}^{D})^{2}\big )\) equals the variance of \(\Delta _{Sj}^{D}\) under \(\bar{\pi }_{S}^{D}(t)\), which is nonnegative. Let \(M(t)=\sum _{i^* \in s^*} M_{i^*}^{D}\). Then, M(t) is monotonically decreasing with each update of the vector \(\bar{\pi }\) for \(i^*\) along \(s^*\). Let \(\Delta M(t)=\sum _{i^* \in s^*} M_{i^*}^{D}(t+1)-M_{i^*}^{D}(t)\). When \(\pi _{i^* j}^{D} = \theta _{i^* j}^{D}\), \(\Delta M(t)=0\) and the process reaches a stationary state.

Hence, when the learning gain is sufficiently small, the expected rewards keep increasing with time. The optimal solution may not be unique, but all optimal solutions give the same value of the objective function. \(\square\)

5.2 Cost function

Equation 14 represents the cost function. The cost function takes into account the information about whether an item \(t_i\) is inside or outside of a polygon s (see Sect. 2). This cost function measures how good a polygon s is at encircling and isolating one class in the training data and is defined as:

$$\begin{aligned} f(s)=\frac{\sum _{t_i \in T_1} h(t_i,s) +\sum _{t_j \not \in T_1} (1-h(t_j,s)) }{|T|}. \end{aligned}$$
(14)

In layman’s terms, Eq. 14 gives the percentage of items that are either correctly inside or correctly outside of the polygon. In the example in Fig. 1, the red polygon s correctly encircles all items of class \(T_1\), while correctly avoiding encircling any items from the other class \(T_2\). Since s is a polygon that perfectly separates the two classes, it gives \(f(s)=1\).Footnote 4

The problem reduces to optimizing f(s), given the training data T, subject to the search space \(\mathbf {S}\), which is equivalent to finding an \(s* \in \mathbf {S}\) so that \(f(s*)\ge f(s)\ \forall s \in \mathbf {S}\).
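
A direct transcription of Eq. 14 follows (a sketch; it assumes the `point_in_polygon` test of Sect. 5.3 and hashable training items):

```python
def f_score(polygon, T, T1):
    """Eq. 14: the fraction of items correctly inside (class T1) or
    correctly outside (any other class) the polygon."""
    correct = sum(1 for t in T
                  if (t in T1) == point_in_polygon(t, polygon))
    return correct / len(T)
```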

5.3 Ray casting

Vertical ray casting is used to determine whether an item is within or outside a many-edged polygon [62]. Ray casting is a simple algorithm that determines where a virtual ray enters and exits a given solid.

In a two-dimensional XY-plane, a ray is cast vertically toward the item with a counter starting at 0 that is incremented every time an edge of the polygon is crossed. When the ray reaches the item to be labeled, the parity of the counter determines whether the item is inside or outside the polygon: an even count means outside, while an odd count means inside. Formally, for a node \(t_i\) and a polygon s, we get \(h(t_i,s)\) representing whether it is inside or outside of the polygon as follows, extending Eq. 1:

$$\begin{aligned} h(t_i,s) = \left\{ \begin{array}{ll} 1 &{} \quad \hbox {if}\; t_i \in T_1 \;\hbox {and is inside of}\; s.\\ 0 &{} \quad \hbox {otherwise} \end{array} \right. \end{aligned}$$
(15)

\(h(t_i,s)\) gives 1 if \(t_i\) is correctly inside of the polygon, and 0 otherwise. Note that the cost function f(s) in Eq. 14 handles both items correctly inside and items correctly outside of polygons.
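
A standard even-odd ray-casting test, as a sketch (the paper casts the ray vertically; the horizontal-ray version below is equivalent by symmetry):

```python
def point_in_polygon(point, vertices):
    """Cast a ray from the point and count polygon-edge crossings:
    an odd count means the point is inside (even-odd rule)."""
    x, y = point
    inside = False
    n = len(vertices)
    for k in range(n):
        x1, y1 = vertices[k]
        x2, y2 = vertices[(k + 1) % n]
        if (y1 > y) != (y2 > y):     # edge straddles the ray's line
            # x-coordinate where the edge crosses the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:          # crossing lies to the right of the point
                inside = not inside
    return inside
```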

5.4 Remark about uniqueness of the path

An implicit assumption is that the optimal path is unique. However, in many cases, the optimal polygon is not unique and there might be multiple polygons yielding the same performance. This results in multiple equilibria [61]. Our experimental results confirm convergence to one of the equilibria.

5.5 Training phase

The classifier is trained using a guided walk with the team of distributed LA optimizing for the score function f(s) in order to create a polygon. By virtue of the reinforcement learning mechanism, the actions of the team of LA will converge toward a polygon that is a good separator. This polygon is the key to the classification.

Note that the classifier, implicitly, performs optimization according to the score function f(s).

The classifier can therefore be considered as a many-edged polygon with only vertical or horizontal edges.

The LA random walker is not allowed to walk on nodes that have previously been selected, except for the initial starting node.

5.6 Bootstrapping the source node

A detail worth mentioning is the way we choose the source node of the polygon. The performance of the scheme is dependent on the right choice of the source vertex for the polygon. In order to deal with this sensitivity, we allow the scheme to re-adjust its choice of the source vertex. Whenever a polygon gives a better performance compared to previous iterations, we choose a random node among the nodes of the best-known polygon as the new source node. Note that once the probabilities have converged, our experience is that as long as the source node is part of the best polygon, the choice of source node is of little importance.

More advanced methods can be used and verified empirically. However, we found that the latter simple strategy gave good performance.

6 Experiments

The experiments are carried out as traditional supervised learning approaches in two phases: training and classification. The behavior of the algorithm can be best explained by examining how it behaves on the training data. Because of this, the figures depict a visualization of the polygon on the training data—yielding a good overview of the algorithm behavior.

The data are generated by various functions intended to show the performance of PolyLA in various settings. In each experiment, 2000 data points are generated, of which half are randomly selected for training and the rest used for classification. Further, the data always contain two classes: the blue \(T_1\) class and the red \(T_2\) class.

The granularity of the grids is always chosen as \(10\times 10\). A summary of the results is presented in Table 1.
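
The paper does not specify how the synthetic data are generated; comparable settings can be reproduced with scikit-learn (our tooling choice, not the authors'):

```python
from sklearn.datasets import make_blobs, make_circles, make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=2000, noise=0.05)      # semi-circles
# X, y = make_circles(n_samples=2000, noise=0.05)  # circular boundary
# X, y = make_blobs(n_samples=2000, centers=2)     # Gaussian blobs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
```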

6.1 Simple environments

We present a simple experimental setting as a proof of concept of PolyLA. This section empirically shows that the approach works in a simple environment with two easily separable sets of data. The data are composed of two blocks of data: \(T_1\) and \(T_2\). Figure 5 shows the LA convergence after the training phase in this environment. The LA have built a rectangular polygon encircling all items in \(T_1\), but none of the items in \(T_2\). Since this is a polygon that perfectly separates the classes, it yields \(f(s)=1\). The polygon solution in this example is quite straightforward. In this simple proof of concept, PolyLA gave an accuracy of 100%.

Fig. 5 Simple data set with 0% noise

6.2 Gaussian

Figure 6 depicts the classification polygon found by the distributed LA for data generated from two different Gaussian distributions. This experiment is interesting because, in contrast to the proof of concept, the data overlap, so no classifier will be able to give a perfect result; it is therefore a good test for PolyLA.

For the classification, the PolyLA classifier gave an accuracy of 0.846, a recall of 0.836 and a precision of 0.853. For comparison purposes, linear and polynomial SVM gave, on the same data, accuracies of 0.837 and 0.842, respectively.

These high numbers indicate that PolyLA is able to perform in line with SVM even when data are overlapping—without loss of precision or recall.

Fig. 6 Gaussian distribution with overlap

6.2.1 Semi-circles

Figure 7 shows the scheme in a more complex scenario with semi-circles (or half moons), where there are no clear separation boundaries. It is an interesting experiment because there exists no linear or polynomial solution that results in a perfect classifier without mapping to multiple dimensions or depending on a kernel trick.

Despite the added complexity, the PolyLA approach works perfectly and surrounds the data from the blue class without including the red. In fact, in the classification phase it gives an accuracy, precision and recall of 1. For comparison purposes, linear and polynomial SVM gave accuracies of 0.912 and 0.997 on the same data.

Fig. 7 Half-moon

6.2.2 Circles

Figure 8 illustrates the case of nonlinear classification boundary in the form of a circle.

Despite the added complexity, the PolyLA approach works perfectly by surrounding the data from the blue class without including the red. Again, the accuracy, precision and recall for PolyLA are 1, while linear and polynomial SVM give accuracies of 0.538 and 0.892, respectively. For SVM to match the performance of PolyLA, we need to rely on an RBF kernel.

Fig. 8 Circular without noise

In Fig. 9, we add 5% noise to the data. Noise means simply that some points of one class overlap with the other class, making it impossible to separate these overlapping points. Despite the added noise, the scheme performs well. We would expect an approximate 2.5% loss in accuracy because of the 5% noise. Our empirical results confirm this by giving an accuracy of 0.973. For comparison purposes, SVM performance drops significantly when noise is added.

Fig. 9 Circular with 5% noise

6.2.3 Gaussian blobs

Figures 10, 11, 12 and 13 depict the case of data generated from Gaussian distributions with blob distances of, respectively, 140, 120, 60 and 0.

These are interesting results because they explain the behavior of PolyLA as the data overlap more and more, in turn becoming more and more difficult to separate. In the most extreme case, with a distance of 0 in Fig. 13, the data overlap completely and should be indistinguishable from completely random data.

In Fig. 10, the data are barely overlapping and PolyLA yields in the classification phase an accuracy of 0.946, a recall of 0.936 and a precision of 0.956.

Fig. 10 Gaussian blob distance 140

In Fig. 11, the data are slightly more overlapping. However, PolyLA shows barely any drop in performance. It is still able to accurately separate the data and yields in the classification phase an accuracy of 0.943, a recall of 0.936 and a precision of 0.938.

Fig. 11 Gaussian blob distance 120

In Fig. 12, the data overlap considerably, and PolyLA yields in the classification phase an accuracy of 0.747, a recall of 0.684 and a precision of 0.78. It is noteworthy that, by examining the figure, it is apparent that a higher granularity of the grid would improve the algorithm's performance.

Fig. 12 Gaussian blob distance 60

In Fig. 13, the data are completely overlapping and the classes are indistinguishable from each other. Clearly, PolyLA behaves very differently here than in Figs. 12, 11 and 10. In this scenario, there is no apparent pattern in the polygon; as with the data, the polygon appears random. Our empirical results confirm this, giving in the classification phase an accuracy barely above random of 0.526, a recall of 0.528 and a precision of 0.486.

This indicates that PolyLA is able to accurately classify data, even when the data are overlapping and hard to distinguish. Only when the two classes are completely overlapping does PolyLA fall short.

6.2.4 Real-life data sets

In the above experiments, we have focused on illustrating the performance of PolyLA using figures that demonstrate its ability to perform separation. At this juncture, we shall use two real-life data sets: the Iris data set and the Wine data set. It is worth mentioning that PolyLA does not originally handle directly the multi-dimensional classification arising in the case of the Iris and Wine data sets. We deal with the multi-dimensional case according to the method detailed in Sect. 2.3.

We also need to emphasize that PolyLA possesses performance similar to PolyACO, as can be seen by examining the results reported in [56]. In fact, PolyACO achieves 0.948 accuracy while PolyLA outperforms it by achieving 0.97 for the circular case with \(5\%\) noise. However, PolyACO achieves slightly higher performance on the Iris data set, namely 0.96 accuracy, while PolyLA achieves 0.82. When it comes to the Wine data set, the performance of PolyLA is 0.68 while PolyACO yields 0.69. Furthermore, in Table 1, we compare against three neural networks: a one-hidden-layer neural network (1NN), a two-hidden-layer neural network (2NN) and a three-hidden-layer neural network (3NN). Please note that 2NN and 3NN are considered deep learning algorithms. We observe from Table 1 that these neural networks outperform PolyLA, linear SVM (LSVM) and polynomial SVM (PSVM). Although PolyLA is able to find nonlinear classification boundaries that might be complex to find, and consequently outperforms LSVM, it is less accurate than deep neural networks, which excel in building nonlinear decision boundaries.

Fig. 13 Gaussian blob distance 0—indistinguishable from random noise

Table 1 Summary of PolyLA performance

7 Conclusion

In this paper, we propose a novel classifier in two-dimensional feature space based on the theory of Learning Automata. The essence of our scheme is to search for a separator in the feature space by imposing an LA-based random walk in a grid system. To each node in the grid, we attach an LA whose actions are the choices of the edges forming the separator. Indeed, PolyLA has appealing properties compared to SVM. While SVM performance is subject to the user's choice of the kernel function, PolyLA can find arbitrarily complex separators in the feature space without any user guidance. We provide theoretical results that demonstrate the convergence of PolyLA based on the theory of weak convergence [60]. Comprehensive experimental results illustrate the performance of our method and its superiority to SVM in most cases. PolyLA faces challenges when dealing with multi-dimensional data as, inevitably, the number of planes explodes as the number of features increases. It would be interesting to investigate mitigating this issue in future work.