Introduction

The goal of gazing toward the objects of interest with a camera attached to a computer has contributed to the creation of an important research area, devoted to the development of computational models for visual attention (VA). Nowadays, within the neuroscience community, VA is understood as a natural process performed by the brain, whose functionality is to perceive salient visual features. This cognitive task is necessary since it is impossible to focus the sight at two different objects during a single unit of time and space. Moreover, VA is a skill which allows a creature, living or artificial, to direct their gaze rapidly toward the objects of interest in the visual environment [1]. Thus, the objects of interest refer to those locations or regions in the image that are projected from the environment, and which contain important information at a given time. Hence, visual attention is considered as one of the most important mechanisms in the visual system, whose functionality in the brain or visual cortical areas is to provide with the selectivity process that filters the visual information coming from the retina through the visual cortex. In the past, it was widely believed that human observers constructed a complete representation of everything in their visual field [2, 3]. Today, within the computational neuroscience and computer vision communities, it is well known that in order to process the high amount of visual information, it is necessary to break down the problem of image understanding into a series of simple, local and serial image operations that can achieve near real-time performance with typical computer systems.

In nature, the VA functionality, regarding the localization of objects, is related to the brain areas around the dorsal stream, see Fig. 1. Thus, the dorsal pathway has been defined as projecting from V1 through V2, V3, the middle temporal area (MT), the medial superior temporal area (MST) and finally to the posterior parietal cortex [4]. Nevertheless, there is a lack of consensus about the specific brain areas, structure and functionality that conform the dorsal stream. For example, in the work of Milner and Goodale [5], the dorsal stream is also known as the “how” stream; while in the work developed by Baluch and Itti [6], the dorsal stream process differs significantly with respect to previous works.

Fig. 1
figure 1

The correspondence between the dorsal stream areas and the stages of the artificial model. The idea is to emulate the transformations that the input image undergoes along the visual attention pathway of the natural system

Nowadays, classical explanations of VA are in accordance with the idea that the dorsal stream is influenced by bottom-up (BU) and top-down (TD) factors. Thus, some works explain the existence of two interacting neural systems that are involved in the control of both factors to approach the problem of visual attention [712]. In general, TD or cognitive factors are voluntary; i.e., knowledge, expectations, desires and current goals. While, BU or reactive factors depend entirely on stimulus; i.e., novelty and unexpectedness. As a result, VA is usually studied as two separate proposals [8] focusing on BU or TD processes. Next, we formulate the main problem to be studied in this paper.

Problem Statement

This paper describes a system that automatically designs novel computational models of an artificial dorsal stream (ADS). The problem is that of solving the automatic design of visual attention programs (VAPs) without manually reengineering the best solutions according to the task at hand; i.e., the application to a specific image database. Therefore, VA can be understood as a process that seeks to emulate its natural counterpart by mimicking two main tasks [26].

  • Acquisition, in this stage, features are captured early, automatically and in parallel across the visual field using visual receptors. Normally, the receptors are specialized to react to specific scene properties such as orientation, color, brightness and movement. Afterward, the information of receptors is transmitted and processed along the different areas of the brain.

  • Integration, in this stage, the separate visual features present in the environment are combined in order to perceive the most prominent region while seeking to attend the objects of interest contained within the scene.

In this work, VA is seen as the product of an optimization process, where BU and TD factors are integrated to produce this single basic phenomenon. Thus, as in biological VA, the algorithm is composed of two tasks that are used in the perception and action cycle. The first task, related to perception, is ascribable to the limited processing capacity of the visual system; while, the second function related to action is a consequence of the selectivity or ability to filter unwanted visual information. In this way, it is said that VA selectivity is controlled by both cognitive or TD factors such as knowledge, expectation and current goals; as well as, involuntary or BU factors that refer to reactive and sensory stimulation [9, 11]. Nonetheless, an emphasis is given here to the ability of solving VA problems under a unique optimization framework [13]. It is for this reason that in this paper VA is studied as a unique process resulting from a single mechanism designed to obey a given general purpose, which could be made of different particular goals aiming to design a relationship between the observer and the scene. In particular, this paper details the application of the proposed methodology for the solution of challenging TD problems that have been previously employed in the literature and which are used here as a benchmark to compare our proposal with four state-of-the-art algorithms [14]. We considered this framework as an unified approach for VA [15].

As a result, we identify some problems related to both stages of VA. Firstly, the selectivity process related to the acquisition stage is understood as the ability to filter out unnecessary visual information and whose main aim is to render visible the objects of interest. In other words, selectivity responds the questions: “What features should be utilized?” and “When to use a given feature?” However, the answers are not evident and should be searched among the space of possible VAPs. Secondly, the integration of the best possible visual features into a single map of salience, which is seen as a difficult task since different visual dimensions produce a combinatorial problem. Finally, we have the problem of formulating VA in terms of an optimization/search process that looks for an ADS capable of attending a given target. In this paper, we will illustrate how an evolutionary approach allows us to address these issues by implementing a brain programming (BP) strategy.

Research Contributions

This paper outlines the following research contributions.

  • First, the automated design of an ADS is explained, where the problem is formulated in terms of an optimization/search-based approach. The explanation presented extends the previous one of Dozal et al. [13, 15] and Olague et al. [16].

  • Second, a new strategy called brain programming mainly based on genetic programming (GP) is introduced. The approach follows the hierarchical paradigm developed within the computational neuroscience community in combination with the idea that special functions, in the form of mathematical and computational programs, can be evolved to solve the problem of VA. Indeed, the proposed approach is able to design competitive programs that challenge the designs of human experts according to the achieved results using a standard testbed.

  • Finally, the performance of each experiment is reported in Table 5, followed by a detailed analysis of each successful experiment. In this way, the final structures of the best VAPs are described together with some details about the inner workings of the system. Moreover, the article presents results that give statistical insights about the dynamics and overall performance of the evolutionary system.

This paper is organized as follows. A first section devoted to the review of previous works in the area of VA is presented, while giving a brief introduction to GP. Next, the methods of ADS and brain programming are explained from a computational standpoint. Finally, experimental results are provided followed by the conclusions.

Related Work

Since the late nineteenth century, the VA mechanism has been studied by researchers from different scientific disciplines, such as neurologists, physiologists, psychologists, and in the last three-decades by people working at the intersection of neuroscience, cognitive science and artificial intelligence, which is commonly known as cognitive computation. This last community has studied the problem of VA as a feasible way to reduce the complexity of visual information processing [1721]. Recently, a discipline known as cognitive vision [22, 23] was created from the combination of the computer vision and cognitive research areas. Thus, it is said that cognitive vision systems should be able to engage in purposive goal-directed behavior, while adapting robustly to unexpected changes of the visual field, and it should have the ability to anticipate the occurrence of objects or events [22]. Moreover, VA is considered a critical issue in cognitive vision [24].

Owing to the importance of the VA mechanism, many different definitions have been formulated. Some of those ideas are summarized next. At the beginning of the 1980s, Posner et al. [25] proposed that VA could be understood in terms of a “spotlight” that enhances the efficiency of detection. Later, Treisman and Gelade [26] compared VA with the concept of “glue,” which integrates the initially separable features into unitary objects. Then, in the same decade Koch and Ullman [1] described VA as a skill which allows the creature to direct their gaze rapidly toward the objects of interest. Later, two different proposals were developed. In the middle of the 1990s, Desimone and Duncan [9] introduced VA as an emergent property of visual processing that is seen as the product of many neural mechanisms working to resolve the problems of resource competition and control of behavior. Finally, Wolfe [27] explains VA as an “enabler” rather than an actor. In other words, instead of saying that attention somehow identifies an object it is said that attention enables object recognition processes to work on a single item at a given time. Hence, after considering the previous works, we introduce a new VA definition consisting on a functional relationship that filters the visual information according to the task at hand.

Definition 1 (Visual attention)

The concept of VA is defined as a process that establishes a relationship between the different properties or features of the scene, which are perceived through the visual system, with the aim of selecting the most suitable aspect for the task at hand.

In this way, we consider that VA performs both the BU and TD processes through a single computational structure with the goal of achieving the task at hand while considering a purposeful framework [28]. Therefore, according to the feature-integration theory of Treisman and Gelade [26], the VA process is divided in two stages: acquisition and integration. Firstly, during the acquisition, it is said that the receptors are adapted to perceive special features such as orientation, color, brightness and movement. Secondly, the information of receptors is transmitted and processed along different areas of the brain with the goal of creating a suitable description of the objects of interest. Therefore, during the integration stage, all separate features used by the focus of attention task are merged into a single computational structure describing the salient region. In this way, from a computational standpoint, the whole program structure can be seen as a way of describing a region of interest within an image; where the final result is useful for the visual perception task of attending the objects of interest.

Brief History of VA from a Computational Standpoint

As we have reviewed the feature-integration theory of visual attention was proposed by Treisman and Gelade [26]. This theory is the most widely accepted paradigm for VA within the cognitive science community, since it is considered as a fundamental step for VA. In this way, the first neurological and plausible computational model for VA was outlined by Koch and Ullman [1] where it is introduced a model composed of three stages. First, a set of elementary features are computed across the visual field in parallel. Second, the resulting maps are combined into a single saliency map that encodes the prominence of the scene. Third, a winner-take-all (WTA) network is used to iteratively select the most prominent regions. As a result, the WTA network shifts to the next most prominent region after inhibiting the most active location. A year later, Fukushima [29] proposed a model of selective attention useful for visual pattern recognition, which mimics the functionality of the human brain. This model is a hierarchical multilayered neural network capable of paying attention sequentially to two or more patterns. Afterward, Burt [30] described three mechanisms for VA and their computational implementation: foveation, tracking and high-level interpretation. Then, Sandon [31] in 1990 made the first implementation of a VA model based on the work of Koch and Ullman. Later, Olshausen et al. [32] proposed a neurobiological solution to the problem of object recognition using a model of VA that is capable of creating objects representations that are invariant to position and scale. In this case, their work was motivated by the theory of Palmer [33], who suggested that the result of attention should be focused on the object, considering a canonical or object-based reference frame. At the same time, Milanese [34] proposed a VA system that uses mechanisms inspired from biological processes, which were further adopted by the computer vision community and are now commonly utilized until these days [14]. Moreover, Tsotsos et al. [35] developed a biologically plausible model of VA making special emphasis to the computational utility through the application of special VA routines. In addition to the computational explanations that had been proposed in the literature, their idea includes a new winner-take-all algorithm that better matches the functionality of the primate visual system by providing a method for controlling a robot vision system. Nevertheless, one of the most well-known computational models of VA is probably that of Itti et al. [36], which provided a software that popularized the theoretical processes. Later, Itti and Koch [14] compared four different strategies for feature combination to achieve object detection. In their work, they built a simple feed-forward architecture that consists of two main stages: extraction of early visual features and saliency map computation. The first stage could be implemented in parallel, rendering it suitable for real-time applications. Later, Torralba [37] in 2003 created a VA model that incorporates contextual statistical information. The quality of the results showed that contextual information improves object detection by directing the visual search toward the prominent features of the scene. Another important work is that of Walther and Koch [38] where a biologically plausible model is proposed for forming and attending proto-objects aiming to enhance the task of object recognition. 
Afterward, Cutsuridis [39] discusses a cognitive model of saliency based on the interaction between BU and TD stimulus, which drives VA between saccades. In his work, he offered a hypothesis of how the participating brain areas work together to attend a specific region from a visual scene. Another study developed by Kootstra et al. [40] propose a BU model that predicts eye fixations based on symmetry instead of the contrast feature. Their main result shows that symmetry provides a better prediction in the case of human eye fixations. Finally, Marat et al. [41] considered the problem of face recognition; they proposed the incorporation of a “face" saliency map within existing VA models. The idea is that this map emphasizes areas where a face should be detected in order to improve the prediction of a set of eye movements during video playing.

Brief Introduction to GP

Evolutionary computation incorporates a wide range of techniques to solve problems based on Darwin’s theory of evolution; and the well-known principle of natural selection, commonly known as “survival of the fittest.” Today, the most popular of evolutionary algorithms is GP which was formalized by Koza [42]; and it is understood as a machine learning approach useful in the design of computer programs. GP is an offshoot of genetic algorithms that automatically solves problems without requiring the designer to know or specify in advance the final form or structure of the problem solution. Thus, GP is said to be a systematic, domain independent method; nevertheless, the user usually needs to define the way in which GP evolves programs often in a domain-specific language, through a user-defined task, using a set of functions and terminals together with the given criteria.

In artificial evolution, such as in nature, there is a population of individuals. In GP, the individuals in the population are computer programs, which are usually written as syntax trees or as corresponding expressions in prefix notation. Each individual in the population is evaluated to establish its quality at solving a given problem. Note that such measure is called the fitness function, which is widely known as the objective function. In general, the individuals in the initial iteration, or generation, of the process will have poor fitness. Nonetheless, some individuals in the population will prove to be slightly fitter than others. The fitter individuals survive and produce offspring that replace older and relatively unfit individuals from the population. In Koza’s style of GP, the population is usually of fixed size. The paradigm incorporates the idea of genetic inheritance through some specialized computational operations that are used to produce new programs by either mutation or crossover. The Darwinian principle of reproduction and survival of the fittest, as well as the genetic operations of crossover and mutation, are both inspired by the biological exchange of genetic material and are used to create a new set of individuals to improve the current population. Mutation operators make a random change to a copy of a selected program. In other words, a selected point mutation replaces a node within the tree by another randomly generated tree without imposing an arity constraint; i.e., number of arguments. Crossover works with two selected programs, called parents, from the current population; the idea is to exchange code between parents by randomly selecting subtrees. In Koza’s style of programming subtree, crossover removes one branch from one tree and inserts it into the other. These simple processes, mutation and crossover, should ensure that the new programs are syntactically valid. Also, a problem known as bloat, related to the uncontrollable growth of the programs, is usually approached with a user-defined bound.

Thus, GP can be seen as a machine learning approach that looks into the search space of all possible computer programs that are defined appropriately to the problem domain. In the same way, as any other stochastic process, the GP algorithm can never guarantee an optimal solution; however, this kind of heuristic methods can help in the solution of problems where deterministic methods are unable to offer a suitable solution. In our GP-based approach, that we are calling brain programming, a set of programs are searched with the goal of synthesizing key mathematical operations applied within the hierarchical computational structure of the ADS.

Methods

This section provides a general overview about the proposed optimization methodology useful in the design of an ADS. First, we introduce the biologically inspired model for VA that serves as the general framework; it is on this model where key functions will be optimized through the paradigm of artificial evolution. Next, we provide some details about the implementation of the proposed BP algorithm, which has been applied to generate optimal ADSs to solve a challenging VA-database problem.

The ADS Algorithm

The ADS algorithm is as a computational model that mimics the biological process of focus of attention [1, 29, 30]. The ADS attempts to emulate the natural process that takes place along the visual cortex, specifically through the dorsal stream. Today, the ADS is said to perform its task according to the neurological theory of feature-integration [26]. This theory states that the operation of an ADS is divided into a two main-stage process that works sequentially to fulfill the VA task. The stages are 1) visual feature acquisition and 2) feature-integration. From a computational perspective, the ADS is comprised of several functions arranged in a forward manner. Figure 2 shows the flow diagram of the ADS algorithm, and in the following subsections, we explain both stages of the ADS.

Fig. 2
figure 2

Flowchart of the artificial dorsal stream

Visual Feature Extraction and Visual Maps

In the visual system of primates, visual features follow a pathway across the retina, superior colliculus, lateral geniculate nucleus and early visual cortical areas [43]. Along the pathway, the low level mechanisms applied to the task of feature extraction act in parallel over the entire visual field to provide the stimulus for the subsequent stages of the ADS, which highlight the prominent regions of the image [1, 26]. In fact, it has been shown by Julesz [44] that a set of texture discriminators, called textons, provide a specific set of features that are detected in parallel. These features are specialized in the detection of color, line orientation and certain shape parameters such as curvature and convexity. Moreover, Treisman and Gelade [26] also described such features as acting in parallel with each feature defined to conform a particular dimension. Next, some basic definitions are introduced to understand the overall strategy.

Definition 2 (Image as the graph of a function)

Let f be a function \({f:U \subset \mathbb{R}^2 \rightarrow \mathbb{R}}\). The graph or image I of f is the subset of \({\mathbb{R}^3}\) that consist of the points (xyf(xy)), in which the ordered pair (xy) is a point in U and f(xy) is the value of f at that point. Symbolically, the image \({I = \{(x,y,f(x,y)) \in \mathbb{R}^3|(x,y) \in U\}}\).

This definition recalls that the scene is perceived through a camera composed of a two-dimensional array of sensors that measures the amount of light that reaches them. Note that the definition can refer to an image taken with a camera or to any of the false images issued by an image operation. In this way, the images are the result of the impression of light across the two-dimensional sensor plane, as well as the output issued by the image processing. Moreover, the image I obtained by the camera is defined as the graph of a function, since this mathematical concept is used at the moment of describing the transformations that will be used in our functional approach. Thus, this definition helps us to understand the image as the initial input to a function-driven paradigm of the visual cortex model. Previous works [4, 5, 9] described the brain as divided into several specialized areas, each of them performing a specific task. In our work, we claim that such functionality can be imitated by a set of mathematical functions and operators together with some computational structures.

In general, digital color images are composed of three bands at different wavelengths of light: red, green and blue. In this way, it is possible to transform an image that is represented in the RGB color model into another, such as the CMYK and HSV models. In this paper, the input set of images is defined as follows: I color = {I rI gI bI cI mI yI kI hI sI v} that provides the initial representation of the scene. Next, the input image in I color is transformed by a visual operator (VO) to recreate the feature extraction process of the brain, resulting into a visual map (VM). In our work, each VO is defined as a mapping \({\text{VO}}_d : I_{\rm color} \rightarrow {\text{VM}}_d\) where the domain of the function is I color and the codomain of this operation is a visual map VM d along a particular dimension. Each kind of feature is classified or grouped into several dimensions, such as color, orientation, spatial frequency, brightness and direction of movement; this is according to the theory of Treisman and Gelade [26]. The current work follows the implementation of Itti et al. [36] where the transformed pixel value is represented by the feature prominence along each dimension considering orientation, color and intensity; hence, \(d \in \{O, C, {\text{Int}}\}\). Figure 2 shows that features are extracted sequentially one at a time by applying the corresponding operator VO d . These features are detailed next.

Orientation

The extraction of orientation information on the natural visual system occurs by the action of simple and complex cells at the primary visual cortex (V1), as well as in the V2 layer. These cells are in charge of estimating the sensitivity to the orientation stimulus by decomposing the image into a set of small linear segments at different orientations and scales [45]. From a computational perspective, the operator VOO extracts prominent edges that are present in the image and is defined as follows:

$${\text{VO}}_{\rm O}\hbox{:} I_{\rm color} \rightarrow {\text{VM}}_{\rm O}.$$
(1)

The result of this operation is a visual map VMO for which the pixel values represent the prominence of the feature considering the orientation dimension. Typical systems considered a first processing step that produces simple features modeled as a neural network or through convolutions of the input images with Gabor filters of various orientations, spatial frequencies and scales [29, 34, 36, 37]. The goal is to compute a variety of structures, such as corners, line intersections and parts of objects.

Color

In the natural system, the color is encoded through a set of photoreceptor cells known as cones, which are located at the retina. The human color vision depends on three types of photosensitive molecules: short wavelength or blue sensitive (SWS), middle wavelength or green sensitive (MWS) and long wavelength or red sensitive (LWS) [46]. In this way, the visual pigments consist of the molecules that transform light energy into an electrical impulse. In nature, the yellow color is a special case since it cannot be perceived by the cones but rather at the retinal ganglion cells. Moreover, it is well known that in several modules of the visual cortex, there are cells that respond to color stimulus as in the V1, V2, and V4 layers. On the other hand, on the computational framework these ideas are achieved with the following visual operator:

$${\text{VO}}_{\rm C}\hbox{:} I_{\rm color} \rightarrow {\text{VM}}_{\rm C}.$$
(2)

The result of this operation is an image or visual map VMC containing the prominence in color. In previous work, these pre-attentive mechanisms, sensitive to color, are computed through spatial differences on the image channels using a center-surround contrast; i.e., red–green and blue–yellow color opponencies [34, 36, 38]. The goal is to find high-contrast regions within the image.

Intensity

The intensity is a magnitude that indicates the amount of light striking a photosensitive receptor. Physiologically, human beings have specialized ganglion cells to record this feature. In general, each ganglion cell has a circular receptive field that responds to the intensity of light reaching it [9, 26]. In order to obtain the intensity a simple formula is applied:

$${\text{VM}}_{\rm Int} = \frac{I_{\rm r} + I_{\rm g} + I_{\rm b}}{3},$$
(3)

where I rI g and I b are the red, green and blue bands of the image in the RGB color space [34, 36, 38]. The result of this operation is a visual map VMInt that is directly computed with the pixel values, obtained by the camera, and which represents the intensity defined with the above formula.

Although the acquisition of visual features can be performed in a parallel manner in the implementation of the ADS algorithm, depicted in Fig. 2, a for loop is used to iterate across the dimensions. During one loop of the algorithm, the operator VO d is applied over the input I color. After the construction of each visual map, the system must calculate an additional map known as the conspicuity map for each dimension. The following section is dedicated to the description of such process.

Computation of the Conspicuity Maps

Once the VMs are obtained, the next step is to compute the conspicuity maps (CMs). These maps are obtained by means of a function that is applied to simulate the center-surround (CS) receptive fields. This part of the computational system is inspired from the natural structure that measures the differences between firing rates at the center (c) and surroundings (s) areas of the ganglion cells. In our work, the CMs are obtained following the Walther and Koch model [38]. The computational center-surround mechanism compares the average value of a center region to the average value of a surrounding region in the visual receptive fields [18]. In particular, the center-surround is implemented in the model as the difference between fine and coarse scales using a pyramid of nine scales [36]. This function is defined as follows:

$${\text{CS}}\hbox{:}{\text{VM}}_{d}^A \rightarrow {\text{CM}}_{d} \quad \forall\;d \in \{O, C, {\text{Int}}\},$$

where \(A=\{1,2, \ldots, 9\}\) is the set of scales \(\alpha \in A\), and VM A d is a pyramid for each dimension d that is obtained by employing a Gaussian filter applied to each of the corresponding VMs using σ = α, which can be written as \({\text{VM}}_{d}^{\alpha}=G_{\sigma=\alpha}\star {\text{VM}}_d\) with \(\star\) representing the convolution operation. In this way, the CS mapping is composed of two steps that are defined next.

  • First, an across-scale subtraction \(\ominus\) is performed resulting in a center-surround pyramid VM ω d of six levels; in such a way that the value of the pixel corresponds to a difference of Gaussians from the nine level pyramid VM A d . Thus, the across-scale subtraction is defined as follows:

    $${\text{VM}}_d^{\omega} = N(| {\text{VM}}_{d}^c \ominus {\text{VM}}_{d}^s|) \quad \forall\; d \in \{O, C, {\text{Int}}\},$$

    where \(N(\cdot)\) is a normalization operator, c = {3, 4, 5} and s = {c + 3,c + 4}. Note that for each scale value of c exists two values of s; where \(s,c \subset A\) and the final result is a pyramid with six levels \(\omega = \{1, 2, \ldots, 6\}\) calculated through six across-scale subtractions.

  • Second, the VM ω d pyramids for each dimension perform an across-scale summation \(\oplus\) in order to obtain all conspicuity maps CM d . This can be written as follows:

    $${\text{CM}}_d = N(\oplus^{6}_{\omega=1} {\text{VM}}_d^{\omega}) \quad \forall\;d \in \{O, C, {\text{Int}}\}.$$

At this point, we have a normalized CM for each dimension. Once that the feature extraction and center-surround processes are finished the “for” loop ends, see Fig. 2. The next stage on the ADS algorithm deals with the combination of the CMs into a single saliency map.

Feature Combination and the Saliency Map

The aim of the feature combination process is to fuse the CMs into a single map of salience information. This process highlights a set of locations through a saliency map (SM), resulting in a set of prominent regions within the image. Anatomically, the location of the SM within the brain is unknown. Moreover, within the neuroscience community, a clear description of how the brain makes such an integration is still a subject of discussion. In this way, Koch and Ullman [1] proposed as hypothesis that the SM resides either at the lateral geniculate nucleus (LGN), or at the striate cortex (V1). Along this line of research, several authors have proposed other explanations for the location of the SM; for example the thalamic nucleus and the pulvinar regions proposed in [47], the V1 as described in [48], the V4 region explained in [49], even at the posterior parietal cortex detailed in [50] and finally at the lateral intraparietal area as illustrated in [51].

The feature-integration process is a difficult task since the CMs are part of different and unrelated visual modalities working at the sensory system level. Indeed, we claim that a goal-driven paradigm could serve the purpose of integrating such problematic and apparently different characteristics of the feature-integration process. In this way, most models involving BU and TD factors can be seen as highly relevant regarding the achievement and pursuance of the VA process. Thus, the criteria that guide the search toward the most suitable combination of characteristics should be defined in accordance with the task at hand. Therefore, the integration of CMs is accomplished by a feature-integration operator FI as follows:

$${\text{FI}}\hbox{:}{\text{CM}}_d \rightarrow {\text{SM}} \quad \forall\;d\in \{O, C, {\text{Int}}\}.$$
(4)

Commonly the SM is the result of the average of all CMs. Once the integration of features is achieved, a WTA network is applied to compute the most salient pixel from the resulting SM. Then, a region around the winner, known as proto-object, is delimited using a spread function that is controlled by a threshold. Thus, the proto-object indicates the location of the most prominent region within the original image [52, 53]. Note that an image can contain more than one proto-object; hence, the WTA process should be capable of distinguishing several regions. The feature-integration stage is depicted in Fig. 2.

Brain Programming

The present work follows a function-driven methodology that approaches the biological visual process from the standpoint of its functionality, while paying special attention to the goals. Our method differs to the data-driven approach that is based on a large amount of knowledge, defined by the experts, encoded as a set of patches; where a patch is a small contrasting part of an image. Those patches come from the hard-coded operations that are applied at the different stages of the ADS. The proposed methodology, brain programming, describes a manner in which an ADS can be optimized to imitate the functionality of specialized areas of the brain through a set of operators. This paper presents a function-driven approach to extract and combine the relevant information that solves a specific visual task; in this case the VA problem. In particular, we claim that brain programming, a GP-based strategy embedded into a well-known hierarchical process and combined with key elements such as center-surround mechanisms, including normalization and pyramid scale processes, are able to provide with the suitable tools that are necessary to implement this methodology in highly creative ways.

Clearly, the classical approach of single or multiple tree GP process is insufficient to evolve the whole ADS owing to the high level of complexity. Nevertheless, the approach described in this paper provides a strategy that is able to automatically design such programs or solutions. In this section, we describe the main aspects to implement the BP, a new evolutionary technique, with the aim of designing ADSs for VA problems, which will be further tested against four state-of-the-art TD strategies using a well-known database. Thus, BP is a bioinspired strategy where the major changes with respect to the GP came from the fact that the possible solutions or individuals cannot be defined as an array of trees, but rather as a complex structure composed of several trees and other predefined processes. The motivation for these changes is the idea of emulating the functionality of an organ, such as the brain, and its complexity, see Fig. 3. Our approach is inspired by the idea that the brain is subject to the process of evolution [54]. We introduced these changes in order to deal with the evolution of complex structures to be tested in a challenging classical problem of VA. As a result, BP exhibits the ability to design novel VA programs that achieve a lower false detection rate than those designed by human experts.

Fig. 3
figure 3

The analogy between the natural and artificial systems. The idea is based on the evolution of several mathematical operators, immersed or encapsulated within the artificial dorsal stream, that emulate the functionality of the corresponding natural areas

Genetic Representation of the ADS

In our work, the genetic representation of the ADS, the genotype, consists of a triplet of trees embedded within an ADS. In GP, programs are usually encoded as syntax trees made up of internal and leaf nodes, which are defined by a set of primitive elements also called function set, and terminal set. Thus, the sets of functions and terminals represent the problem search space. Note that we will refer to the genotype only by the triplet of trees although the final result, phenotype, is achieved by a more complex system; in this paper the ADS. Indeed, the genotype looks for a set of key operations that in combination with the whole structure define the desired design by mimicking the operations at different brain areas. Nevertheless, our implementation of brain programming follows a general GP-based approach since the genetic representation is the link to the real-world problem together with the definition of the fitness function. Figure 4 provides the classical diagram of the sequence of operations in an evolutionary algorithm where the evaluation of the fitness function is calculated through a measure applied to the ADS. In this section, we will explain the necessary steps that were taken to define the genetic representation of the ADS. The idea is to evolve the different VOs, as well as the operation in charge of the feature-integration stage. In the rest of the paper, we will refer to such operations as the evolved visual operators (EVOs) and the evolved feature-integration (EFI), resulting into an optimized saliency map (OSM). Note also that this last operation could be related to the concept of automatically defined functions [42] despite the fact that the EVOs working as terminals are not directly feeded into the EFI tree and that the EVOs are evolving within a fix structure.

Fig. 4
figure 4

Flowchart of the brain programming strategy

In this paper, each tree has its own sets of functions and terminals that were carefully chosen according to the desired functionality that we attempt to emulate. The first tree mimics the orientation, or the functionality of the orientation-sensitive cells in V1. Thus, we propose to evolve the operator for orientation (EVOO) through a set of elements specially selected to highlight edges, corners and other orientation-related features using the set of terminals and functions provided in Table 1. The notation is as follows: the input for the functions in F O can be any of the terminals in the T O set, as well as the composition among them; G σ are Gaussian smoothing filters with σ = 1,2; and D u represents the image derivatives along the direction \(u \in \{x, y, xx, yy, xy\}\).

Table 1 Functions F O and terminals T O applied for building the operator EVOO

The second operator encodes the color dimension or the operation of photoreceptor cells and color-sensitive cells present in the V1 and V4 layers of the visual cortex. Again, we propose to evolve the color visual operator (EVOC) to reproduce the color perception process. In fact, there is a theory that explains how color perception has evolved in nature [55]; thus, it makes sense to evolve such process. In the case of artificial evolution, the evolutionary process uses the set of functions and terminals provided in Table 2 to create the EVOC. The notation provided in Table 2 is as follows: the input for the functions in F C can be any of the terminals in T C, as well as the composition among them. Note that some functions of EVOC are the same of those in EVOO. In addition, the function complement() provides a negative image that is the complement of an intensity or RGB image; in other words, each pixel value is subtracted from the maximum pixel value and the difference is used as the pixel value in the output image. Thus, in the output image, dark areas become lighter and light areas become darker.

Table 2 Functions F C and terminals T C used by the operator EVOC

Finally, the third tree encodes the way in which the features are combined to obtain the OSM. This is accomplished with the idea of creating a fusion operator that highlights the features of the objects of interest. Here, the evolutionary method uses the set of functions and terminals listed in Table 3. Note that the set of terminals is completely different to the previous ones, and it incorporates the output of the previous stages of the ADS, not to be confused with the output of the EVOs. The functions are similar to some elements of the function set used in EVOO with the addition of exp() and the equalization of the histogram EQ().

Table 3 Functions and terminals applied within the EFI stage

Figure 5 shows an example of the proposed genotype that we are applying during the evolution. These three functions are embedded into the flow diagram of Fig. 2 and evolved with the evolutionary program outlined in Fig. 4. At the beginning of the BP algorithm, it receives as input the number of generations, the population size and the number of trees to be evolved per individual. All individuals of the initial population are a triplet of random trees indicated by the num_trees parameter. In our experiments, the number of generations, num_gen, is 30 and the number of individuals per population size_pop is also 30. The initialization method that is applied to the population is known as the ramped half-and-half method proposed by Koza [42]. This approach creates half of the initial population with the grow method and the other half with the full method. The grow method produces unbalanced trees allowing branches of different lengths while the full method makes balanced trees, with all branches of the same length. The length of the tree must not exceed the specified maximum to prevent that trees grow up without control during evolution. In our work, the size of the individuals should not exceed the user-defined maximum depth in order to avoid uncontrolled growing of the trees to prevent bloat. The depth of a tree is defined as the length of the longest non-backtracking path from the root to an endpoint. Tree depth limits the size of any given individual within the population and it is dynamically controlled by two maximum tree depths parameters. The first is the dynamic max depth which is a maximum tree depth that may not be surpassed by any individuals unless its fitness is better than the best solution found so far. If this happens, the dynamic max depth is augmented to the tree depth of the new fittest individual. Conversely, it is reduced if the new best individual has a lower tree depth. The second one is the real max depth that is hard limited and which no individual may surpass under any circumstances. Finally, the termination criterion was defined as the maximum number of generations; thus, the evolutionary process reaches at least a suitable top-down model during each single run. Table 4 summarizes the parameter values applied by BP during the experimental runs. These parameters have standard values [42].

Fig. 5
figure 5

Example of the ADS’s genotype

Table 4 Main parameters settings of the BP algorithm

In the following sections, the explanation of the GP-based evolutionary process is further detailed through the definition of the applied genetic operations and the criterion for evaluation. The idea is to outline our approach and illustrate a new way of obtaining VAPs.

Genetic Operations

After the initial population is created, and evaluated with the fitness function, the next step of the BP algorithm is the selection and genetic variation of individuals from the population in order to breed the next generation. This is accomplished through the genetic operations of selection, crossover and mutation, see Fig. 4. The parents are obtained using a fitness-proportionate selection method implemented with the roulette-wheel strategy, which consists in assigning to each individual a probability of selection proportional to their fitness value. Thus, the fittest ADSs are more likely to be selected, and the genetic crossover and mutation are sequentially applied until a new population is created. Note that the difference between GP and BP arises from the genotype representation, the hierarchical representation and special processes such as the Gaussian pyramid and center-surround process. In this way, the existence of a higher level of complexity within the genotype structure motivates the development of new genetic operators acting on different stratum that we call gene and chromosomal levels.

Both methods of crossover need a pair of parents to perform the corresponding operation. In the case of chromosomal crossover, the crosspoint is located at chromosome level and the corresponding parts of the genotype are swapped to produce the new offspring. For the gene crossover, multiple crosspoints are chosen randomly at each parent in the respective trees; then, the selected sub-trees are exchanged between the corresponding parents to create the new offspring; see Fig. 6. On the other hand, the chromosomal mutation is applied to the triplet of trees in such a way of randomly substituting all genes of the compound chromosome with new operators. Finally, the mutation at the gene level consists of the random selection of a node within a single tree, called point of mutation. This selected node is deleted and replaced by an entirely new sub-tree, which is randomly generated to create the new tree. This is carried out for each selected tree of the compound genotype, and the strategy is repeated to produce a new individual and eventually a whole new population; see Fig. 7. Note that if during the gene mutation, the selected node is the root of the tree, the whole tree will be substituted by a completely new random tree.

Fig. 6
figure 6

Schematics that illustrate the crossover operations at the chromosomal and gene levels. a Chromosomal Crossover, b Gene Crossover

Fig. 7
figure 7

Schematics that illustrate the mutation operations at the chromosomal and gene levels. a Chromosomal Mutation, b Gene Mutation

In our work, during the production of the new population, the genetic operators are applied to find new solutions based on a probability that is assigned for each operation with the scheme proposed by Koza [42]. In this case, the operations are computed independently and their total probability adds to one. Hence, the crossover probability at gene and chromosomal levels are equal to 0.4 each; while the mutation probability for both levels is equal to 0.2, see Table 4. These genetic operators allow the variation of the genetic material, while promoting genetic innovation of individuals through all levels, and maintaining the diversity of the population.

Fitness Function or Objective Function

In brain programming like in any evolutionary method, after the population is created, all individuals are evaluated, see Fig. 4. For this reason, it is necessary to formulate a well-posed fitness function in accordance with the goal that the ADS is attempting to reach. In other words, the fitness function of BP represents the characterization of the purpose, the answer to the question “What are the individuals for?” [28]. Thus, like in all optimization approaches, this step defines the way of implementing the goal within a computer program. In this work, we propose to use the F-measure within the BP process, as the fitness function for evaluating the ADS performance for each generated structure. Next, we explain how the fitness function was implemented to approach a VA problem.

This section introduces the F-measure as the criterion for evaluating the ADS within the proposed optimization framework. The F-measure was created in the information retrieval (IR) community. Van Rijsbergen [56] originally proposed this measure through the Effectiveness function (E). The definition is considered as the original formulation of the F-measure, see Equation (5).

$$E = 1 - \frac{1}{\alpha \frac{1}{p} + (1-\alpha)\frac{1}{r}},$$
(5)

where \(\alpha = \frac{1}{\beta^2 + 1}\) and \(0\leq \beta\leq \infty\). In the context of IR, the computation of recall and precision is defined in terms of a set of retrieved documents F and a set of relevant documents R:

$$r = \frac{|F\cap R|}{|R|} \quad and \quad p = \frac{|F\cap R|}{|F|},$$
(6)

where r is recall {r:0 ≤ r ≤ 1} and p is precision {p:0 ≤ p ≤ 1}. Note that the F-measure is the weighted harmonic mean of precision and recall. Moreover, the weights give the advantage of balancing the values of recall and precision. Thus, the general formula of the F-measure is defined as follows:

$$F_\alpha(p,r) = \frac{(1+\alpha)\cdot (p\cdot p r)}{(\alpha \cdot p + r)},$$
(7)

where α is the parameter that controls the balance between p and r with \(0\leq \alpha\leq \infty\). Note that if α < 1 the weight variable p is higher; while if α > 1, then the weight variable r is higher, and when α = 1, the precision and recall are said to be well-balanced.

In previous works, the F-measure has been used as an evaluation criterion for applications related to computer vision. For example, Pérez and Olague [57] proposed the F-measure as a quantitative measure for evaluating evolved SIFT descriptor operators for the classical problem of image matching. In the same year, Gimenez and Evans [58] proposed it as a method of comparison of different segmentation algorithms. Later, Atmosukarto et al. [59] used the F-measure as fitness function for a GP algorithm to learn descriptors of shape variations that are used to rate different facial abnormalities. These examples show that the F-measure has been successfully applied to image-related problems and we decide to test it within a well-known VA problem.

Figure 8 provides a sketch describing the main areas that were applied during the computation of the F-measure. This is applied to Itti’s database that was used as benchmark [14]. Note that a proto-object refers to the selected region issued by the VAP, while manual segmentation denotes the area in the image that has been manually segmented for training and testing. Thus, the selected manual segmentation of an object is considered as an ideal that the evolved models replicating the VA process should seek to attend. For the computation of the fitness function, we consider that the evolved ADS has only one chance for correctly identifying the manually segmented region that serves as ground truth. In order to estimate the VA performance with the F-measure, we compare the evolved proto-object with the manually segmented region. In this way, the relationship between the proto-object with the values of recall and precision must be defined. Thus, the overlapping region between the proto-object and the manually segmented one defines the area where the true positive pixels are located. The remaining areas outside the overlapping region refer to the false-negative and false-positive pixels. These concepts are formally defined next.

Fig. 8
figure 8

The diagram illustrates areas; manual segmentation, proto-object, overlapping region; that are used to compute precision and recall variables

Let M represent the set of pixels of the manually segmented region, and let A be the set of pixels attended by the ADS. Thus, the set of pixels O containing the overlapping region is described through the intersection of both sets O = AM. Once that this relationship is stressed, the next step is to substitute the corresponding variables in Equation (6) to derive the definitions for recall (r) and precision (p) as follows:

$$r = \frac{|O|}{|M|} \quad and \quad p = \frac{|O|}{|A|},$$
(8)

where |X| denotes the number of pixels in \(X \in \{M,A,O\}\). Hence, the following equation represents the criterion Q that is applied as the fitness function:

$$\begin{aligned} Q = {\text{argmax}} \left\{F_\alpha(P_{k,i}^{x},R_{k,i}^{x}) = \sum_{k=1}^{m} \sum_{i=1}^{n}\frac{(1+\alpha)\cdot (p_{i}\cdot p r_{i})}{(\alpha \cdot p_{i}) + r_{i}} \right\},&\\ {\text{where}} \quad Q\hbox{:}F_\alpha(P^{s},R^{s}) \geq F_\alpha(P^{t},R^{t}) &\\ \end{aligned}$$
(9)

with n representing the number of thresholds of the spread function, and m defines the number of training images. Precision data of an image are denoted with \(P^{x}_{k} = (p_{1}, p_{2}, \ldots, p_{n})\) and recall data by \(R^{x}_{k} = (r_{1}, r_{2}, \ldots, r_{n})\); where x represents the set of all possible solutions and \(s,t\subseteq x\). Therefore, Q represents the ascending ranking with the highest value corresponding to the focus of attention model x that best attends the particular object over all training images. Therefore, once again, the F-measure could be seen as a suitable approach whose results demonstrated the easiness and usefulness at the moment of specifying the goal of artificial evolution, which are both reflected on the quality achieved during the final results. In particular, the performance of the solutions is related to the focused regions containing the relevant objects that aim a specific purpose.

Experimental Results

This section presents an analysis and discussion about the experimental results. The experiments were designed considering a state-of-the-art image database composed of three different objects: red can, emergency warning triangle and traffic signals [14]. The databases used for training and testing can be obtained from the USC ilab Footnote 1, as well as the source code of the saliency toolbox Footnote 2. In this paper, each test consists of 30 independent runs of the proposed BP algorithm: the set of images for training and testing, as well as the parameters that were fixed during the whole run. The experiments were performed on a Dell Precision T7500 Workstation, Intel Xeon 8 Core, CPU E5506 at 2.13Ghz, NVIDIA Quadro FX 3800, running Linux OpenSUSE 11.1 operating system.

Image Database

In this paper, the experimental tests consist of three databases of natural color images that are used to compare the BP algorithm against four state-of-the-art methodologies [14]. The first database consists of 104 images containing a red can as target. This object in the first database is considered as the simplest target of the three due to its simplicity since it can be characterized by only one color feature. In fact, the results confirm this hypothesis. For the experiments, we follow the proposed protocol described in [14]. The first experiment uses 45 images for training and 59 images for testing. The second database is composed of 64 photographs containing a vehicle emergency triangle, which can be seen as a more complicated target. The database includes the warning triangle under different orientations, scale factors and illumination. The 64 images are divided in two sets of equal size for training and testing. In the case of the third database, the collection is composed of 90 images containing one or more traffic signals. This database contains a broad variety of targets with diverse colors, shapes, sizes and textures. For this reason, it is considered as a more complicated database. Here, the database is divided in two sets of 45 images. Finally, we would like to mention that the targets in the three databases could be partially occluded, opaque, with multiple backgrounds or suffering from reflection of light when viewed from different perspectives.

Analysis of Results

Next, we present the results of this work accompanied by an interpretation of the obtained data. The results of the BP algorithm show evidence of the relatively easiness at solving all databases, that were used as benchmark, while finding the best solutions in comparison with the four state-of-the-art algorithms [14].

BP Analysis

In this section, typical GP statistics are provided: fitness, population diversity, number of nodes and depth of trees, since these aspects are highly informative and commonly used to describe the whole GP process. Figure 9 depicts the statistics over 30 runs of the BP algorithm, each executed for 30 generations. All graphs describe the performance and complexity of the population during the training stage.

Fig. 9

Brain programming statistics along the execution of 30 experiments grouped by the target's database: a red can, b warning triangle and c traffic signal. In 1, the fitness chart shows the average, median and best fitness, while 2 plots the diversity of the population through the percentage of uniqueness in the EVOs and EFI, as well as the Euclidean distance among the population individuals. Finally, 3 and 4 depict the complexity of the structure of the EVOs and EFI based on the number of nodes and tree depth, respectively

The first example is the red can database problem. Figure 9a.1 presents the best-fitness graph, where we can observe how easily the red can attention problem is solved. Indeed, the problem is almost trivial for the ADS and cannot be considered a real test from the GP standpoint; nevertheless, it serves to illustrate the functionality of the BP algorithm. Note that from the first generation, the ADS scores a fitness value of 17, which is quite close to the best fitness value of 20, and that BP converges to the optimal ADS by the fifth generation. Moreover, the average and median population fitness improve only gradually through the whole run, despite the early convergence around the global optimum, where multiple local maxima exist from the genotype standpoint. Thus, the optimization provided by the BP algorithm achieves better ADS designs as evolution proceeds.

The graph in Fig. 9a.2 shows the loss of population diversity along the evolution according to the percentage of uniqueness. Such behavior of the BP strategy is characteristic of evolutionary computation methodologies. Controlling this aspect is of major importance, since population diversity promotes a broad exploration of the search space while preventing the population from prematurely converging to a local optimum. In this work, we measure population diversity over the triplet of trees in the BP population, computed as the percentage of unique EVOs and EFIs. The diversity graph depicts a decreasing trend in the level of uniqueness down to 15 %, probably due to early convergence. Diversity was also measured in the fitness space through the Euclidean distance among individuals, in this case between the fitness values of the ADSs; this measure accounts for the fact that the behavior of different programs can be very similar or even identical. As a result, the plot illustrates that while the uniqueness of the EVOs decreases significantly, the total difference in ADS performance remains constant.
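Both diversity measures can be sketched in a few lines (our illustration, with an assumed individual representation: the triplet of trees serialized as strings plus a scalar fitness).

    import numpy as np
    from itertools import combinations

    def population_diversity(population):
        # Genotypic diversity: percentage of unique trees per role.
        uniqueness = {
            role: 100.0 * len({ind[role] for ind in population}) / len(population)
            for role in ("EVO_O", "EVO_C", "EFI")
        }
        # Fitness-space diversity: mean pairwise (Euclidean) distance
        # between the scalar fitness values of the ADSs.
        fits = [ind["fitness"] for ind in population]
        mean_dist = np.mean([abs(a - b) for a, b in combinations(fits, 2)])
        return uniqueness, float(mean_dist)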

One severe problem in GP-based strategies relates to the tree representation: programs grow into larger and slower structures without any improvement in performance, which also hampers their generalization ability; this is known as the bloat problem. Figure 9a.3 and a.4 plot the complexity measured by the number of nodes and the depth of the tree structures. We can observe that in the BP experiment the complexity remains constant during the whole evolutionary run, keeping the average value around 2.0 per tree, far below the dynamic maximum depth value of 7. Hence, the overall structure allows an improvement in the measured fitness. We consider that the bloat problem is minimized because our design includes a hierarchical structure and several well-known principles of cognitive neuroscience; in this way, the methodology does not need to discover the whole algorithmic process for solving a given task.
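The two complexity measures can be computed with a short recursive routine like the following sketch, assuming trees are nested tuples whose first element is the function name and whose leaves are terminal strings.

    def nodes_and_depth(tree):
        """Return (number of nodes, depth) of a GP tree."""
        if not isinstance(tree, tuple):                  # a terminal leaf
            return 1, 1
        child_stats = [nodes_and_depth(c) for c in tree[1:]]
        nodes = 1 + sum(n for n, _ in child_stats)
        depth = 1 + max(d for _, d in child_stats)
        return nodes, depth

    # Example: the tree Dx(Im) - Dy(Im) has 5 nodes and depth 3,
    # well below the dynamic maximum depth of 7 used in the runs.
    print(nodes_and_depth(("-", ("Dx", "Im"), ("Dy", "Im"))))   # (5, 3)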

After examining the results achieved by BP on the emergency warning triangle database in Fig. 9b, it is possible to observe that the behavior of the fitness plot is similar to that on the red can database. In this case, however, convergence toward the maximum fitness value for all runs is achieved by the 13th generation. Furthermore, the average and median fitness improve smoothly along the run, which is the expected result for BP, see Fig. 9b.1. As in the previous experiment, diversity is calculated through the uniqueness measure of the EVOs, which decreases rapidly toward a value of around 25 %, while the Euclidean distance remains almost constant during the whole evolutionary run. In terms of complexity, Fig. 9b.3, b.4 shows a slight decrease in the number of nodes and depth level for all operators along the 30 generations, which is unusual behavior for a GP-based strategy; the averages reach values between 1.83 and 3.32 for the nodes, and between 1.8 and 2.29 for the depth levels.

The results achieved by the BP strategy on the third database, containing traffic signals, are shown in Fig. 9c. Contrary to the previous experiments, and even though the average best fitness maintains an increasing rate, the evolution does not converge toward the maximum fitness. On this occasion, the fitness scores an average maximum value of 16.98, with the last fitness improvement obtained during the 27th generation. Moreover, Fig. 9c.2 presents several levels of diversity that are clearly separated from one another, with the lowest percentages of uniqueness scoring 28.77 % for the EVOO, 25.11 % for the EVOC and 38.33 % for the EFI, while the Euclidean distance achieves a small gain. Regarding the level of complexity, see Fig. 9c.3, c.4: the EVOC shows a decrease in average toward 2.43 nodes and a depth level of 1.86; the EVOO shows an increase in average toward 3.5 nodes and a depth level of 3.5; and the EFI reaches 6.63 nodes and a depth level of 4.3.

These results show that the behavior of the BP algorithm is characterized by the image database. The BP strategy follows the regular trend of optimization approaches toward the best solution, with the exception of complexity, which seems to avoid the typically intricate designs of standard GP approaches. In fact, these results point to the enforcement of the parsimony principle, where the most effective solutions are also the simplest ones. Next, the final solutions are analyzed from the standpoint of their effectiveness; we then evaluate the performance of the proposed model in well-known visual attention experiments using a standard testbed, and finally repeat the above experiments with reduced sets of functions and terminals.

Effectiveness of Solutions

The best individuals obtained through the BP runs were tested with different sets of images; this process is known as the testing stage. In these experiments, each database was randomly divided into two subsets with a balanced number of images: the first subset, the training set, was used to learn the EVOs and EFI, and the second subset was used for testing the proposed model. All images are the originals of the standard test, and performance was measured as the number of false detections computed after failing to identify the most salient target [14]. All experiments consider 30 independent runs, and the average and standard deviation are reported.

In this section, the results of the best individuals from each BP run are listed and compared against the solutions obtained with four previous approaches developed by Itti and Koch [14], in order to study the problem of integrating the optimal features, produced through different visual modalities, into a single SM. That work studied four strategies for feature combination: a naive model, a model with non-iterative normalization N(.), a model with iterative normalization and a trained model, see Table 5. To enable a comparison of the results, we adopt the testing sets and the measurement criterion for the effectiveness of solutions as reported in [14]. In this way, the false detection rate, together with its standard deviation, is computed for the corresponding target. Note that a target is considered detected if the subset of its pixels falling within the attended region, also known as the proto-object, is non-empty.
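The detection criterion can be expressed directly in code; the sketch below (ours, with illustrative names) marks a target as detected when its pixel mask intersects the proto-object, and aggregates the per-image failures into a rate.

    import numpy as np

    def target_detected(proto_object, target_mask):
        # Detected if the target pixels falling within the attended
        # region (the proto-object) form a non-empty set.
        return bool(np.logical_and(proto_object, target_mask).any())

    def false_detection_rate(proto_objects, target_masks):
        misses = sum(not target_detected(p, m)
                     for p, m in zip(proto_objects, target_masks))
        return misses / len(target_masks)   # e.g., 5/59 = 0.08 for ADSC14 below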

Table 5 The false detection rate achieved after finding the target (mean ± standard deviation)

The results at the top of Table 5 provide the outcome of the four strategies tested on the three databases. According to these results, the best strategy is the trained model. The results achieved by the BP algorithm are also sorted, and those that actually outperform the trained model are highlighted in boldface; the best solution for each database is highlighted in italic boldface with an asterisk. Finally, in some problems the system was unable to find a solution for specific images during the testing stage; those results carry a superscript giving the precise number of images for which testing failed.

According to Table 5, the following can be inferred. First, regarding the red can target, the BP process finds a better solution than the trained model in 66.66 % of the runs, see Fig. 10a. The best ADS achieved its highest score during the 14th test run, being discovered during the third generation; this solution scores a false detection rate of 0.08, which corresponds to 5 out of 59 images. Furthermore, the ADSC14 does not record a single false detection during the training stage. For the warning triangle, BP obtains a better solution than the trained model 70 % of the time, see Fig. 10b. Here, the best ADST26 was acquired in the 26th trial and eighth generation; it achieves a false detection rate of 0.06, in other words, just 2 misses out of 32 testing images, without a single false detection during the training stage. Finally, we observe the difficulty of detecting the traffic signals, since these images are more complicated: the best discovered solutions outperformed the trained model in only 13.33 % of the runs, see Fig. 10c. Here, the best solution was discovered in the 22nd run at the 14th generation; the BP algorithm found the best ADSTS22 with a false detection rate of 0.13, equal to six failures out of the 45 testing images. Contrary to the previous results, the ADSTS22 reaches a score of 17.33 during the training stage, below the perfect fitness score of 20.

Fig. 10

The charts provide the false detection rate for the best individuals along 30 executions

In conclusion, BP was able to find, over 30 runs, better solutions than those obtained in previous research. Next, we describe the structural analysis of the solutions.

Structural Analysis of Solutions

In this section, the solutions are analyzed from the structural standpoint by describing the distribution of functions and terminals through histograms of their frequency of use, see Fig. 11. Our goal is to identify the best tree components for the three visual attention tasks. An important question is: which functions and terminals have the most influence on the VA task? We study the final best individuals along the 30 runs; functions and terminals are counted only if their respective EVO is combined by the EFI. The best solutions are listed in Tables 6, 7 and 8, according to the visual dimensions that were selected when finding a solution to the problem.
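The histograms of Fig. 11 amount to a simple counting pass over the best individuals; the following sketch (with an assumed representation) counts a tree's symbols only when its EVO is actually combined by the EFI, as stated above.

    from collections import Counter

    def usage_histogram(best_individuals):
        counts = Counter()
        for ind in best_individuals:            # the 30 best, one per run
            for evo in ind["evos"]:
                if evo["combined_by_efi"]:      # skip EVOs the EFI ignores
                    counts.update(evo["symbols"])
        return counts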

Fig. 11

The charts show statistics on the frequency of use of functions and terminals computed from the best individuals along 30 executions. Note that the terminal graphs, \(T_O\) and \(T_{FI}\), do not include the derivatives, since all of them were counted as functions

Table 6 The selection of 6 out of the 30 best ADSC classified according to the structural information
Table 7 The selection of 7 out of the 30 best ADST classified according to the structural information

In the experimental results for the red can, highlighted in red in Fig. 11, the best ADSC for VA utilize the color feature 70 % of the time. This result was expected, since color is considered the main characteristic of the red can; it is followed by orientation at a rate of 52 %. Within the color dimension, the most useful terminal was \(I_m\), appearing 68 % of the time, while the complement() function was applied only 20 % of the time. In terms of orientation, \(I_m\) was again the most useful terminal, appearing in 40 % of the ADSs, with similar results for the functions \(D_x D_y\) and \(G_{\sigma=1}\).

Final results for the warning triangle solutions ADST, highlighted in yellow, converged to functions and terminals along the orientation and color dimensions. Note that 80 % of the ADST utilize orientation while 50 % use color information, combined with the functions \(D_x D_y\), \(|+|\), \(+\) and \(\times\). As a result, the most prominent dimension was orientation. Indeed, the EVOO was basically a composition of the \(D_x\), \(D_y\), \(G_{\sigma = 2}\), \(\times\), \(\div\) and \(()^2\) functions, together with the terminals \(I_m\) and \(I_h\), see Fig. 11. The results for the traffic signals confirm that 97 % of the ADSTS are based on orientation and that the EVOO applies the terminals \(I_m\) and \(I_s\), as well as the functions \(D_x\), \(()^2\), \(|\cdot|\), \(\sqrt{\cdot}\) and \(G_{\sigma=1}\); the blue bars in Fig. 11 illustrate these results. Hence, we can say that the proposed optimization process is able to find good visual operators that extract the main characteristics according to the task at hand.
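To illustrate what such an evolved orientation operator looks like, the sketch below composes the primitives reported for the warning triangle EVOO (\(D_x\), \(D_y\), \(G_{\sigma=2}\), \(\times\) and \(()^2\)) on the magenta band. The band extraction from RGB and this particular composition are our assumptions, not the exact evolved tree.

    import numpy as np
    from scipy import ndimage

    def evo_o_example(img_rgb):
        rgb = img_rgb.astype(float) / 255.0
        i_m = 1.0 - rgb[..., 1]                     # assumed magenta band (CMY)
        dx = ndimage.sobel(i_m, axis=1)             # D_x
        dy = ndimage.sobel(i_m, axis=0)             # D_y
        # G_{sigma=2} smoothing of the product, then squaring.
        return ndimage.gaussian_filter(dx * dy, sigma=2) ** 2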

In this work, the best ADSs can be organized in terms of the most useful characteristics applied during the EFI stage. Table 6 presents three pairs of ADSC that exhibit the highest performance in the “can” experiments. Note that all operators base their process on the orientation and color dimensions, and that intensity was never selected among the optimal solutions for the red can problem. The first two individuals use the magenta band, since it best matches the color of the red can, while \(D_x\) was applied probably due to the vertical position of the can across all images in the testbed. On the other hand, ADSC11 and ADSC22 provide local minima that are very similar in structure, since both are based on \(I_m\) and \(D_x\). Note that the best ADSC14 applies a subtraction between \(D_x\) and \(D_y\) computed on their corresponding CMs.

Table 7 shows the selection of seven ADST that were able to focus reliably on the emergency warning triangle. Nevertheless, there are only six distinct ADSs, since the genotype, or structural composition, of ADST10 and ADST19 is identical; note that both avoid the color dimension. Thus, we can state that the algorithm discovered a local minimum whose overall performance scores identical results. Furthermore, the best solution ADST26 is also based on orientation, and its similarity to the previous ones is striking. ADST26 has the same fitness as ADST10 and ADST19, but differs slightly in structural composition, since at the moment of performing feature integration the individual applies an additional derivative to the conspicuity map. These results consider that the fitness function is evaluated in one attempt of the focus of attention, see “Fitness function or objective function” section. Nevertheless, ADST26 is better, since it has a lower false detection rate according to Table 5. Hence, the best individual ADST26 is a combination of \(D_x\) and \(D_y\) over the magenta band, while ADST3 and ADST7 also apply a nonlinear combination of functions working with the magenta color. The application of \(I_m\) and the \(D_x\) and \(D_y\) functions also appears in individuals ADST0 and ADST5, in combination with the red and hue bands. Overall, the best individual in this test was ADST26.

In the case of the traffic signals, the results are more complex than the two previous ones. The database for the traffic signals includes a wider variety of images; hence, the final solutions are more complex. In the final results, we can appreciate that some particular functions and terminals have been picked more frequently. Also, among the 30 best ADSTS, not one is exclusively based on the color feature. Therefore, we select only four ADSs, which are listed in Table 8. The first two, based on orientation, ADSTS12 and ADSTS22, are basically created through a combination of the \(I_m\), \(D_x\) and \(D_y\) terminals, as well as the Gaussian function; note that ADSTS22 is the best discovered solution as well as the simplest one. The next two ADSTS have a higher complexity and both use color and orientation data, although the color information consists only of the blue band. These ADSs are a few examples of the solutions that the BP process is able to find. Note that the best solutions do not necessarily include all VOs, because the EFI may exclude them. In the following section, we provide three ADSs as examples to visualize the information flow and illustrate the inner workings of the best attention process for each image database, see Figs. 12, 13 and 14.

Table 8 The selection of 4 out of the 30 best ADSTS classified according to the structural information
Fig. 12

Some details about the inner workings of the best individual ADSC14, which correctly focuses on the red can in the image database

Fig. 13

Some details about the inner workings of the best individual ADST26, which correctly focuses on the emergency warning triangle in the image database

Fig. 14

Some details about the inner workings of the best individual ADSTS22, which correctly focuses on the traffic signals in the image database

Final Designs for the Three Databases

This section presents three examples discovered with our optimization methodology that illustrate the functionality of the best ADS designs, using diagrams where key aspects of their structure are shown.

Figure 12 depicts ADSC14, used as a first example, where the following noteworthy results can be observed. The magenta color band is used in the EVOO and EVOC to filter out a large amount of unnecessary information, known in the literature as distractors: image regions that can mislead the location of the proto-object. This allows the system to highlight the red can more accurately. Moreover, the EVOO applies a \(D_x\) as a way of suppressing horizontal data while keeping the red can information. Next, the center-surround function eliminates the foreground information along the regions of high contrast. As a result, we obtain two conspicuity maps, CMO and CMC, that highlight and discriminate the red can region as the most prominent in the image. Afterward, the feature-integration stage computes the absolute value of the difference of the derivatives along the main directions, \(D_x\) and \(D_y\). Finally, the result is the OSM that indicates the location selected by the focus-of-attention system.
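The flow of Fig. 12 can be paraphrased in a few lines. This is our reading of the diagram rather than released code: the magenta band extraction is assumed, and a difference of Gaussians stands in for the pyramid-based center-surround operator.

    import numpy as np
    from scipy import ndimage

    def center_surround(vm, sigma_c=2, sigma_s=8):
        # Difference-of-Gaussians approximation of the center-surround step.
        return np.abs(ndimage.gaussian_filter(vm, sigma_c)
                      - ndimage.gaussian_filter(vm, sigma_s))

    def ads_c14(img_rgb):
        i_m = 1.0 - img_rgb.astype(float)[..., 1] / 255.0   # assumed magenta band
        cm_o = center_surround(ndimage.sobel(i_m, axis=1))  # EVO_O: D_x on I_m
        cm_c = center_surround(i_m)                         # EVO_C: the band itself
        # EFI: |D_x - D_y| computed on the corresponding conspicuity maps.
        osm = np.abs(ndimage.sobel(cm_o, axis=1) - ndimage.sobel(cm_c, axis=0))
        return osm                                          # optimized saliency map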

The next example corresponds to the warning triangle problem. Figure 13 shows the information flow of ADST26. Again, the solution works through the magenta color band, mainly due to the color of the emergency warning triangle; in this way, distractions like the white car in the background are eliminated. On the other hand, since there are multiple red objects, the partial results obtained after the visual maps show several regions that could become more prominent after the center-surround function. Nevertheless, besides the color information, the EVOO extracts orientation information through the application of the two derivatives \(D_y\) and \(D_x\). In this way, the color and orientation information is combined to highlight the main characteristics of the triangle. As a result, the EFI operator combines the orientation along the main directions of the image, resulting in an OSM where the most salient region corresponds precisely to the exact place of the emergency triangle. Indeed, ADSC14 and ADST26 work in very similar ways.

Finally, the last example illustrates the best individual discovered for the traffic signal problem. Figure 14 shows ADSTS22 working on a typical example. The ADSTS22 utilizes exclusively the orientation operator EVOO. Once again, magenta was the color band that the evolutionary algorithm selected for solving the VA task on the traffic signals. Moreover, the EVOO applies two \(D_x\) operations to remove horizontal features and preserve the vertical structure of the traffic signals. After the application of the center-surround function, resulting in the CMO, we can distinguish several conspicuous regions. However, the EFI operator subtracts \(\sqrt{CM_O}\) twice from the CMO, which can be interpreted as a filter that decreases the noise level of the CMO. Hence, the OSM presents a lower number of salient regions and, more importantly, focuses on the traffic signal region.
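Following the same conventions as the previous sketch, the flow of ADSTS22 reduces to a handful of lines; again, the band extraction and the difference-of-Gaussians center-surround are our simplifications of the diagram.

    import numpy as np
    from scipy import ndimage

    def ads_ts22(img_rgb):
        i_m = 1.0 - img_rgb.astype(float)[..., 1] / 255.0       # assumed magenta band
        vm = ndimage.sobel(ndimage.sobel(i_m, axis=1), axis=1)  # two D_x in cascade
        cm_o = np.abs(ndimage.gaussian_filter(vm, 2)            # center-surround as a
                      - ndimage.gaussian_filter(vm, 8))         # difference of Gaussians
        return cm_o - 2.0 * np.sqrt(cm_o)                       # EFI: subtract sqrt twice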

Final Experiments with Reduced Sets of Functions and Terminals

In this section, we proceed further with the analysis of the results from the previous experiments regarding the structural composition and the effectiveness of solutions. As already mentioned, the best individuals obtained through the proposed strategy do not use all the functions and terminals that were proposed empirically. Therefore, we tested the BP algorithm with reduced sets of functions and terminals based on the previous results. Tables 9, 10 and 11 provide the sets of applied and discarded functions and terminals for each experiment; the discarded sets represent the elements that were never used by any of the 30 best solutions of the corresponding test. Nevertheless, some patterns emerge from these results. For example, it is noteworthy that the three experiments produced different results, with the exception of the selection at the EFI stage, where all terminals were used and every function was applied at least once. Note also that, independently of the three tests, the set of functions \(\{+, \times, \div, |+|, |-|, \log_{2}()\}\) and the terminal \(\{I_c\}\) were never used by the EVOO, and the set of functions \(\{|+|, |-|\}\) was never used by the EVOC.
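In code, pruning the primitive sets amounts to simple set subtraction. The sketch below removes the elements that, per the observation above, were never selected by the EVOO across the three tests; the full primitive sets shown here are placeholders, since the complete lists live in Tables 9, 10 and 11.

    # Placeholder full sets; consult Tables 9-11 for the paper's exact lists.
    EVO_O_FUNCTIONS = {"+", "-", "x", "/", "|+|", "|-|", "log2", "Dx", "Dy",
                       "G1", "G2", "sqrt", "sq", "complement"}
    EVO_O_TERMINALS = {"Ir", "Ig", "Ib", "Ic", "Im", "Iy", "Ik", "Ih", "Is", "Iv"}

    # Never used by the EVO_O in any of the three experiments:
    DISCARDED_F = {"+", "x", "/", "|+|", "|-|", "log2"}
    DISCARDED_T = {"Ic"}

    REDUCED_F = EVO_O_FUNCTIONS - DISCARDED_F
    REDUCED_T = EVO_O_TERMINALS - DISCARDED_T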

Table 9 Sets of selected and discarded functions and terminals for the red can database, where \(I_{\alpha_1} \in \{ I_{\rm r}, I_{\rm b}, I_{\rm m}, I_{\rm y}, I_{\rm h} \}\) and \(I_{\alpha_2} \in \{ I_{\rm g}, I_{\rm c}, I_{\rm k}, I_{\rm s}, I_{\rm v} \}\)
Table 10 Sets of selected and discarded functions and terminals for the warning triangle database, where \(I_{\alpha_1} \in \{I_b, I_m, I_y, I_k, I_h\}\) and \(I_{\alpha_2} \in \{I_r, I_g, I_{c}, I_{s}, I_{v}\}\)
Table 11 Sets of selected and discarded functions and terminals for the traffic signals database, where \(I_{\alpha_1} \in \{ I_{\rm r}, I_{\rm g}, I_{\rm b}, I_{\rm m}, I_{\rm y}, I_{\rm s}, I_{\rm v} \}\) and \(I_{\alpha_2} \in \{ I_{\rm c}, I_{\rm k}, I_{\rm h} \}\)

Figure 15 shows the results of a new series of experiments testing the learning process of BP with the described reduced sets. Note that for all three experiments, the statistics improved significantly, especially those related to the fitness function: the average and median fitness present a lower standard deviation, while in the case of the traffic signals the results also show an improvement in the best fitness.

Fig. 15

Brain programming statistics along the execution of 30 experiments with the most utilized functions and terminals, grouped by the target's database: a red can, b warning triangle and c traffic signal. 1 shows the graph of best, average and median fitness; 2 plots the diversity of the population through the percentage of uniqueness in the EVOs and EFI, as well as the Euclidean distance among the population individuals. Finally, 3 and 4 depict the complexity of the structure of the EVOs and EFI based on the number of nodes and depth, respectively

Finally, from the structural standpoint, we can say that in this last series of experiments the evolutionary process converges toward the local minima reported in the “Structural analysis of solutions” section. For example, in the red can experiment, a solution similar to ADSC11 appears once, and two solutions similar to ADSC22 were found along with one exact copy of it. The same phenomenon occurs for the warning triangle, where the exact same individual ADST10 was rediscovered on three occasions. Contrary to these results, in the traffic signals test, neither identical solutions nor similar individuals were found. Table 12 shows the structure of the best individual, obtained during the 18th run at the 27th generation of this last experiment. This solution scores a false detection rate of 0.11, corresponding to five failures out of the 45 testing images; moreover, it achieves the highest training score of 19.11.

Table 12 The best triplet of trees achieved with the reduced set of functions and terminals for the traffic signals database

Conclusion

This study presented a novel GP-based algorithm that we call brain programming, together with a new VA model based on the artificial dorsal stream, which through the BP algorithm is able to design a suitable program for the problem of attending a specific target object. The BP algorithm is a new optimization/search approach that looks for the best way of identifying the visual features and the right way of combining them to highlight the visual information necessary for the task at hand. We conducted a visual search experiment using three different image databases to test our approach. The analysis of the results shows that BP is an efficient tool for optimizing ADSs, whose results outperform previous man-made VA models on a standard testbed.

Based on our experimental results along 30 evolutionary runs, we observe that BP provides suitable and coherent designs while maintaining simplicity in the solutions according to the task at hand. Regarding the EVOs embodied in the ADS structure, we found that they fulfill the parsimony principle; in other words, the system tends to offer the simplest solutions, since it is able to find the local minima.

In addition, we found interesting properties in the EVOs. For instance, we observe that some EVOs are highly specialized toward the extraction and combination of the main characteristics of the object of interest. Thus, for the red can attention problem, the EFIs focused on the application of the orientation and color dimensions, with color being the most prominent feature; the EVOO utilizes the magenta color band and the \(D_x\) function. This is understandable, since magenta is close in color space to the “red” of the can, while \(D_x\) enhances the appearance of its vertical position. The principal dimensions of the emergency warning triangle are also color and orientation. Here, most of the EFIs selected the orientation dimension: the EVOO utilizes \(D_x\) and \(D_y\) to highlight both vertical and horizontal edges, while the magenta band is applied to highlight the color of the triangle. Finally, regarding the traffic signals, the vast majority of EFIs are based once again on the orientation dimension EVOO, which is mainly composed of the \(I_m\) terminal and the \(D_x\) function. The traffic signals have several colors and shapes, and for this reason, it is more difficult for BP to select suitable features. In summary, our results suggest that through the application of the BP strategy, it is possible to automatically design VA systems for focusing on a specific target.

This research opens several avenues for future work. First, we plan to test the BP strategy for the ADS problem with a larger database of natural images in combination with pop-out search images. Second, the proposed strategy allows us to extend the ADS model by aggregating new feature dimensions, such as movement, disparity or shape. Finally, from the evolutionary computing standpoint, we would like to incorporate other evolutionary techniques, such as niching, in order to study the problem of diversity.