
1 Introduction

Object recognition is a fundamental task for humans and for all living beings endowed with the sense of sight, since it allows an organism to interact with its surrounding environment and to understand a given object. In the last two decades, computer vision and evolutionary algorithms have attracted growing interest from the scientific community [1]. In general, the human visual system recognizes and classifies an object according to its category with ease. Both tasks require that the set of attributes or features extracted from the images be general enough to classify the object as part of the category, while keeping in memory the features that identify that particular object within a given scene [2,3,4]. The aim of this paper is to present a new model, inspired by the transformations that take place within the visual cortex, for solving the object classification task.

Fig. 1.

Conceptual model of the Artificial Visual Cortex. The color image is decomposed into four dimensions (color, orientation, shape and intensity). Then, a hierarchical structure is in charge of solving the object classification problem through a function-driven paradigm.

Ungerleider and Mishkin proposed in 1982 the existence of two routes in the visual cortex. These pathways have been called the dorsal and ventral streams: the dorsal stream provides the location of an object within the scene, while the ventral stream is dedicated to the task of object recognition. Efficient visual functionality is thus achieved through a high interchange of information between the two streams [5,6,7,8,9,10]. In this way, object recognition involves processes performed along the dorsal stream, such as selectivity, defined as the ability to filter unwanted information, and processes performed in the ventral stream, which is in charge of describing objects. The approach proposed in this work is therefore a computational model based on these two information streams of the visual cortex. This approach differs from those of the state of the art, where a data-driven principle is applied using a set of patches – image regions – to create a dictionary of visual words, as in a bag-of-words approach [11,12,13,14,15]. In our work, the first hypothesis is that the dictionary of visual words can be replaced by a set of visual operators built from a group of mathematical functions. The second idea is based on the integration of the properties responsible for the visual attention process – or selectivity – which is related to the creation of conspicuity maps (CMs) and the center surround process, together with the description and combination of maximum responses, executed by a max operation over the functions that select the features which categorize the object; see Fig. 1. In this way, the model sees the brain as a collection of functions joined into a structural unit that serves the purpose of object recognition, where the functionality of each area or layer in the visual cortex is represented by a kind of mathematical function, and the interconnection among them is given by the hierarchical structure of the model.
Hence, each compound mathematical function mimics the functionality of its natural counterpart as a way of designing a set of virtual brain areas, called the Artificial Visual Cortex (AVC). Therefore, object categorization for the presence/absence problem is achieved through the application of the correct combination of functions within the AVC; an approach that we call brain programming [16,17,18].

In this manner, brain programming is defined by an evolutionary cycle. In this paradigm, the optimization problem is defined as the search for multiple parts embedded in a hierarchical system known as the AVC model, which plays a key role in the representation of solutions that are more complex than a single syntax tree. In this kind of system, the key parts of the hierarchical structure must be encapsulated in order to evolve them. Hence, by integrating the evolved operations within the complex structure, we were able to synthesize solutions for difficult problems; in this case, the object recognition problem [31].

The applicability and efficiency of this methodology have been described in several works [16,17,18,19,20,21]; nevertheless, an individual analysis of the AVC model and the brain programming methodology has not been carried out. As in many optimization paradigms, we propose to compare brain programming with a random search approach in order to characterize the benefits of the AVC model by itself and the improvement brought by the evolutionary approach. This paper focuses on the random search; we will explore the whole algorithm in a future article.

1.1 Research Contributions

This paper outlines the following research contributions:

  • First, in the proposed approach the total number of visual operators made of mathematical functions and embedded within the hierarchical structure can be discovered through a small number of random trials, while achieving outstanding results on a standard testbed. This article provides evidence that the hierarchical structure plays a significant role in the solution of visual problems.

  • Second, a comparison of the random search with state-of-the-art algorithms and the whole evolutionary cycle gives us a clear picture of the benefit of applying brain programming.

1.2 Related Work

Most works are divided into two basic approaches: the first concerns visual attention, conducted along the dorsal stream, while the second relates to object recognition, carried out in the ventral stream. Nevertheless, a few works have attempted to integrate the two approaches. For example, Fukushima in 1987 implemented a hierarchical neural network that serves as a model for selective attention and object recognition [22]. In this case, when several patterns are presented simultaneously, the model attends selectively to each one, segmenting it from the rest and recognizing it separately. Afterwards, Olshausen et al. in 1993 defined a biologically plausible model that combines an attentional mechanism with an object recognition process to form position- and scale-invariant representations of the visual world [23]. Then, Walther et al. suggested a combined model for spatial attention and object recognition [24]. In their work, visual attention follows the computational model proposed by Itti and Koch [15], and object recognition is achieved through the HMAX model of Riesenhuber and Poggio [25]. Their information stream follows the whole visual attention process, and the final saliency map is fed into the S2 layer of the HMAX model to accomplish the task of object detection. This model was applied to the problem of recognizing artificial paperclips. Next, Walther and Koch in 2007 suggested, with a computational model, that features learned by the HMAX model for the recognition of a particular object category may also serve for top-down attention tasks [26]. Finally, Heinke and Humphreys applied a model called SAIM to visual search involving simple lines and letters [27]. This model first selects the object within the image and subsequently performs an object identification step using a template matching technique.

In our work, we propose a hierarchical model following the preattentive stage of visual attention described in [28] in order to locate the conspicuity regions within the image. Then, a description process is performed using the max operator in combination with a series of functions that emulate the functionality of the V4 area in the visual cortex. This approach differs from traditional models for object recognition [4, 12, 13, 25,26,27] where a set of patches – or visual words – are used to identify the object. In contrast, in our proposed approach the discovered functions provide the functionality of multiple patches; hence, helping in the creation of a straightforward process as will be shown in the experimental results.

Fig. 2.

Schematic representation of the computational algorithm whose output is a label that represents the membership to a specific class.

2 The AVC Algorithm

In the natural system, the interrelation between the layers of the visual cortex is not fully understood; nevertheless, the functionality at each stage has been described in previous works. Figure 1 depicts the proposed model based on these processes. The AVC is divided into two main parts. In the first stage, the proposed system performs the acquisition and transformation of features. Then, in a second stage, the AVC performs the description and classification associated with the studied object.

2.1 Acquisition and Transformation of Features

The first step of our algorithm is represented by the image acquired with the camera, whose natural counterpart is the retina. Here, the system considers digital color images in the RGB color model, which are later transformed into the CMYK and HSV color models; see Fig. 2. In this way, the color image is decomposed into multiple color channels. The idea is to build the set \(I_{color} =\{I_r,\) \( I_g,\) \( I_b,\) \( I_c,\) \( I_m,\) \( I_y,\) \( I_k,\) \( I_h,\) \( I_s,\) \( I_v\}\), whose elements correspond to the red, green, blue, cyan, magenta, yellow, black, hue, saturation, and value components of the respective color models and provide the initial representation of the scene. The input images in \(I_{color}\) are transformed by four visual operators (VOs), applied independently, to emphasize specific image features. These transformations recreate the feature extraction process of the brain, resulting in a visual map (VM) per dimension [28].
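As a rough sketch of this decomposition, the snippet below maps one RGB pixel to the ten channels of \(I_{color}\); the function name and the exact CMYK conversion formula are illustrative assumptions, not taken from the original implementation.

```python
import colorsys


def decompose_pixel(r, g, b):
    """Map an RGB pixel (components in [0, 1]) to the ten I_color channels.

    Hypothetical helper; the paper only specifies the set of channels,
    not this particular conversion code.
    """
    # CMYK: k measures the lack of brightness, c/m/y the remaining ink.
    k = 1.0 - max(r, g, b)
    if k < 1.0:
        c = (1.0 - r - k) / (1.0 - k)
        m = (1.0 - g - k) / (1.0 - k)
        y = (1.0 - b - k) / (1.0 - k)
    else:
        c = m = y = 0.0  # pure black: cyan/magenta/yellow are undefined

    # HSV via the standard library; hue is normalized to [0, 1).
    h, s, v = colorsys.rgb_to_hsv(r, g, b)

    return {'r': r, 'g': g, 'b': b, 'c': c, 'm': m, 'y': y,
            'k': k, 'h': h, 's': s, 'v': v}
```

Applied to every pixel, this yields the ten single-channel images of \(I_{color}\) on which the four operators \(VO_d\) act.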

2.2 Feature Dimensions

The VOs are defined with the aim of emphasizing specific image features along several dimensions: color, shape, orientation and intensity; hence, \(d \in \{C, S, O, Int\}\). Figure 2 shows that features are extracted sequentially, one at a time, by applying the corresponding operator \(VO_d\).

2.3 Center Surround Process

The center surround method is based on the functionality of the ganglion cells, which measure the difference between the firing rates at the center and the surrounding areas of their receptive fields. The goal of this process is to generate a conspicuity map (CM) per dimension according to the model proposed in [29]. The algorithm consists of a two-step process in which the information is built to emulate its natural counterpart as follows; see Fig. 2. First, the computation of the CMs is modeled as the difference between fine and coarse scales, computed through a pyramid of nine levels \(P_d^{\sigma } =\) \( \{P_d^{\sigma =0},\) \( P_d^{\sigma =1},\) \( P_d^{\sigma =2},\) \( P_d^{\sigma =3}, \) \( \ldots ,P_d^{\sigma =8}\}\). Each pyramid is calculated from its corresponding \(VM_d\) using a Gaussian smoothing filter that yields an image half the size of the input map; the process is repeated recursively eight times to complete the nine-level pyramid. Second, the pyramid \(P_d^{\sigma }\) is used as input to a center surround procedure that derives six new maps, each computed as the difference between two pyramid levels as follows.

$$Q_d^j=P_d^{\sigma = \lfloor \frac{j+9}{2}\rfloor +1} - P_d^{\sigma = \lfloor \frac{j+2}{2}\rfloor +1} ,$$

where \(j \in \{1,2,\ldots ,6\}\). Note that the levels of \(P_d^{\sigma }\) have different sizes and are scaled down to the size of the top level to calculate their difference. Next, each of these six maps is normalized, and all six are combined into a unique map through summation; the result is then normalized and scaled up to the original size of the \(VM_d\) maps using polynomial interpolation to define the final \(CM_d\).
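The pyramid construction and the six difference maps can be sketched as follows. This is a minimal NumPy illustration under stated simplifications: a 2x2 box average stands in for the Gaussian smoothing filter, repeated halving replaces interpolation when matching level sizes, and the final normalization, summation and upscaling steps are omitted. All function names are hypothetical.

```python
import numpy as np


def halve(img):
    """Smooth and downsample by a factor of two (box average as a stand-in
    for the Gaussian filter of the paper)."""
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2] +
            img[0::2, 1::2] + img[1::2, 1::2]) / 4.0


def shrink_to(img, shape):
    """Scale a larger pyramid level down to a smaller level's size."""
    while img.shape[0] > shape[0]:
        img = halve(img)
    return img


def center_surround(vm):
    """Nine-level pyramid P_d^sigma plus the six maps Q_d^j of the equation."""
    pyramid = [vm.astype(float)]
    for _ in range(8):                        # sigma = 1 .. 8
        pyramid.append(halve(pyramid[-1]))

    q_maps = []
    for j in range(1, 7):
        a = pyramid[(j + 9) // 2 + 1]         # level floor((j+9)/2)+1
        b = pyramid[(j + 2) // 2 + 1]         # level floor((j+2)/2)+1
        q_maps.append(a - shrink_to(b, a.shape))
    return pyramid, q_maps
```

For \(j=1\) this subtracts level 2 (rescaled) from level 6, matching the index formula in the equation above.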

2.4 Description and Classification Stage

After the construction of the CMs, the next stage along the AVC is to define a descriptor vector that will be used as input to a support vector machine (SVM) model for classification purposes.

2.5 Computation of the Mental Maps

In this stage of the process, a single set of visual operators is used to produce a mental map (\(MM_d\)) per dimension; see Fig. 2. After the computation of the conspicuity maps, a set of visual operators \(VO_{MM}\) is applied with the aim of describing the image content. Note that the proposed visual operators are homogeneous and are applied independently to each feature dimension. This operation is defined as follows:

$$\begin{aligned} MM_d=\sum _{i=1}^{k}(VO_{MM_i}(CM_d)), \end{aligned}$$
(1)

where d is the dimension index and k represents the cardinality of the set \(VO_{MM}\). The summation integrates the outputs of all operators \(VO_{MM_i}\) to produce one \(MM_d\) per dimension. Thereafter, the four mental maps are concatenated into a single array, and the n highest values are selected to define the vector \(\overrightarrow{\nu }\) that describes the image.
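A minimal sketch of Eq. (1) and the descriptor construction, assuming the conspicuity maps are NumPy arrays and using placeholder operators in place of the evolved set \(VO_{MM}\):

```python
import numpy as np


def mental_map(cm, vo_mm):
    """Eq. (1): sum the responses of the k operators in VO_MM over one CM_d."""
    return sum(op(cm) for op in vo_mm)


def describe(cms, vo_mm, n):
    """Concatenate the four mental maps and keep the n highest values
    to form the descriptor vector fed to the SVM."""
    mms = [mental_map(cm, vo_mm) for cm in cms]      # one MM per dimension
    flat = np.concatenate([mm.ravel() for mm in mms])
    return np.sort(flat)[::-1][:n]                   # descriptor vector
```

Here the two lambda operators used below are purely illustrative; in the AVC the operators are the randomly generated (or evolved) syntax trees.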

In contrast to our proposal, the state-of-the-art methodologies [11,12,13,14,15] are based on a template matching paradigm with the goal of learning a set of prototype image patches. Traditionally, the idea is to learn such a set by using what is known as the bag-of-words model, which is applied to identify a given object category. In this way, our approach substitutes the set of templates with the set of visual operators to characterize one object class with excellent results as we will show in the experiments.

3 Experiments and Results

We use the CalTech 5 and CalTech 101 image databases, despite serious concerns raised about them [30, 31]. Nevertheless, these databases are still widely used in the object recognition community, and most state-of-the-art algorithms report classification results on them [12, 13, 32,33,34,35,36].

3.1 Methodology to Obtain an AVC Solution

The methodology used to generate the AVC programs followed the algorithm of Sect. 2, where an important step is the construction of the VOs. These operators consist of syntax trees made of internal and leaf nodes, defined by a set of primitive elements also called the function set (see Table 1) and by a terminal set given by the domain of each function. In our work, each tree has its own sets of functions and terminals, carefully chosen according to the functionality that we attempt to emulate within the AVC. All VOs were generated through a random procedure with a maximum depth of 5 levels, where half of the trees were balanced trees and the other half were constructed as arbitrary trees, adding nodes until the maximum depth is reached.

Table 1. Functions for the visual operators (VOs).
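The random construction of the syntax trees can be sketched as follows; the function and terminal sets shown are illustrative placeholders for those of Table 1, and the balanced/arbitrary split mirrors the full and grow initialization methods of genetic programming.

```python
import random

FUNCTIONS = {'+': 2, '-': 2, '*': 2, 'abs': 1}    # placeholder function set
TERMINALS = ['Ir', 'Ig', 'Ib', 'Ih', 'Is', 'Iv']  # placeholder terminal set


def random_tree(depth, full):
    """Grow a syntax tree of at most `depth` levels.

    full=True builds a balanced tree (internal nodes until the maximum
    depth); full=False may place a terminal at any level, yielding an
    arbitrary tree."""
    if depth == 1 or (not full and random.random() < 0.5):
        return random.choice(TERMINALS)
    f = random.choice(list(FUNCTIONS))
    return (f,) + tuple(random_tree(depth - 1, full)
                        for _ in range(FUNCTIONS[f]))


def tree_depth(t):
    """Depth of a tree represented as nested tuples with terminal strings."""
    return 1 if isinstance(t, str) else 1 + max(tree_depth(c) for c in t[1:])


def initial_population(size, max_depth=5):
    """Half balanced trees, half arbitrary trees, as described above."""
    return ([random_tree(max_depth, full=True) for _ in range(size // 2)] +
            [random_tree(max_depth, full=False)
             for _ in range(size - size // 2)])
```

Each generated tree would then be interpreted as one VO by evaluating it over the image channels its terminals name.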

The proposed methodology for designing AVCs to study the absence/presence classification problem is divided into three steps. The first two steps define the training stage, while the last one is devoted to the testing stage. In this way, all image databases were randomly divided into three subsets per class, one per step. This process is detailed next.

  1.

    The first step starts by randomly generating a set of VOs to be used inside the AVC structure. Then, it proceeds to the training stage of the SVM using the images from the first subset, called training-A. As a constraint, if the SVM achieves a given threshold of classification accuracy during training, the process continues to step 2; otherwise, the VOs together with the SVM are discarded and the process is restarted.

  2.

    At step 2 the system uses the set of VOs found in step 1 but trains a new SVM with the second image subset, called training-B. Once again, if the SVM reaches the given threshold of classification accuracy, the process continues to step 3 and the AVC structure is considered a solution; otherwise, both the VOs and the SVM are discarded and the search resumes at step 1.

  3.

    In the last step, the best AVC structures are tested by classifying the third image subset. The testing is performed with the SVM estimated in step 2 and the VOs from step 1. The whole process is repeated until the best set of solutions is discovered.
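The three steps above can be sketched as a single search loop; the VO generator, SVM trainer and accuracy measure are passed in as placeholders, since their actual implementations are outside this sketch.

```python
ACCURACY_THRESHOLD = 1.0   # the experiments require 100% training accuracy


def random_search(make_vos, train_svm, accuracy, sets, max_trials=10000):
    """Sketch of the three-step random search.

    make_vos, train_svm and accuracy are hypothetical callables standing in
    for the random VO generator, the SVM training routine and the
    classification-accuracy measure."""
    train_a, train_b, testing = sets
    for trial in range(1, max_trials + 1):
        vos = make_vos()                              # step 1: random VOs
        svm_a = train_svm(vos, train_a)
        if accuracy(vos, svm_a, train_a) < ACCURACY_THRESHOLD:
            continue                                  # discard and restart
        svm_b = train_svm(vos, train_b)               # step 2: new SVM
        if accuracy(vos, svm_b, train_b) < ACCURACY_THRESHOLD:
            continue
        # step 3: evaluate the surviving AVC on the held-out subset
        return vos, svm_b, accuracy(vos, svm_b, testing), trial
    return None
```

In practice the loop is repeated until the desired number of solutions has been collected, and the trial counter gives the number of random runs reported in Table 2.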

Finally, all experimental results are provided in the following sections.

Table 2. Total number of random runs needed to discover 100 solutions per class for all subset sizes.
Fig. 3.

Sample images from the CalTech-5 database, together with the background category from the CalTech-101 database.

3.2 Experimental Evaluation of the AVC for Classification of Color Images

In a first experiment, the performance of the proposed model was evaluated through a binary test using five classes from the Caltech-5 database in combination with the Google background of Caltech-101; see Fig. 3. The goal is to analyze the effect of training sets of different sizes on recognition performance. Thus, the AVC model was trained with randomly selected positive images defining training-A subsets of sizes 1, 10, 20, 30, 40, 50, 60, and 70, while a constant subset of 50 negative images was used in all experiments. In the case of a single positive training image, no AVC was found after 7500 random evaluations; hence, this case was discarded from further tests. The training-B subset was set to 50 positive and 50 negative images. Table 2 provides the number of random runs that were necessary to discover the solutions. The experiment was repeated until 100 AVCs were found for each training subset size, producing a total of 700 solutions per class with 100% accuracy during the training stage. All these solutions were tested, and the mean and standard deviation are reported in the following section.

3.3 Testing the Performance of the Random Search

Table 3 presents a summary of the experiment, showing the average, standard deviation, maximum and minimum performance for the testing stage. All results were normalized between 0 and 1, so that 1 represents 100% classification accuracy. The best solutions were obtained for the airplanes, faces and leaves classes, scoring 95%, 99% and 97%, respectively, while for the cars and motorcycles classes the best solutions scored classification accuracies of 77% and 75%, respectively. Note that these final scores are similar regardless of the subset size used during the training stage. Moreover, the solutions whose scores are highlighted in bold in Table 3 are provided with their corresponding formulae in Table 4.

Table 3. This table shows a summary of the results of the AVC testing which were obtained with a random search and using the color category background from CalTech-101 database.
Table 4. This table shows the best solutions that were discovered after a random process.

3.4 Comparison Between the AVC and HMAX Models

The HMAX model was used in a second series of tests based on the experimental design proposed in [12], in order to compare our results with the state of the art. Thus, once again, the solutions from the first experiment were tested on the object present/absent problem with a new random set of images, considering 50 positive images for the object classes selected earlier, as well as 50 negative images selected from the Caltech-5 background database. The aim is to investigate the recognition performance, measured by accuracy, of the 700 final solutions per class. Note that for this test the background images are in gray scale; hence, the color components of each image were initialized with the same value. The results are summarized in Table 5, and the comparison between our model and the HMAX model is provided in Table 6. We report the error rate at the equilibrium point as the performance measure in these experiments. To show that the differences between the performances of the proposed AVC and HMAX-SVM models are statistically significant, we used two non-parametric statistical tests: the Wilcoxon rank-sum test [37] and the two-sample Kolmogorov-Smirnov test [38]. These tests were applied to the 30 best random solutions out of the 700 found for each class.
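For reference, the two-sample Kolmogorov-Smirnov statistic used in these comparisons is simply the maximum distance between the two empirical distribution functions. Below is a minimal pure-Python sketch of the statistic; in practice a library routine such as SciPy's `ks_2samp` would report it together with its p-value.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    values = sorted(set(a) | set(b))
    d = 0.0
    for v in values:
        cdf_a = sum(x <= v for x in a) / len(a)   # empirical CDF of a at v
        cdf_b = sum(x <= v for x in b) / len(b)   # empirical CDF of b at v
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A statistic near 0 indicates the two accuracy distributions are similar; a statistic near 1 indicates they barely overlap, which is the situation a significant difference between models produces.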

Table 5. This table summarizes the classification results achieved during testing using the Caltech-5 background database as the negative class. Note that the performance is better than in the previous experiment, since the background consists of gray-tone images.
Table 6. This table compares the performance achieved by the HMAX model, considering the boost and SVM classifiers, and the AVC model. Note that in the case of the HMAX model a learning process was applied in order to identify the best patches, whereas for the AVC model only random sampling was used to discover the best solution.

Using the whole evolutionary process, we present in Table 7 the solutions discovered by brain programming. The selection process was implemented following the roulette-wheel strategy, which assigns to each individual a selection probability proportional to its fitness value. The termination criterion was a maximum number of generations; in this case, 30 generations with 30 solutions per generation. The aim is that each evolutionary run reaches an optimal AVC program. Note that the performance of each solution in testing is 100% classification accuracy.
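Roulette-wheel selection as described above can be sketched as follows; the random source is injectable only to make the example deterministic, which is an implementation convenience rather than part of the method.

```python
import random


def roulette_select(fitness, rnd=random.random):
    """Pick an index with probability proportional to its fitness value.

    fitness: list of non-negative fitness values, one per individual.
    rnd: callable returning a uniform sample in [0, 1)."""
    total = sum(fitness)
    r = rnd() * total          # spin the wheel
    acc = 0.0
    for i, f in enumerate(fitness):
        acc += f               # each slot's width is its fitness
        if r < acc:
            return i
    return len(fitness) - 1    # guard against floating-point round-off
```

An individual with fitness 6 in a population whose fitnesses sum to 10 is therefore selected about 60% of the time.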

Table 7. This table shows some solutions that were discovered by the evolutionary process of brain programming.

4 Conclusions and Future Work

This paper presented a computational model of the visual cortex following the hierarchical structure of previous visual attention and object recognition proposals. As a result, the proposed methodology replicates the initial stages of the artificial dorsal stream using four dimensions: color, shape, orientation and intensity, in combination with the final stages of an artificial ventral stream, which are used to approach the task of object categorization. The overall approach considers that the process of visual information extraction and description can be achieved by function composition through a set of mathematical operations used in the aforementioned stages. According to the results, all functions embedded within the hierarchical structure of the AVC can be easily discovered through random search while achieving excellent results on the Caltech database. In this sense, we presented examples that illustrate the behavior of the discovered AVCs for the faces and motorcycles problems of the Caltech categorization task. Finally, we provided a comparison with the HMAX model, which is considered the reference for this kind of approach, and found that the AVC model was superior according to the results obtained on the Caltech testbed.

As a conclusion, we can say that the AVC methodology offers a new perspective on the study of artificial brain development, since the structural complexity can be improved once the approach is framed as an optimization problem. This article provides some results of the whole evolutionary cycle for the studied database; indeed, the results score perfect accuracy except in one case. In this way, we can synthesize new structures according to the task at hand. As future research, we would like to test the approach on more complex datasets, such as the GRAZ and VOC challenge datasets. Additionally, since the methodology is computationally costly, we propose to move towards a parallel implementation of the AVC model using GPGPU technology [39]. Finally, we would like to explore the application of this new paradigm to problems in humanoid robotics.