1 Introduction

Deep convolutional neural networks (DCNNs) have been extremely successful in a wide range of computer vision applications, rivaling or exceeding human benchmark performance in key visual challenges such as object and face recognition (He et al. 2015; Sun et al. 2015; Jiang et al. 2022) or scene categorization (Stivaktakis et al. 2019). However, state-of-the-art DCNNs require too much energy, computation, and memory to be deployed on most computing devices and embedded systems (Goel et al. 2020). In contrast, the brain is masterful at representing real-world objects with a cascade of reflexive, largely feedforward computations (DiCarlo et al. 2012) that rapidly unfold over time (Ales et al. 2013; Cichy et al. 2016) and rely on an extremely sparse, efficient neural code (for a recent review, see Beyeler et al. 2019). For example, in macaques, faces are processed in localized patches along the Superior Temporal Sulcus (STS), where cells detect distinct constellations of face parts (e.g., eyes, noses, mouths), and whole faces can be recognized from a linear combination of neural responses within these face patches (Chang and Tsao 2017; Majaj et al. 2015).

In recent years, spiking neural networks (SNNs) have emerged as a promising approach to improving the efficiency and biological plausibility of neural networks such as DCNNs, owing to their potential for low power consumption, fast inference, event-driven processing, and asynchronous operation (Gerstner and Kistler 2002; Stuijt et al. 2021). To facilitate learning in such networks, new learning algorithms with varying degrees of biological plausibility have also been developed recently. For instance, spike-timing-dependent plasticity (STDP) is an unsupervised learning rule that is observed in biological systems (Bi and Poo 1998; Caporale and Dan 2008; Falez et al. 2019) and that can be used to extract the most notable spike patterns (Feldman 2012; Brzosko et al. 2019; Hao et al. 2020) by adjusting the efficacy of synaptic connections based on the relative timing of presynaptic and postsynaptic spikes. Studying how object recognition may be implemented using biologically plausible learning rules in SNNs may not only further our understanding of the brain, but also lead to the development of energy-efficient systems implementable on neuromorphic hardware.

Here we present an SNN model that uses spike-latency coding (Chauhan et al. 2018, 2021) and winner-take-all inhibition (WTA-I) (Maass 2000) to efficiently represent visual stimuli using multi-scale parallel processing. Part of this work (Sanchez-Garcia et al. 2022) was previously presented at the CVPR’22 NeuroVision workshop. Given an input image, stimuli were preprocessed with parallel spatial frequency (SF) channels mimicking the sensitivity of neurons in early visual cortex (De Valois et al. 1982a). The resulting combination of the SF channels was then fed to a layer of spiking neurons whose synaptic weights were updated using STDP (Gütig et al. 2003b). We show that STDP can learn efficient object representations from the MNIST (LeCun 1998), FASHION-MNIST (Xiao et al. 2017), CIFAR10 (Krizhevsky and Hinton 2009), and ORL (Samaria and Harter 1994) datasets. In addition, we investigate how the quality of the represented objects changes under different SF bands and WTA-I schemes. Remarkably, our network is able to represent objects with as few as 200 neurons and 15 spikes per neuron.

The rest of the paper is organized as follows: Sect. 2 briefly reviews the most relevant related work. Section 3 describes the main framework and the model equations. Next, we report the results of a computational study in which we explored the quality of the represented objects and the sparsity trade-off for the different network schemes (Sect. 4). Finally, Sect. 5 provides a brief discussion that summarizes the main results and offers some perspectives.

2 Related work

Significant efforts have been expended in recent years to demonstrate the efficacy of SNNs with STDP in object recognition applications (Vigneron and Martinet 2020; Liu et al. 2020; Fu and Dong 2021). Previous studies have used STDP to extract visual features of low or intermediate complexity from images without supervision. Yu et al. (2013) proposed a novel SNN with a supervised learning rule and temporal coding scheme to generate temporal spike patterns, which could be used to classify a subset of handwritten digits from the MNIST database. Liu and Yue (2016) combined Gabor filter banks with rank-order coding and STDP to push the MNIST classification rate to 82%. Beyeler et al. (2013) achieved 92% on MNIST using a calcium-based STDP learning rule, which was later surpassed by Diehl and Cook (2015) using standard STDP and lateral inhibition. Masquelier and Thorpe (2007) used the STDP rule in an asynchronous feedforward SNN that mimics the ventral visual pathway and showed the emergence of selectivity to intermediate-complexity visual features when the network was presented with natural images.

More recent articles designed deep SNNs comprising several convolutional and pooling layers trainable with either standard STDP (Kheradpisheh et al. 2018) or reward-based STDP (Mozafari et al. 2019). Bing et al. (2019) used a supervised reward-modulated STDP learning rule to train two SNN-based sub-controllers on obstacle avoidance tasks. Zhou and Li (2022) proposed an SNN with STDP learning and first-spike coding to extract object features from Gabor filters and event-driven convolutions.

Studying how object recognition may be implemented using biologically plausible learning rules in SNNs may not only further our understanding of the brain, but also lead to new efficient artificial vision systems.

Numerous studies in visual neuroscience demonstrated the existence of multiple channels, or multiple receptive field (RF) sizes, in early visual cortex and their implications for the processing of the spatial frequency (SF) content of images during object recognition (Kauffmann et al. 2014; Ginsburg 1986; Field 1987; Tolhurst et al. 1992; Hughes et al. 1996). Because RFs of neuronal populations in the visual pathway vary in size, the responses of different subsets of neurons would constitute a neural representation at some particular scale, allowing us to represent visual scenes as a combination of SF channels (Campbell 1973).

Fig. 1

Multi-scale network, illustrated using images from the ORL dataset (Samaria and Harter 1994). Images were convolved with ON and OFF center/surround kernels to simulate LGN responses. To simulate the multiple SF channels of the visual system, we used a preprocessing scheme in which LGN maps were obtained from spatial filters at low, medium and high spatial frequencies (further illustrated in Fig. 2). The three LGN responses were added, converted to spike latencies, and fed to a one-layer spiking neural network (SNN) of integrate-and-fire neurons with plastic synapses implementing spike-timing-dependent plasticity (STDP) and winner-take-all inhibition (WTA-I). The propagated LGN spikes contributed to an increase in the membrane potential of V1 neurons until one of the V1 membrane potentials reached threshold, resulting in a postsynaptic spike and inhibition of all other V1 neurons until the next iteration. Objects were reconstructed by taking a linear combination of spiking activity across the V1 population

Selectivity for SF is one of the fundamental and most thoroughly studied properties of visual neurons (Henriksson et al. 2008; Shapley and Lennie 1985; De Valois et al. 1982b). The primary visual system processes low-level and high-level stimulus properties using inputs from the retina via the lateral geniculate nucleus (LGN). In the earliest stages of the visual pathway, the processing of different stimulus attributes occurs in a parallel fashion. This means that images are filtered by parallel, SF-selective channels (Enroth-Cugell and Robson 1966), which may converge in V1 (Nassi and Callaway 2009). The visual information from the LGN passes through V1 and multiple strategies might be used to transfer parallel input into multiple output streams.

3 Methods

Fig. 2

LGN preprocessing. To simulate the computations performed by the retinal ganglion cells and the LGN, the images were convolved with ON and OFF center-surround kernels (Chauhan et al. 2018). Specifically, we chose three sizes based on an earlier study (Chauhan et al. 2018): 0.375\(^{\circ }\)/0.75\(^{\circ }\) for low SF, 0.25\(^{\circ }\)/0.5\(^{\circ }\) for medium SF and 0.125\(^{\circ }\)/0.25\(^{\circ }\) for high SF (Solomon et al. 2002). The resulting images processed with these filters correspond to low-scale, medium-scale and high-scale LGN maps, respectively

Fig. 3

Example RFs of three representative neurons (columns in each panel) of the simulated population for low-scale, medium-scale, high-scale and multi-scale networks (rows). With STDP, neurons progressively learned features corresponding to prototypical patterns that were both salient and frequent

3.1 Network architecture

The network architecture of our model is shown in Fig. 1. Inspired by Chauhan et al. (2018), our network consisted of an input layer corresponding to a simplified model of the LGN, followed by a layer of spiking neurons whose synaptic weights were updated using STDP. The LGN layer consisted of simulated firing-rate neurons with center-surround RFs, implemented using difference-of-Gaussians (DoG) filters that simulate the computations performed by the retinal ganglion cells and the LGN (Enroth-Cugell and Robson 1966; Derrington and Lennie 1982; further illustrated in Fig. 2). Based on Chauhan et al. (2018), the RF sizes were chosen to reflect the size of representative LGN center-surround cells. It is well known that the preferred SFs of these neurons can differ by about a factor of 3; some cells are therefore tuned to high SFs, while others are tuned to low SFs (Derrington et al. 1979). Here, we used the following three sizes of center-surround RFs: 0.375\(^{\circ }\)/0.75\(^{\circ }\) for low SF, 0.25\(^{\circ }\)/0.5\(^{\circ }\) for medium SF and 0.125\(^{\circ }\)/0.25\(^{\circ }\) for high SF (see Solomon et al. 2002). These values corresponded to the widths of the Gaussians used for the DoG filters.

The ON and OFF LGN maps were thus computed using a DoG model defined as follows:

$$\begin{aligned} \text {LGN}_\text {ON} = \frac{1}{2 \pi \sigma ^{2}_\text {center}} e^{-\frac{\hat{x}^2}{2\sigma ^{2}_\text {center}}} - \frac{1}{2 \pi \sigma ^{2}_\text {surround}} e^{-\frac{\hat{x}^2}{2\sigma ^{2}_\text {surround}}} \end{aligned}$$
(1)
$$\begin{aligned} \text {LGN}_\text {OFF} = - \frac{1}{2 \pi \sigma ^{2}_\text {center}} e^{-\frac{\hat{x}^2}{2\sigma ^{2}_\text {center}}} + \frac{1}{2 \pi \sigma ^{2}_\text {surround}} e^{-\frac{\hat{x}^2}{2\sigma ^{2}_\text {surround}}} \end{aligned}$$
(2)

where \(\text {LGN}_\text {ON}\) and \(\text {LGN}_\text {OFF}\) were the LGN maps, \(\hat{x}\) was the input image, and \(\sigma _\text {center}\) and \(\sigma _\text {surround}\) were the center-surround standard deviations used for the SF scales. The outputs of these filters, respectively, led to low-scale, medium-scale and high-scale images which were subsequently added together and converted into spikes using an intensity-to-latency conversion (Delorme and Thorpe 2001). These spikes were transmitted to the V1 layer, which was composed of integrate-and-fire neurons fully connected to the outputs of the LGN (see Fig. 1). In addition to this multi-scale architecture, we also developed an approach based on lateral scales, which is detailed in Appendix 7.
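For concreteness, the following is a minimal NumPy sketch of this preprocessing step. Only the DoG form of the kernels (Eqs. 1 and 2) and the summation across the three SF scales follow from the description above; the kernel size, the pixel-space sigmas standing in for the degree-based RF sizes, and the rectification of the ON and OFF maps before summing are illustrative assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def dog_kernel(sigma_center, sigma_surround, size=15):
    """ON-center difference-of-Gaussians kernel (Eq. 1); the OFF kernel (Eq. 2) is its negative."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    center = np.exp(-r2 / (2 * sigma_center ** 2)) / (2 * np.pi * sigma_center ** 2)
    surround = np.exp(-r2 / (2 * sigma_surround ** 2)) / (2 * np.pi * sigma_surround ** 2)
    return center - surround

def multi_scale_lgn(image, scales):
    """Convolve the image with ON/OFF DoG kernels at each SF scale and sum the rectified maps."""
    total = np.zeros_like(image, dtype=float)
    for sigma_c, sigma_s in scales:
        on = convolve2d(image, dog_kernel(sigma_c, sigma_s), mode='same')
        total += np.maximum(on, 0) + np.maximum(-on, 0)   # rectified ON and OFF responses
    return total

# Hypothetical pixel-space sigmas standing in for the low/medium/high SF channels.
scales = [(6.0, 12.0), (4.0, 8.0), (2.0, 4.0)]
image = np.random.rand(28, 28)                            # stand-in for an MNIST image
lgn_map = multi_scale_lgn(image, scales)
```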

Fig. 4

Multi-scale network. (a) Reconstruction error (MSE) of the test set. (b) Spike count per neuron: number of spikes fired by an active neuron. (c) Lifetime sparsity: active stimuli during the lifetime of a neuron. (d) Population sparsity: neurons active at any point in time. Mean responses and standard deviation grouped by type of network (low-scale, medium-scale, high-scale and multi-scale). Lifetime sparsity was averaged across neurons and population sparsity across images. \(*** = p < .001\); \(** = p < .01\); \(* = p < .05\); \(ns = p > .05\). All t tests were paired-sample and two-tailed

Table 1 Global results by network type

3.2 Neuron model

The membrane potential \(E_{n}(t)\) of the n-th V1 neuron at time t within the iteration was given by:

$$\begin{aligned} E_{n}(t) = \sum \limits _{m \in \text {LGN}} w_{nm}\, H(t - t_{m}), \qquad t \le \min \{ t \mid \max _{n} E_{n}(t) \ge \theta \}, \end{aligned}$$
(3)

where \(w_{nm}\) was the weight of the synapse connecting the m-th LGN afferent to the n-th V1 neuron, \(t_{m}\) was the spike time of the m-th LGN neuron, H was the Heaviside (unit step) function, and \(\theta \) was the threshold of the V1 neurons (assumed to be a constant shared by the entire population). The expression \(\min \{ t \mid \max _{n} E_{n}(t) \ge \theta \}\) denoted the timing of the first spike in the V1 layer. Membrane potentials were calculated up to this point in time, after which a WTA-I scheme (Maass 2000) was triggered and all membrane potentials were reset to zero. In this scheme, the first V1 neuron to reach threshold inhibited all of its competitors and thereby stopped them from firing until the end of the iteration.
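The event-driven form of Eq. 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the loop processes LGN spikes in order of latency, adds the corresponding synaptic weights to every V1 membrane potential, and stops at the first threshold crossing, at which point WTA-I would reset all potentials; the handling of silent afferents (infinite latency) is an assumption.

```python
import numpy as np

def first_v1_spike(latencies, weights, theta=20.0):
    """Accumulate E_n(t) = sum_m w_nm * H(t - t_m) by processing LGN spikes in latency
    order; return the index of the first V1 neuron to reach threshold and its spike time."""
    potentials = np.zeros(weights.shape[0])          # one membrane potential per V1 neuron
    for m in np.argsort(latencies):                  # earliest LGN spikes first
        if not np.isfinite(latencies[m]):            # silent afferents never spike
            break
        potentials += weights[:, m]                  # each presynaptic spike adds its weight
        winner = int(np.argmax(potentials))
        if potentials[winner] >= theta:              # first threshold crossing triggers WTA-I
            return winner, latencies[m]
    return None, None                                # no V1 neuron reached threshold
```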

3.3 Spike-latency code

Following Chauhan et al. (2018), we converted the LGN activity maps to first-spike relative latencies using a simple inverse operation, \(y = 1/x\), where x was the LGN input and y was the assigned spike-time latency. In this way, the most active units fired first, while units with lower activity fired later or not at all; any monotonically decreasing function would lead to equivalent results (Masquelier and Thorpe 2007).
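A minimal sketch of this intensity-to-latency conversion; treating units with (near-)zero activity as never firing is an assumption consistent with the description above.

```python
import numpy as np

def to_latencies(lgn_map, eps=1e-9):
    """Spike-latency code y = 1/x: the most active LGN units fire first;
    units with (near-)zero activity get an infinite latency and never fire."""
    x = lgn_map.ravel().astype(float)
    latencies = np.full(x.shape, np.inf)
    latencies[x > eps] = 1.0 / x[x > eps]
    return latencies
```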

3.4 Spike-timing-dependent plasticity

The weights of plastic synapses connecting LGN and V1 were updated using multiplicative STDP, an unsupervised learning rule that modifies synaptic strength, w, as a function of the relative timing of pre- and postsynaptic spikes, \(\Delta t\) (Gütig et al. 2003b). Long-term potentiation (LTP; \(\Delta {t} > 0\)) and long-term depression (LTD; \(\Delta {t} \le 0\)) were driven by their respective learning rates \(\alpha ^{+}\) and \(\alpha ^{-}\), leading to a weight change (\(\Delta w\)):

$$\begin{aligned} \Delta {w}={\left\{ \begin{array}{ll} -\alpha ^{-} \cdot w^{\mu ^{-}} \cdot K(\Delta {t},\tau _{-}), \Delta {t} \le 0 \\ \alpha ^{+} \cdot (1-w)^{\mu ^{+}} \cdot K(\Delta {t},\tau _{+}), \Delta {t} > 0, \end{array}\right. } \end{aligned}$$
(4)

where \(\alpha ^{+} = 5 \times 10^{-3}\) and \(\alpha ^{-} = 3.75 \times 10^{-3}\), \(K(\Delta {t}, \tau ) = e^{-\vert \Delta {t}\vert /\tau }\) was a temporal windowing filter, and \(\mu ^{+} = 0.65\) and \(\mu ^{-} = 0.05\) were constants \(\in [0, 1]\) that defined the nonlinearity of the LTP and LTD process, respectively. STDP has the effect of concentrating high synaptic weights on afferents that systematically fire early, thereby decreasing postsynaptic spike latencies for these connections.

In this implementation, computation speed was greatly increased by making the windowing filter K infinitely wide, which is equivalent to assuming \(\tau _{\pm } \rightarrow \infty \) or \(K = 1\) (Gütig et al. 2003a). A ratio \(\alpha ^{+} /\alpha ^{-} = 4/3\) was chosen based on previous experiments that demonstrated network stability (Masquelier and Thorpe 2007), and Chauhan et al. (2018) showed that the results were robust to variations of this ratio. The threshold of the V1 neurons was fixed through trial and error at \(\theta = 20\) and kept unchanged for all experiments.

Initial weight values were sampled from a random uniform distribution between 0 and 1. After each iteration, the synaptic weights for the first V1 neuron to fire were updated using STDP (Eq. 4), and the membrane potentials of all the other neurons in the V1 population were reset to zero. The STDP rule was active only during the training phase.
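A sketch of this multiplicative STDP update with K = 1 is given below. The function signature and the convention that afferents which did not fire before the postsynaptic spike undergo LTD are assumptions; the learning rates and nonlinearity exponents are those given above.

```python
import numpy as np

def stdp_update(weights, winner, latencies, t_post,
                alpha_plus=5e-3, alpha_minus=3.75e-3,
                mu_plus=0.65, mu_minus=0.05):
    """Multiplicative STDP (Eq. 4) with K = 1, applied to the winning V1 neuron only.
    delta_t > 0 (afferent fired before the postsynaptic spike) gives LTP; otherwise LTD."""
    w = weights[winner].copy()
    ltp = (t_post - latencies) > 0                        # pre-before-post afferents
    w[ltp] += alpha_plus * (1.0 - w[ltp]) ** mu_plus      # LTP, soft-bounded at 1
    w[~ltp] -= alpha_minus * w[~ltp] ** mu_minus          # LTD, soft-bounded at 0
    weights[winner] = np.clip(w, 0.0, 1.0)
    return weights
```

During training, this update would be applied once per image, right after the first V1 spike returned by the event-driven loop sketched in Sect. 3.2.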

3.5 Winner-take-all inhibition

We used a hard WTA-I scheme such that, if any V1 neuron fired during a given iteration, it immediately prevented all other neurons from firing until the next sample (Maass 2000). This scheme computes a function WTA-I\(_n\): \(\mathbb {R}^{n}\rightarrow \lbrace 0, 1 \rbrace ^n\) whose output \(\langle y_1, \ldots , y_n \rangle = \) WTA-I\(_n\) (\(x_1,\ldots , x_n\)) satisfies:

$$\begin{aligned} y_i ={\left\{ \begin{array}{ll} 1, \,\,\,\, \hbox {if} \quad x_i > x_j\quad \hbox {for all} \quad j \ne i \\ 0, \,\,\,\, \hbox {otherwise}. \end{array}\right. } \end{aligned}$$
(5)

For a given set of n different inputs \(x_{1}, \ldots , x_{n}\), a hard WTA-I scheme would thus yield a single output \(y_{i}\) with value 1 (corresponding to the neuron that received the largest input \(x_{i}\)), whereas all other neurons would be silent. Sanchez-Garcia et al. (2022) showed that a hard WTA-I scheme was essential for enforcing competition among neurons, which led to sparser object representations and lower reconstruction error compared to softer WTA-I schemes.
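A compact sketch of the hard scheme in Eq. 5 is shown below, together with a softer top-k variant; the top-k formulation is our reading of "number of neurons allowed to be active" in the later experiments and is therefore an assumption.

```python
import numpy as np

def hard_wta(x):
    """Hard WTA-I (Eq. 5): one-hot output with a 1 at the position of the largest input."""
    y = np.zeros(len(x), dtype=int)
    y[int(np.argmax(x))] = 1
    return y

def soft_wta(x, k):
    """Soft WTA-I: the k neurons with the largest inputs may fire (k = 1 recovers the hard scheme)."""
    y = np.zeros(len(x), dtype=int)
    y[np.argsort(x)[-k:]] = 1
    return y
```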

3.6 Stimulus reconstruction

The activity map \(\xi _{i}\) of the i-th V1 neuron was estimated as follows:

$$\begin{aligned} \xi _{i} \approx \sum \limits _{j\in {LGN}}w_{ij}\psi _{j}, \end{aligned}$$
(6)

where \(\psi _{j}\) was the RF of the j-th LGN afferent, and \(w_{ij}\) was the weight of the synapse connecting the j-th afferent to the i-th V1 neuron.

Stimuli k were then linearly reconstructed from the V1 population activity:

$$\begin{aligned} OR_{k}= \sum \limits _{i\in {V1}}r_{ki}\xi _{i}, \end{aligned}$$
(7)

where \(r_{ki}\) was the response of the i-th V1 neuron to the k-th image and \(\xi _{i}\) was its activity map. The reconstruction error for an image k was calculated as the pixel-wise mean squared error (MSE) between the LGN map (\(LGN_k\)) and the reconstruction \(OR_k\).
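In matrix form, Eqs. 6 and 7 amount to two matrix products. The sketch below assumes weights of shape (V1 × LGN), LGN receptive fields flattened to (LGN × pixels), and responses of shape (stimuli × V1); these shape conventions are assumptions for illustration.

```python
import numpy as np

def reconstruct(responses, weights, lgn_rfs):
    """Linear reconstruction: V1 activity maps (Eq. 6) are weight-summed LGN RFs,
    and each stimulus (Eq. 7) is the response-weighted sum of those activity maps."""
    activity_maps = weights @ lgn_rfs       # (n_v1, n_pixels): xi_i = sum_j w_ij * psi_j
    return responses @ activity_maps        # (n_stimuli, n_pixels): OR_k = sum_i r_ki * xi_i

def reconstruction_error(lgn_map, reconstruction):
    """Pixel-wise MSE between the LGN map and the reconstructed stimulus."""
    return np.mean((lgn_map - reconstruction) ** 2)
```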

Fig. 5

Representative object representation (OR) examples using low-scale, medium-scale, high-scale and multi-scale networks (columns). The number below each image indicates the reconstruction error (MSE) for that particular image. The black frame highlights the image with the smallest error

Fig. 6

V1 neurons. (a) Reconstruction error (MSE) of the test set using different numbers of V1 neurons: 100, 200, 400 and 600. (b) Spike count per neuron: number of spikes fired by an active neuron. (c) Lifetime sparsity: active stimuli during the lifetime of a neuron. (d) Population sparsity: neurons active at any point in time. Mean responses and standard deviation grouped by type of network architecture (low-scale, medium-scale, high-scale and multi-scale). Lifetime sparsity was averaged across neurons and population sparsity across images

3.7 Sparsity

We computed a sparsity metric for the population activity in the network schemes according to the definition of sparsity by Vinje and Gallant (2000). On average, we measured how many neurons were activated by any given stimulus (population sparsity) and, for all active neurons, how many stimuli any given neuron responded to (lifetime sparsity), as defined in Eq. 8:

$$\begin{aligned} \textrm{sparsity} = \left( 1 - \frac{1}{N} \frac{\left( \sum _{i=1}^{N}r_{i}\right) ^{2}}{\sum _{i=1}^{N}r_{i}^{2}} \right) \bigg / \left( 1 - \frac{1}{N} \right) , \end{aligned}$$
(8)

For population sparsity, \(r_i\) was the response of the i-th neuron to a particular stimulus, and N was the number of model neurons. For lifetime sparsity, \(r_i\) was the response of a neuron to the i-th stimulus, and N was the number of stimuli. Population sparsity was averaged across stimuli, and lifetime sparsity was averaged across neurons (Beyeler et al. 2016). We also calculated the average number of spikes per stimulus.
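A direct implementation of Eq. 8 (Vinje and Gallant 2000) is sketched below; the same function yields population sparsity when applied to the responses of all neurons to one stimulus, and lifetime sparsity when applied to the responses of one neuron to all stimuli.

```python
import numpy as np

def sparsity(r):
    """Vinje-Gallant sparsity (Eq. 8): 0 for a uniformly active code,
    1 when a single unit carries all of the activity."""
    r = np.asarray(r, dtype=float)
    n = r.size
    s = 1.0 - (r.sum() ** 2) / (n * (r ** 2).sum())
    return s / (1.0 - 1.0 / n)

# Population sparsity: average sparsity(responses to one stimulus) over stimuli.
# Lifetime sparsity:   average sparsity(responses of one neuron)   over neurons.
```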

Table 2 Global results for V1 neurons
Fig. 7

Object representation with Multi-scale network varying the number of V1 neurons: 100, 200, 400 and 600 neurons. The number below each image indicates the reconstruction error (MSE) for that particular image. The black frame highlights the image with the smallest error

Fig. 8

WTA-I schemes. (a) Reconstruction error (MSE) in the test phase as a function of the number of V1 neurons allowed to be active per iteration (WTA-I scheme), for 200 V1 neurons. (b) Spike count per neuron: number of spikes fired by an active neuron. (c) Lifetime sparsity: active stimuli during the lifetime of a neuron. (d) Population sparsity: neurons active at any point in time. Mean responses and standard deviation grouped by WTA-I scheme. Lifetime sparsity was averaged across neurons and population sparsity across images

3.8 Dataset

To demonstrate the generality of our approach, we assessed the ability of our SNN to represent visual stimuli from the MNIST (LeCun 1998), FASHION-MNIST (Xiao et al. 2017), CIFAR10 (Krizhevsky and Hinton 2009) and ORL (Samaria and Harter 1994) datasets. MNIST is a dataset of handwritten digits and consists of 60,000 training patterns and 10,000 test patterns. FASHION-MNIST is a dataset of Zalando article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example in both MNIST and FASHION-MNIST is a \(28 \times 28\) grayscale image associated with a label from 10 classes. The CIFAR10 database consists of 60,000 \(32 \times 32\) color images in 10 classes, with 6000 images per class; there are 50,000 training images and 10,000 test images. The ORL database of faces contains 400 images from 40 distinct subjects. The size of each image is \(92 \times 112\) pixels, with 256 gray levels per pixel.

We enlarged the CIFAR10 and ORL datasets using data augmentation, adding differently oriented versions of the original images, so that their size matched that of the MNIST and FASHION-MNIST datasets.
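A minimal sketch of this augmentation step is given below; the specific rotation angles are illustrative assumptions, not the values used in our experiments.

```python
import numpy as np
from scipy.ndimage import rotate

def augment_with_rotations(images, angles=(-20, -10, 10, 20)):
    """Enlarge a dataset by adding rotated copies of every image."""
    augmented = list(images)
    for angle in angles:
        augmented += [rotate(img, angle, reshape=False, mode='nearest') for img in images]
    return np.stack(augmented)
```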

3.9 Statistical analysis

Data were analyzed using two-way ANOVA and post hoc tests with Tukey’s method to evaluate simultaneously the effect of the two grouping variables (dataset and networks/WTA-I schemes/V1 neurons) on the following response variables: reconstruction error (RE), spike count per neuron (SC), lifetime sparsity (LS), population sparsity (PS), and recognition time, with \(*** = p<.001\); \(** = p<.01\); \(* = p<.05\) and \(ns = p\ge .05\). For the reconstruction error, we used the mean squared error (MSE), the most widely used reference metric, and the Structural Similarity Index Measure (SSIM), which compares the structural and feature similarity of the reconstructed and original images on a perceptual basis.
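Reconstruction quality was thus scored both with MSE (as sketched in Sect. 3.6) and with SSIM. A minimal SSIM computation using scikit-image, assuming images normalized to [0, 1], could look like this:

```python
from skimage.metrics import structural_similarity

def ssim_score(target, reconstruction):
    """SSIM between the original LGN map and the V1 reconstruction (both in [0, 1])."""
    return structural_similarity(target, reconstruction, data_range=1.0)
```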

Table 3 Global results for WTA-I schemes

4 Results

4.1 Object representation using multi-scale network

The performance of the single-scale (i.e., low-scale, medium-scale, or high-scale) and multi-scale networks is summarized in Fig. 4. The results show the reconstruction error, lifetime sparsity, population sparsity and spike count per neuron (mean ± standard deviation) achieved on the test sets for all databases (see Table 1). The reconstruction error for the four networks (low-scale, medium-scale, high-scale and multi-scale) is shown in Fig. 4a. The reconstruction errors of the three single-scale networks (low-, medium- and high-scale) were similar for all datasets, with slight discrepancies on the more complex CIFAR10 and ORL datasets. Interestingly, the multi-scale network further reduced the reconstruction error, and this trend held for all datasets. We also tested whether the mean differences between networks were statistically significant using two-tailed tests with a significance level of \(\alpha =0.05\). The analysis of the average reconstruction error revealed significant differences between the multi-scale network and each single-scale network (low-/multi-scale, medium-/multi-scale and high-/multi-scale). Examples of object representations for all datasets can be found in Fig. 5.

Figure 4b shows the number of spikes per neuron needed for object representation. The number of spikes needed to represent an object decreased with the multi-scale scheme compared to the low-, medium- and high-scale networks. On the other hand, we found that the CIFAR10 and ORL datasets, which we considered the two most complex of the four datasets, needed the highest number of spikes per neuron for all networks.

Figure 4c shows the number of distinct stimuli a neuron responded to during its lifetime. The multi-scale network showed a higher number of active stimuli for all datasets compared to the single-scale networks. Moreover, we found significant differences between the networks, most markedly for the medium-/multi-scale and high-/multi-scale comparisons. The same trend was found for the population sparsity, where the multi-scale network had more active neurons than the low-, medium- and high-scale networks, with significant differences between them (see Fig. 4d).

Fig. 9

Object representation using different WTA-I schemes, where between 1 (hardest scheme, WTA-I 1) and 200 (softest scheme, WTA-I 200) neurons were allowed to be active for each training sample. The number below each image indicates the reconstruction error for that particular image. Target and prediction images were normalized to [0, 1]. The black frame highlights the image with the smallest error in each row

4.2 Object representation using multi-scale network with varying number of V1 neurons

Figure 6a shows the reconstruction error after training for the test set using different numbers of V1 neurons (see Table 2). We found that the reconstruction error went through a minimum (at roughly 200 V1 neurons) for all databases, which is consistent with the bias-variance dilemma (Beyeler et al. 2019). It seems that using a larger number of neurons with our multi-scale network leads to overfitting and a less sharp reconstruction, as can be seen in Fig. 7.

In addition, the number of spikes needed to represent an object increased with the number of V1 neurons, nearly tripling from 200 to 400 neurons and quintupling from 200 to 600 (Fig. 6b). Increasing the V1 population beyond 200 neurons therefore did not lead to any visible benefit in reconstruction error (Fig. 7). We thus limited our V1 population to 200 neurons for all subsequent simulations and analyses.

4.3 Object representation using soft WTA-I schemes

We also tested object representation using various soft WTA-I schemes, where we varied the number of V1 neurons allowed to be active for each training image (see Fig. 8). Figure 8a shows the reconstruction error on the test set across the range of possible WTA-I schemes, ranging from hard (where only one neuron was active per image) to soft (where all 200 neurons were active).

We found that the softer the WTA-I scheme, the higher the reconstruction error (see Table 3). The reason for this became evident when we visualized the resulting object representations (Fig. 9). WTA-I schemes where at most 10 neurons were allowed to be active were instrumental in maintaining competition among neurons. In the absence of a strong WTA-I scheme, multiple neurons ended up learning similar visual features, which resulted in poor object reconstruction (right half of Fig. 9). Also, due to this overlap between neurons, the final feature set was quite limited.

We also found that both the active stimuli during the lifetime of a neuron and the active neurons increased with the number of V1 neurons allowed to be active during training (see Fig. 8c, d). Furthermore, the number of spikes needed to represent an object showed the same trend (Fig. 8b).

5 Discussion

In this work, we have proposed an SNN model that uses spike-latency coding and WTA-I to efficiently represent visual stimuli using multi-scale parallel processing. In particular, this paper extended earlier work (Chauhan et al. 2018, 2021; Sanchez-Garcia et al. 2022) to investigate how the quality of the represented objects changes under different processing schemes of the primary visual system, with subsets of neurons tuned to different SF scales.

We found that the multi-scale network outperformed all three single-scale networks across all datasets (Fig. 4), sacrificing sparsity for a lower reconstruction error. However, it is interesting to note that the multi-scale network used the smallest average number of spikes per neuron (Fig. 4b) across all datasets, indicating that it favored a code where many neurons were weakly activated. In all cases, the learned receptive fields (Fig. 3) were in agreement with nonnegative sparse coding (NSC), which is an efficient population coding scheme based on dimensionality reduction and sparsity constraints that promotes sparse and parts-based population codes (Beyeler et al. 2019).

We also studied how the number of V1 neurons in the network affected the reconstruction error and the sparsity of the learned population code. In agreement with previous work on NSC (Beyeler et al. 2016, 2019), we found that the reconstruction error (on the test set) goes through a minimum as a function of network size (Fig. 6a). This minimum is thought to indicate the optimal model complexity according to the bias-variance dilemma, that is, the point at which the model’s generalization error is minimized. Curiously, this “sweet spot” was found at roughly 200 V1 neurons for all tested datasets (Fig. 7). On the other hand, sparsity increased monotonically with network size (Fig. 6b–d), which is more in line with the traditional sparse coding literature (Olshausen and Field 1997).

We also implemented various soft WTA-I schemes to investigate how the quality of the represented objects changed (Fig. 8). The soft WTA-I schemes allowed 10, 50, 100, or 200 (i.e., all) neurons to fire during a given iteration, while all other neurons were silent. We found that the softer the WTA-I scheme, the larger the reconstruction error (Fig. 8a) and the number of spikes needed to represent an object (Fig. 8b). The reason for this became clear when we visualized the resulting object representations (Fig. 9): in the absence of a strong WTA-I scheme, multiple neurons ended up learning similar visual features, resulting in poor object reconstructions.

Although our network was able to efficiently represent images from various datasets, an important issue that we did not address in this paper is a comparison with SNNs that use other forms of STDP (e.g., an additive instead of a multiplicative rule) or other learning schemes (e.g., surrogate gradient training). In addition, a future extension of the model might focus on deeper architectures with parallel processing at multiple scales and more challenging visual stimuli.

6 Conclusion

In conclusion, we have shown that a network of spiking neurons tuned to different SFs can represent objects with as few as 15 spikes per neuron using spike-latency coding and WTA-I. WTA-I schemes were essential for enforcing competition among neurons, which led to sparser object representations and lower reconstruction errors. Studying how object recognition may be implemented using biologically plausible learning rules in SNNs may not only further our understanding of the brain, but also lead to new efficient artificial vision systems.

7 Comparison between multi-scale and lateral-scale network architectures

Fig. 10

Lateral-scale network. Images from the ORL dataset (Samaria and Harter 1994) were convolved with ON and OFF center-surround kernels to simulate responses in the LGN. We used three LGN sub-networks, each processing a particular SF band: low-scale, medium-scale and high-scale (see Fig. 2). The three LGN responses were converted to spike latencies and each was fed to its own SNN, resulting in three lateral SNNs with plastic synapses implementing STDP and WTA-I. The images reconstructed by the three lateral networks were added at the end to form the object reconstruction

Fig. 11

(a) Reconstruction error (MSE) of the test set using multi-scale and lateral-scale networks. (b) Number of spikes per neuron needed for object representation using multi-scale and lateral-scale networks. (c) Lifetime sparsity: active stimuli during the lifetime of a neuron. (d) Population sparsity: neurons active at any point in time. \(*** = p<.001\); \(** = p<.01\); \(* = p<.05\); \(ns = p>.05\). All t tests were paired-sample and two-tailed

Fig. 12

Object representation for multi-scale and lateral-scale network architectures using 200 V1 neurons. Two examples of object representation (image A and image B) for multi-scale and lateral-scale architectures and for the four databases. The lateral-scale scheme recognizes some finer details in the image compared to multi-scale, where the image details are coarser. The number below each image indicates the reconstruction error (MSE) for that particular image. The black frame highlights the image with the smallest error

Table 4 Global results for multi- and lateral-scale

We propose another network architecture, called ‘lateral-scale’, that also uses parallel processing of multiple scales (see Fig. 10). In this case, the LGN preprocessing is the same as in the multi-scale architecture, but the three LGN responses were converted to spike latencies and each was fed to its own SNN, resulting in three lateral SNNs with plastic synapses implementing STDP and WTA-I. The images reconstructed by the three lateral sub-networks were added at the end of training to form the object representation.

As shown in Fig. 11a, the lateral-scale network yields a slightly lower, but very similar, reconstruction error compared to the proposed multi-scale network. This may be because the lateral-scale scheme recovers a few more fine details in the image (see Fig. 12). Lateral-scale was not significantly better than multi-scale in terms of object representation (see Fig. 12), but it used significantly more spikes (Fig. 11b): the number of spikes per neuron required for reconstruction approximately doubled for some datasets. One drawback of the lateral-scale network is that three lateral sub-networks must be trained, which means three times as many trainable weights.

See Table 4.