
1 Introduction

As most of our knowledge and information comes through the visual system, visual working memory plays an important role in our cognitive processes. However, although the roughly hundred billion (\(10^{11}\)) neurons and several hundred trillion synaptic connections of the human brain can process and exchange prodigious amounts of information over a distributed neural network within milliseconds, the information load we can process in a short time (within seconds) is limited [1]. It is not clear whether limited physical resources are the major cause of selective visual attention; in experiments on both human and primate brains, the dual-task interference between a simultaneous spatial attention task and a spatial memory task shows competition for physical resources [2]. Even when working memory and selective attention were manipulated in two separate and unrelated tasks, the interference between them was obvious, and the effect of visual distractors was crucially determined by the availability of working memory [3]. In EEG studies and other kinds of psychological experiments, increased image complexity or added distractors induce increased brain activity [3].

In our daily life, we read information from different media, either through electronic media such as web sites or from paper publications. Most of this information reaches the brain through the visual system. However, the visual information the viewer actually attends to may not be the information the author intends to convey, or the viewer may find the publication unpleasant due to its high visual complexity [4]. In serious situations, the mismatch caused by visual complexity can even induce car accidents [5], largely due to the dual-task interference between memory and attention [2, 6]. An algorithm based on SIFT and K-means has been designed to estimate this mismatch (namely the distance between the expected region of "visual conspicuity" and the real salient areas) as a metric of visual complexity [7]. As visual complexity (also called image complexity) is an important cause of this mismatch, the saliency mismatch and a visual complexity estimator are closely related. Furthermore, cognitive experiments have shown that image saliency and attention priority can determine visual working memory capacity, where image information is stored [8]. The mismatch between the image targets and the interesting points of an image can thus be shown to affect our visual working memory. To date, hundreds of papers have discussed saliency detection in images [7]; a comprehensive survey of saliency detection algorithms is not intended here. We employ SIFT because of its popularity in computer vision applications and in saliency modelling [9–11]. In addition, SIFT resembles the information processing of the inferior temporal cortex and provides a good image feature descriptor that is invariant to scale variation [9].
In the later part of this paper, results of this SIFT & K-means algorithm are compared with two other saliency maps [12, 13], which show a similar trend of image saliency shift. The results of SIFT & K-means are then further validated by a cognitive experiment, which shows that the more complex the image background is, the fewer objects the participants keep in their visual working memory. This is highly correlated with our new estimate of the saliency mismatch. The SIFT & K-means algorithm can thus serve as a reference for analyzing web sites, publications, advertisements, movie frames and other media content.

2 Background Knowledge

Saliency detection is considered the key to the attentional mechanism. In many papers, the location of image saliency is defined as the region the viewers pay attention to, also called the "conspicuity area" [7]. Information is said to be 'attended' if it is kept in visual working memory [12]. It is believed that a two-component framework controls where visual attention is deployed: a top-down cognitive volitional control and a second, faster bottom-up saliency-based primitive mechanism [12]. This chapter investigates people's direct response to visual stimuli, namely the bottom-up primitive mechanism. The visual stimuli that win the competition for saliency are sent to higher levels of the brain's neural networks. If the wrong stimuli or the noise are strong, the expected information can be overwhelmed and missed. Thus the mismatch between the intended visual target and the image saliency can seriously disturb viewers, who can easily miss cues due to image quality, visual complexity and other reasons [14, 15]. In cognitive science, the limits of saliency-based information search and the shifts of visual attention are largely due to the limited visual working memory capacity [12, 16–18]. Although researchers have claimed different capacity limits (for example, Nelson Cowan puts the visual working memory capacity at 4, while in Miller's research the limit is \(7\pm 2\)) [16, 19, 20], the upper bound of 7 remains a common reference [16, 20]. Thus in the following sections the visual working memory capacity limit is set to 7, meaning the maximum number of salient items that can effectively attract a viewer's attention is 7. If the number of salient items is higher than 7, some items may be ignored by the viewer, causing the so-called attention competition between the expected target information and irrelevant items [12].
The competition between the expected attention target and the real interest points in an image is the subject of this paper. For the study of interesting points, SIFT is well recognized as an efficient image feature description method for image recognition [9–11]. As a bottom-up approach, SIFT key point detection is not only popular for image feature recognition, but also widely used as a step in saliency estimation [11, 21, 22].

Saliency detection algorithms are still being improved, and different algorithms define different salient regions for the same image. In Sect. 3.1.5, to check the functionality of SIFT & K-means, the Itti-Koch and AIM saliency maps are implemented [12, 13]. The Itti-Koch method is an unsupervised method that combines color, intensity and other texture information, while AIM uses Shannon's information measure to transform the image feature plane into a dimension that closely corresponds to visual saliency. The saliency maps generated by these two methods differ from each other for complex images but are correlated.

3 A Tentative Measurement Algorithm

3.1 Method

Based on the SIFT Density Map (SDM) described in [21], a combination of the SIFT and K-means algorithms is implemented in this paper to calculate the locations of salient regions. A scale-free distance between the expected target locations and the interesting points is then measured; this distance is the estimator of the mismatch. Since computer-recognized saliency differs from one algorithm to another, it should be noted that the algorithm in this paper is not meant to precisely locate the most probable first attention point; the locations calculated by the K-means algorithm are rather a reference for the viewer's possible attention.

3.1.1 SIFT and Key Points

To implement SIFT, all images are first converted to gray scale. We then obtain the candidate key points from the scale space by finding the maxima and minima of the convolution of the image I(x, y) with a variable-scale Gaussian kernel \(G(x,y,\sigma )\) [9, 23]. The scale space of an image is defined as a function \(L(x,y,\sigma )\):

$$\begin{aligned} G(x,y,\sigma ) &= \frac{1}{2\pi \sigma ^2}e^{-(x^2+y^2)/2\sigma ^2} \end{aligned}$$
(1)
$$\begin{aligned} L(x,y,\sigma ) &= G(x,y,\sigma )*I(x,y) \end{aligned}$$
(2)
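As a minimal sketch of Eqs. 1 and 2, the scale space can be computed with an off-the-shelf Gaussian filter; the toy image and the base scale of 1.6 below are illustrative choices, not the experimental stimuli:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(image, sigma):
    """L(x, y, sigma) = G(x, y, sigma) * I(x, y): blur the image with a
    variable-scale Gaussian kernel (Eq. 2)."""
    return gaussian_filter(image.astype(float), sigma)

# Toy grayscale image: a single bright spot on a dark background.
img = np.zeros((32, 32))
img[16, 16] = 1.0
L1 = scale_space(img, sigma=1.6)  # 1.6 is a commonly used base scale
```

The blur spreads the spot's intensity while preserving its total energy, which is why the Gaussian kernel is normalized in Eq. 1.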


A difference-of-Gaussian function \(D(x,y,\sigma )\) is then calculated. The candidate key points are detected at the maxima and minima of \(D(x,y,\sigma )\):

$$\begin{aligned} D(x,y,\sigma ) &= (G(x,y,k\sigma )-G(x,y,\sigma ))*I(x,y) \end{aligned}$$
(3)
$$\begin{aligned} &= L(x,y,k\sigma )-L(x,y,\sigma ) \end{aligned}$$
(4)

where k is a constant factor.
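Equations 3 and 4 amount to a subtraction of two blurred copies of the image. A minimal sketch, assuming SciPy's Gaussian filter and \(k=\sqrt{2}\) (a common choice between adjacent scales):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussian(image, sigma, k=2 ** 0.5):
    """D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma) (Eqs. 3-4)."""
    img = image.astype(float)
    return gaussian_filter(img, k * sigma) - gaussian_filter(img, sigma)

# Single bright blob: the strongest (most negative) DoG response
# sits at the blob itself, so it would be picked up as an extremum.
img = np.zeros((32, 32))
img[16, 16] = 1.0
D = difference_of_gaussian(img, sigma=1.6)
```

Candidate key points are then located at the local maxima and minima of D across space and scale.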

After obtaining the candidate key points, the next step is to test them: key points with low contrast are rejected by a threshold on the value of \(|D(\hat{x})|\). The location, scale and ratio of curvatures are then calculated for the selected key points.
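The low-contrast rejection step can be sketched as a simple threshold on the DoG response; the candidate locations, the response values and the 0.03 threshold below are illustrative (0.03 is the value Lowe suggests for images scaled to [0, 1]), not taken from this study:

```python
import numpy as np

def reject_low_contrast(candidates, D, threshold=0.03):
    """Keep only candidate key points whose DoG magnitude |D| meets the
    contrast threshold; weak responses are discarded."""
    return [(y, x) for (y, x) in candidates if abs(D[y, x]) >= threshold]

# Hypothetical DoG responses at three candidate locations.
D = np.zeros((8, 8))
D[2, 2], D[5, 5], D[6, 1] = 0.2, -0.1, 0.001
kept = reject_low_contrast([(2, 2), (5, 5), (6, 1)], D)
# The weak response at (6, 1) is rejected.
```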

3.1.2 Distance Parameters Q

Since the computer-generated key points are closely related to real fixation points, as validated by [21], a K-means algorithm whose cluster centers are labeled \(C_i\) is used to calculate the possible locations of interesting points from the set of selected key points. The number of cluster centers is \(n\le \min \{7,N_T\}\), where \(N_T\) is the number of target objects. If the nearest expected target object location is labeled \(T_i\), then the distance between \(C_i\) and \(T_i\) can be expressed as \(\varDelta CT_i\). The scale-free parameter derived from \(\varDelta CT_i\) is represented as Q in Eq. 5, which is the reference parameter for the saliency mismatch. In Eq. 5, X is the image length and Y the image width in pixels.

$$\begin{aligned} Q = \frac{\sum _{i=1}^{n}\varDelta CT_i }{n\sqrt{XY}} \end{aligned}$$
(5)

where \(n\le \min \{7,N_T\}\).
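A sketch of Eq. 5, assuming SciPy's kmeans2 for the clustering; the key point cloud and the target positions below are synthetic illustrations, and a tight grouping of key points around the targets should yield a small Q:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def mismatch_Q(keypoints, targets, image_shape, seed=0):
    """Scale-free saliency mismatch of Eq. 5: cluster the key points with
    K-means (n <= min(7, N_T) centers), pair each cluster center C_i with
    its nearest target T_i, and normalize the mean distance by sqrt(X*Y)."""
    X, Y = image_shape
    n = min(7, len(targets))
    centers, _ = kmeans2(np.asarray(keypoints, float), n,
                         seed=seed, minit="++")
    targets = np.asarray(targets, float)
    # Distance from every cluster center to its nearest expected target.
    d = np.linalg.norm(centers[:, None, :] - targets[None, :, :], axis=2)
    delta_CT = d.min(axis=1)
    return delta_CT.sum() / (n * np.sqrt(X * Y))

# Key points tightly grouped around two targets -> small mismatch Q.
rng = np.random.default_rng(0)
targets = [(20.0, 20.0), (80.0, 80.0)]
keypoints = np.vstack([rng.normal(t, 1.0, size=(30, 2)) for t in targets])
Q = mismatch_Q(keypoints, targets, image_shape=(100, 100))
```

Scattering the key points away from the targets would grow \(\varDelta CT_i\) and hence Q, which is exactly the mismatch the parameter is meant to expose.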

3.1.3 Experiments and Results

To validate this new algorithm, two experiments were carried out with a dataset of 250 images. The locations of the target objects are stored in a database before the target objects are merged with different backgrounds. In our experiments, the target objects are small white balls with numerical marks on them.

3.1.4 Experiment 1

To illustrate the measurement process, the SIFT algorithm is applied to the four images shown in Fig. 1, whose background complexity increases in sequence. In these images, the red diamonds represent the key point cluster centers \(C_i\). The blue arrows, derived from the difference-of-Gaussian function D, represent the vectors of the selected key points and indicate each key point's orientation and scale.

Fig. 1
figure 1

Images of experiment. The red diamond is the key points K-means cluster center and the blue arrows are the vectors of key points

After applying the SIFT & K-means algorithm introduced in Sect. 3.1, the mismatch values for these four images are listed in Table 1. In the first image, the background is a white plane with very low image complexity, so \(C_i\) registers well with \(T_i\). From the second image onwards, the background becomes increasingly complex, from a cloudy sky to a crowd of people, and the distraction from the target objects to the image background becomes more and more serious: more and more interesting points derived from the SIFT algorithm shift away from the target objects.

3.1.5 Experiment 2

In the second experiment, the same images from Fig. 1 are processed by the Itti-Koch and AIM algorithms to generate their saliency maps. Figure 2 shows the result of the Itti-Koch method and Fig. 3 the result of the AIM method, where the white areas are the salient regions. For comparison, the red diamond still indicates the location of \(C_i\) from Sect. 3.1.4, and the green stars represent the target objects.

The saliency maps of Itti-Koch and AIM differ from each other but are nearly the same for the first image: the salient regions all register well with the green targets. As with the results of the SIFT & K-means method in Sect. 3.1.4, the shift of the salient points starts from the second image (Fig. 2b) and becomes most obvious for the fourth image (Fig. 2d). The distance between the green targets and the centers of the salient regions is then measured by Eq. 5, following the method described in Sect. 3.1.2. The mismatch parameter Q for the Itti-Koch saliency map is shown in Table 2.

Table 1 Q value of SIFT-K method for Fig. 1

The AIM method recognizes the target objects well in nearly all four images of Fig. 3. However, the shift of attention is not obvious, as the targets are still labeled salient while the image background becomes more complex and the number of salient regions increases globally. Instead of a salient region shift, the AIM method shows an increase of non-target salient information. Especially in Fig. 3c, d, the number of labeled white salient regions is higher than the visual attention capacity, so viewers may find it difficult to locate the targets and keep them in visual working memory.

Fig. 2
figure 2

Itti-Koch’s saliency and target objects [12]

Fig. 3
figure 3

AIM’s saliency and target objects [13]

Table 2 Q value of Itti-Koch for Fig. 1

3.1.6 The Key-Points Ratio \(K_{num}\)

Besides the distance parameter Q, the key point number \(K_{num}\) is another important parameter for evaluating image complexity. The number of key points derived from the SIFT algorithm increases with the complexity level and can thus be another important reference for it, especially when the parameter Q loses its ability to distinguish the background saliency from the target memory items (the background saliency and the target memory items may overlap) (Tables 3 and 4).

Table 3 \(K_{num}\) list of Fig. 1
Table 4 Five factors of input
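Counting key points as a complexity proxy can be sketched without a full SIFT implementation by counting thresholded DoG extrema; this is only an approximation of \(K_{num}\) (a complete pipeline, such as OpenCV's cv2.SIFT_create(), additionally performs edge rejection and sub-pixel refinement), and the test images below are synthetic:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def keypoint_count(image, sigma=1.6, k=2 ** 0.5, contrast=0.01):
    """Rough K_num proxy: count local extrema of the DoG response whose
    magnitude passes a contrast threshold."""
    img = image.astype(float)
    D = gaussian_filter(img, k * sigma) - gaussian_filter(img, sigma)
    is_max = D == maximum_filter(D, size=3)
    is_min = D == minimum_filter(D, size=3)
    strong = np.abs(D) >= contrast
    return int(np.count_nonzero((is_max | is_min) & strong))

# A busier texture (more random speckles) yields more key points,
# mirroring the rise of K_num with background complexity.
rng = np.random.default_rng(1)
plain = np.zeros((64, 64))
plain[32, 32] = 1.0
busy = plain.copy()
busy[rng.integers(0, 64, 40), rng.integers(0, 64, 40)] = 1.0
```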

4 Human Visual Test

Fig. 4
figure 4

Anova analysis of human visual experiment

The stimuli are the images shown in Fig. 1. They were presented to 70 young participants at the same time in a large classroom; the group was evenly split between males and females, with an average age of 21. Each image was presented for 5 s, followed by 30 s of memory recall time for the students to note down the numbers they had seen in the image. Four students' records were detected as outliers and rejected according to a standard deviation analysis. The ANOVA of the remaining 66 students' correctly recorded items is shown in Fig. 4: when the image complexity increases, the mean number of correctly recalled items decreases significantly. Generally, all three methods' complexity measurements vary inversely with the number of items the students remembered. The Spearman correlation between the new saliency measurement value and the students' mean number of correctly recalled items is strong, with \(|r| = 0.80\). Performing the same test for the Itti-Koch Q gives \(|r|=0.78076\), while the correlation for SIFT & K-means is as high as \(|r|=0.96621, p<0.05\).
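The Spearman test above can be reproduced with scipy.stats.spearmanr; the complexity scores and recall means below are made-up illustrative values, not the study's data:

```python
from scipy.stats import spearmanr

# Hypothetical complexity scores (e.g. Q values) and mean numbers of
# correctly recalled items per image -- illustrative values only.
complexity = [0.02, 0.10, 0.25, 0.40]
mean_recalled = [6.5, 5.8, 4.1, 2.9]

rho, p = spearmanr(complexity, mean_recalled)
# A perfectly monotone decrease of recall with complexity gives |rho| = 1
# here; the chapter reports |r| = 0.80 on the real classroom data.
```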

Although this is not a strict cognitive experiment, it does show that the K-means-based saliency mismatch estimation method is consistent with human visual cognition. Besides the above experiments, another 32 images were tested with a procedure similar to that of Sect. 3.1.

5 Brain Computer Interface Experiments

To further validate the above algorithm and hypothesis, a brain-computer interface experiment was carried out. The task in our EEG experiments is to remember the numbers attached to white balls scattered over different image backgrounds, as shown in Fig. 1. Each number is treated as an item for the visual working memory task, and the number of items to be remembered varies from low to high. The stimuli were presented and the participants' responses recorded by the E-Prime software. The participants sat 70 cm in front of a 19 in. LCD screen, straight ahead of their eyes horizontally, and the visual stimuli were displayed at the center of their visual field, subtending a visual angle of around \(6.5^\circ \). After the stimulus presentation, the participants had a 1000 ms retention interval before the memory recall.

6 Image Measurement

The target items to be remembered were numbers of limited digit length (<4). Four participants viewed 252 groups of images, i.e. 252 trials in total. The target items are all small white balls associated with random numbers in blue, and the numbers to be remembered differed from image to image. The images were shown to the participants in random sequence, and each participant had 50 trials on average. The image backgrounds vary from simple to complex and are classified into four levels, as shown in Fig. 1; the trials for every participant include all four background levels.

To measure each image, the target item number and the background texture were labeled as a factor vector (Xnum, Xback, Xpos). Xnum is the number of items shown per image, varying from 3 to 50. The complexity level of the image background is the second factor: the stimuli range from a mono-color background to backgrounds with different textures added to increase the complexity level. The background complexity value, labeled Xback, is calculated with our SIFT & K-means algorithm. The third factor is the position distribution of the items, Xpos, which is high if the arrangement of the items is random and low if the items are arranged in an array. The memory recall of the participants was recorded and their responses automatically scored at the end of each trial. To make sure the participants were not aware of factor level changes and did not deliberately change their attention, all factor values were arranged randomly for each image during image generation.
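The randomized factor assignment might be sketched as follows; the factor names follow the text, but the exact level sets are assumptions for illustration:

```python
import random

def make_trial_factors(seed=None):
    """Hypothetical sketch of the randomized factor assignment: each
    generated image draws an item count Xnum, a background complexity
    level Xback and an item-layout flag Xpos at random, so participants
    cannot anticipate the next trial's factor levels. The level sets
    below are illustrative, following the ranges stated in the text."""
    rng = random.Random(seed)
    return {
        "Xnum": rng.choice([3, 4, 5, 16, 25, 50]),  # items per image
        "Xback": rng.choice([1, 2, 3, 4]),          # background level
        "Xpos": rng.choice(["array", "random"]),    # item arrangement
    }

trial = make_trial_factors(seed=42)
```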

6.1 Statistics of Image Measurement Results

The statistics of the total number of items remembered by the participants across Xnum values are described in Fig. 6a. Although the participants were offered a glut of items in one image, the maximum number of items recalled never exceeded 6. Considering that there are time-related mechanisms in the brain, some trials were set with longer duration; image stimuli were typically presented for the same 2 s, and the total average presentation time was within 2–3 s. The short-term memory limit also remained obvious when an image appeared repeatedly. For one participant, we tested an image with 16 numbers, shown for 1 s, three times within 15 min in three separate trials. One would normally assume the participant becomes familiar with the image and the number of correctly entered items should increase significantly. However, the participant's recalled items increased from 4 to 6, but never beyond 6 (Fig. 5).

Fig. 5
figure 5

a Shows the total tested input x, y position in the image compared to the whole image screen. b Shows the correctly picked number x, y positions in the screen. (0, 0) is the center of the screen while the x, y ranges from −1 to 1. In the figure, CV is the abbreviation of coefficient variation

We observed the main brain activity in the frontal, parietal and occipital areas. To some extent, the increased visual information load, especially Xnum and Xback, increases not only the visual working memory load, but also a combination of attention, mental and visual working memory load, because visual working memory and attention share some common neural substrates [24]. The attention and working memory representations in the cortex have been shown to overlap in areas such as the frontal and parietal cortex [25, 26]. The increase of EEG power reflects the combined attention and mental load. The shared representation across brain areas is also considered an important reason for the limited working memory capacity: as attention and mental load increase, the information competes for neural resources in the same brain area.

6.2 Electrophysiological Analysis

The EEG signal was acquired from an electrode cap with 32 channels at 1000 Hz by the BrainVision Recorder, with electrode impedance kept below 20 k\(\Omega \). The reference signals VEOG and HEOG were also recorded. The recorded signals were then processed by BrainVision Analyzer, a professional software package for EEG signal analysis that provides multiple signal processing and pattern recognition techniques. In this study, the signal processing followed the sequence of dataset preprocessing, IIR filtering, band rejection, artifact rejection, frequency filtering, segmentation and comparison. Any suspicious muscle-activity-related artifact was rejected during artifact rejection and ocular correction ICA (Independent Component Analysis). Frequencies higher than 70 Hz were filtered out during the band rejection transform, which removes interference from the power supply or poorly shielded electrical devices; the line noise at the notch frequency (50 Hz) was also filtered out in this step.
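The 50 Hz notch and the >70 Hz rejection described above can be approximated in SciPy (the study itself used BrainVision Analyzer; this is only an equivalent sketch of the filtering steps, on a synthetic one-channel trace):

```python
import numpy as np
from scipy.signal import iirnotch, butter, filtfilt

def preprocess_eeg(signal, fs=1000.0):
    """Sketch of the frequency-domain cleaning: a 50 Hz notch for line
    noise, then a low-pass cut above 70 Hz."""
    b, a = iirnotch(w0=50.0, Q=30.0, fs=fs)      # remove 50 Hz line noise
    signal = filtfilt(b, a, signal)
    b, a = butter(4, 70.0, btype="low", fs=fs)   # reject > 70 Hz
    return filtfilt(b, a, signal)

# Synthetic 1 s trace: a 10 Hz "alpha" component plus 50 Hz line noise.
fs = 1000.0
t = np.arange(0, 1.0, 1.0 / fs)
raw = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 50 * t)
clean = preprocess_eeg(raw, fs)
```

After filtering, the 50 Hz component is strongly attenuated while the 10 Hz component passes nearly unchanged.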

Fig. 6
figure 6

a The average correctly recalled item numbers in each Xnum group were summarized from all trials. The error bar is the standard deviation of each group’s remembered items. b The item numbers are divided into 3 groups, the average EEG power from each trial in high load group (\( Xnum\in \{16, 25, 50\} \)) is nearly one time higher than the low load groups (\(Xnum\in \{3,4,5\}\))

The filtered EEG signal was segmented; the brain state during the memory retention time (1 s duration) is represented by the corresponding EEG power segment [27, 28]. The average EEG power was calculated from these segments using the Fast Fourier Transform (FFT) [29–32]. The average EEG power of a segment is found to be positively correlated with the image complexity score. Generally, when the memory load increases, the participant's brain activity also increases significantly compared to the brain's previous state. However, this increase is not unbounded: the average performance and the number of remembered items drop after \(Xnum=16\) in Fig. 6a. The results in [33] show a similar brain power limitation for more difficult tasks, indicating that children and older adults have decreased alpha power under higher memory load. In general, however, Fig. 6b shows a statistically higher EEG power level with increased target item numbers. Our results show that young adults also have, on average, weak EEG power when the task difficulty is far beyond their ability. The Pearson correlation test shows a positive correlation between the EEG power and Xnum (\(p<0.05\), Bonferroni corrected). For fixed Xnum, the correlation between Xback and the EEG power is also positive \((p<0.05)\). In comparison, the correlation between Xpos and the EEG power is modest: positive but not significant (\(0.05<p<0.1\)). In summary, the EEG power of the working memory process is closely related to the main visual complexity factors Xnum and Xback.
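The FFT-based average power of a retention segment can be sketched as follows; the 8–30 Hz band is an illustrative choice, not necessarily the band used in the study:

```python
import numpy as np

def segment_power(segment, fs=1000.0, band=(8.0, 30.0)):
    """Average EEG power of a retention-time segment via FFT: the mean of
    the squared spectral magnitudes inside a frequency band."""
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    spectrum = np.abs(np.fft.rfft(segment)) ** 2 / len(segment)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return float(spectrum[mask].mean())

# A higher-amplitude oscillation yields higher average band power,
# mirroring the increased EEG power under higher memory load.
fs = 1000.0
t = np.arange(0, 1.0, 1.0 / fs)
low_load = np.sin(2 * np.pi * 10 * t)
high_load = 2.0 * np.sin(2 * np.pi * 10 * t)
```

Since power scales with the square of the amplitude, doubling the oscillation amplitude quadruples the band power.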

7 Conclusion

In this chapter, the relationship between visual complexity and visual working memory capacity is discussed. Although the relationship between visual complexity and visual attention is well known, the relationship between visual complexity and working memory capacity is rarely discussed. Increased visual complexity means a higher information load and makes correct visual attention towards the locations of the memory target items difficult, as attention is limited by the visual working memory capacity.

Based on the relationship between visual complexity and visual working memory capacity, this chapter introduced a new algorithm, SIFT & K-means, to measure the discrepancy between the expected target objects and the image salient regions. The mismatch and visual complexity metric calculated by the SIFT & K-means algorithm in the first experiment is consistent with the human visual working memory experiments. The second experiment, comparing this method with two saliency detection methods, shows the reliability of the algorithm. The SIFT & K-means algorithm can thus serve as a reference for the measurement of image quality and image complexity. Both the EEG and the psychological experiments yield results consistent with the visual working memory capacity our algorithm predicts: the EEG results clearly show that increased brain activity is needed for correct visual attention when the visual complexity is high. Our findings from the above experiments confirm the close link between visual complexity and visual working memory capacity.