
1 Introduction

Low-level vision is visual processing that treats images as patterns and makes no specific assumptions about the objects that might be present or the structure of the scene. In short, the processing is generic and intended to be suitable for all images, regardless of their semantic content or high-level layout. Examples of low-level vision tasks include segmentation, candidate regions or object proposals, and image enhancement. Low-level processing is typically performed in preparation for high-level tasks, and is used to allocate computational resources for more detailed processing. In mammalian visual systems, low-level vision is believed to be performed in the retina and area V1 of the visual cortex.

In this paper we propose a unified framework for several low-level vision tasks (Fig. 8.1) that are typically modeled separately. These tasks include the generation of hierarchical image segmentations, proposing candidate regions for object detection and recognition, base-detail decomposition – where an image is decomposed into a visual summary plus fine details – image enhancement, and the prediction of human eye fixations.

Fig. 8.1

We propose a unified approach for several low-level visual processes: (a) image segmentation – a hierarchy of image partitions at multiple levels; (b) candidate regions – a pool of possibly overlapping proposals for further study by object recognition methods (best candidates illustrated); (c,d) “base-detail” decomposition – expressing the image as the sum of a non-local smooth appearance term and a residual, or detail, which captures the texture patterns; (e) image enhancement – controlling the amount of detail in the image; (f) saliency for fixation prediction – a model predicting bottom-up human visual attention

We start by producing a hierarchical decomposition of the image into segments which are roughly homogeneous as measured by texture and color cues. Segments at higher levels of the hierarchy are generally larger and less homogeneous. In our approach, it is important that segments within each level of the hierarchy have different sizes, because some image regions (e.g., sky) are much more homogeneous than others (e.g., a road containing several cars).

Different segments of the hierarchy are combined into groups of up to three to make proposals for the positions and shapes of objects and background “stuff” [10], which we refer to as candidate regions. The result is a pool of 500–1500 regions that are later evaluated by a high-level method, which is outside the scope of this paper. The high-level method computes category-specific scores to identify regions which correspond to object or background categories.

We define “base-detail” decomposition as the separation of the image into a coarse description of the image appearance, and a description containing the texture and details. The image is the sum of both. More precisely, the base is obtained by fitting smooth appearance models (polynomials) to the image segments and the detail is the residual. For examples, see Figs. 8.1c, d and 8.8. This base-detail decomposition enables us to process the image in several ways, such as enhancing the details and/or the base. For example, we can remove the shadows (details) from a grass lawn (base). Surprisingly, as we now discuss, we can use base-detail decomposition to predict human eye fixations for free viewing.

It is well known that when humans examine an image they do not gaze at it uniformly but instead fixate on certain parts of the image. The fixation saliency model we propose favors small segments which have strong details. The intuition is the following: large segments are typically homogeneous regions (e.g., sky, water, or grass) which may be easily processed (i.e., classifying these regions may be easy using methods which use summary image statistics and do not model the detailed spatial relations). The detail is less important in the large segments, but in small segments the detail may correspond to structures which require more detailed models to process. We describe experiments showing that our fixation saliency model predicts human fixations with state-of-the-art performance on complex datasets such as Pascal [17] and Judd [31].

Our work is motivated both by attempts to understand how primate visual systems work and by efforts to design computer vision systems with similar abilities. We provide a computational model for performing these visual tasks but in this paper we do not develop any detailed biological evidence for this theory. Instead we concentrate on performance on complex visual scenes, instead of artificial stimuli, because we think it is important to model visual abilities in real-world conditions.

2 Background and Related Literature

There is an enormous literature on segmentation, much of it using Markov Random Field (MRF) models [22]. Our work follows the alternative strategy of decomposing images into subregions which have roughly similar statistical image properties [1, 33, 45, 52]. There are a variety of hierarchical approaches which exploit the intuition that image structures occur at different scales and that multi-scale processing is required to capture long-range interactions within images [19, 53]. Our approach to hierarchies follows the strategy of starting with an over-segmentation of the image, produced by an efficient algorithm like [1], followed by recursive grouping to get larger segments at different levels of the hierarchy [4]. This relates closely to Segmentation by Weighted Aggregation [20], a recent variant [3], and extensions to video segmentation [48].

Detecting candidate regions, which make proposals for the positions and sizes of objects, is a new but increasingly important topic in computer vision. This is because it offers an efficient way to apply powerful methods, such as Deep Convolutional Neural Networks (DCNN) [34], to detect and recognize objects in images. Instead of needing to apply DCNNs exhaustively, at every image position and scale, it is only necessary to apply them to a limited number of candidate regions. Our method for detecting candidate regions differs from existing methods because we propose regions for both objects and background regions or “stuff” (e.g., sky). Recent work on detecting candidate regions includes methods which group segments into combinations [5, 6]. Most methods in the literature have been evaluated for finding segments which cover foreground objects [46], while ours detects background classes as well. Finally, there are other methods which differ in that they mainly exploit the edges instead of the appearance statistics [15, 29, 55]. We should also mention hierarchical segmentation which has been used to learn models of objects [43].

There is no existing work that directly addresses “base-detail” segmentation, but there is a large literature on closely related topics. In the digital image processing community there is a related concept, “base-detail separation”, but it is performed locally [7] by applying bilateral filters. A related topic is gain control which has been studied in primate visual systems, particularly in the retina, and seeks to compress the dynamic range of the input intensity while preserving the local contrast and detail [14, 42]. We note that detection of detail is also at the heart of many super-resolution methods [54] and it is related to image enhancement. Enhancement approaches do not typically use segment-based methods [25, 41] and instead use local methods like the bilateral filter in [7] or the weighted least squares [18]. There are some exceptions, like [49] where segment-wise exposure correction is proposed.

Another related topic is work in the shape from shading community, where intensity patterns are decomposed into smoothly varying shading patterns and more variable texture/albedo components [9, 26, 27] (here the base roughly corresponds to the shading and the texture/albedo to the detail). Researchers in shape from shading make prior assumptions for performing the decomposition into shading and texture/albedo [8] (which are not needed if the same object is viewed under different lighting conditions [47]). Similar decomposition assumptions are also applied to the classic Mondrian problem [32].

Predicting human eye fixations is a long-studied research topic [30]. In this paper we address only bottom-up saliency prediction, as performed in a free-viewing task, and do not consider top-down processes which involve cognitive factors, e.g., eye fixations when performing a task such as counting the animals in an image. One of the first successful methods for predicting human eye fixations was Itti’s original model [30]. Image signature is a simple method which gives good results [28], and other recent methods are reviewed in [11]. The most successful current method is Adaptive Whitening Saliency [21] and we make comparisons to it in our experiments. Finally, there are other works [37, 38] which study the saliency of visual objects and use candidate regions to make predictions [13]. Objects are judged to be salient based on the number of eye fixations which occur within them. By contrast, fixation saliency only predicts positions and outputs a fixation map (in Sect. 8.4.3, see Fig. 8.16, second column). The eye fixation saliency model we propose is based on base-detail decomposition, which makes it substantially different from any method in the literature. Our experiments show it performs at the state of the art.

Finally, although biology is outside the scope of this paper, we find it interesting that recent biological vision studies suggest that early visual processing is more sophisticated than traditional models of the retina and V1, which mainly emphasize linear spatiotemporal filters. For example, studies of the retina suggest that it is “smarter than scientists believed” [23] and contains a range of non-linear mechanisms which might perhaps be able to implement parts of the theory we propose here. Moreover, there is growing appreciation of the richness of the computations that can be performed in area V1 of the visual cortex, including possibly fixation saliency [51].

3 Method

In this section, we describe the details of the proposed approach. We address a set of fundamental low-level vision processes: segmentation, candidate region and salient object proposals, base-detail decomposition, image enhancement, and bottom-up saliency. Instead of treating them as separate tasks, we address them within a unified approach to bottom-up visual processing.

3.1 Segmentation: Hierarchical Image Partitioning

Image segmentation is a classic task of low-level vision. But in this paper we do not consider segmentation as a goal in itself. Instead, we seek to obtain a hierarchy of segmentations, or partitions of the image into segments, which can be used as components for other processing, as will be described in the next subsections.

An image partition is a decomposition of the image into non-overlapping subregions, or segments. More formally, we decompose the image lattice \(\mathcal{D}\) into a set of segments \(\{\mathcal{D}_{i}: i = 1,\ldots,n\}\) such that:

$$\displaystyle{\mathcal{D} =\bigcup _{ i=1}^{n}\mathcal{D}_{ i},\ \ \mathrm{s.t.}\ \mathcal{D}_{i}\bigcap \mathcal{D}_{j} =\emptyset,\ \forall i\neq j.}$$

A hierarchical partition of an image is a set of decompositions indexed by hierarchy level \(h = 1,\ldots,H\). Each level gives an image partition \(\mathcal{D} =\bigcup _{ i=1}^{n_{h}}\mathcal{D}_{ i}^{h}\), where \(n_{h}\) is the number of segments in the partition at level h. The decompositions are nested so that a segment \(\mathcal{D}_{i}^{h}\) at hierarchy level h is the union of a subset of segments at the previous level h − 1, so that \(\mathcal{D}_{i}^{h} =\bigcup _{j\in Ch(\mathcal{D}_{i}^{h})}\mathcal{D}_{j}^{h-1}\), where \(Ch(\mathcal{D}_{i}^{h})\) denotes the child segments of segment i at level h (in this paper each segment is constrained to have at most two immediate children, see Fig. 8.2-right). This enables us, by recursion, to express a segment in terms of compositions of its descendants in many different ways. In particular, we can decompose a segment into its descendants at the first level, \(\mathcal{D}_{i}^{h} =\bigcup _{j\in Des(\mathcal{D}_{i}^{h})}\mathcal{D}_{j}^{1}\). This hierarchical structure is common in the segmentation literature, for example in [4]. Figure 8.2 illustrates the hierarchical partitioning of an image.

Fig. 8.2

Left: Multiple levels in a hierarchy. Segments with a good coverage of objects or parts may happen at different levels. 80 % to 90 % of the segments can be discarded because they go across boundaries of objects or because they don’t cover a large area of an object. Right: Segments at level h + 1 are composed of one or two segments in level h
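To make the hierarchical structure concrete, the following sketch (hypothetical Python, not the authors' implementation) shows one way to represent nested segments with at most two children each, and to recover a segment's level-1 descendants and pixel support by recursion.

```python
# A minimal sketch of the nested hierarchy: each segment has at most two
# children, and any segment D_i^h can be expanded into its level-1 descendants.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Segment:
    level: int                                               # hierarchy level h
    pixels: Set[int] = field(default_factory=set)            # lattice positions (level 1 only)
    children: List["Segment"] = field(default_factory=list)  # at most two children

    def level1_descendants(self) -> List["Segment"]:
        """Recursively expand D_i^h into its descendants at h = 1."""
        if self.level == 1:
            return [self]
        out = []
        for child in self.children:
            out.extend(child.level1_descendants())
        return out

    def support(self) -> Set[int]:
        """Union of the pixel supports of the level-1 descendants."""
        pix = set()
        for leaf in self.level1_descendants():
            pix |= leaf.pixels
        return pix
```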

In this paper, our hierarchical partitioning is designed based on the following related considerations. Firstly, we prefer segments to have roughly homogeneous image properties, or statistics \(\vec{S}\) (e.g., color/texture/detail) at each level, which means that segments at the same level can vary greatly in size (e.g., segments on the grass in Fig. 8.2 will tend to be larger than in less homogeneous regions of the image, like the dog). Secondly, segments at higher levels should be less homogeneous because they are capturing larger image structures (e.g., by merging more homogeneous image structures together). Thirdly, segments are likely to have edges (i.e., image intensity discontinuities) near their boundaries. Fourthly, we want an efficient algorithm which can dynamically compute this hierarchy using local operations by merging/grouping segments at level h − 1 to compose larger segments at level h.

Our work is guided by standard criteria for image segmentation [33, 45, 52] which propose minimizing a cost function of form:

$$\displaystyle{ E(\{\mathcal{D}_{i}\},\{\vec{S}_{i}\}) =\sum _{i}\sum _{x\in \mathcal{D}_{i}}\vert \vec{S}_{i} -\vec{ S}(x)\vert ^{2} -\lambda \sum _{ i}\sum _{x\in \partial D_{i}}e(x). }$$
(8.1)

Here \(\vec{S}(x)\) denotes image statistics at position x (e.g., color, texture features), \(\vec{S}_{i}\) denotes the summary statistics of region i, \(\lambda\) is a non-negative constant, e(x) is a measure of edge strength (taking large values at image discontinuities), and \(\partial D_{i}\) is the boundary of segment \(\mathcal{D}_{i}\).
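For illustration, the following sketch (our own, with assumed array shapes) evaluates the cost (8.1) for a given label map, per-pixel statistics \(\vec{S}(x)\), and edge map e(x); the summary statistics \(\vec{S}_{i}\) are taken here to be per-segment means.

```python
# Illustrative evaluation of the segmentation cost (8.1): per-segment squared
# deviation from the segment's mean statistics, minus lambda times the edge
# strength summed along segment boundaries.
import numpy as np


def segmentation_energy(labels, S, edge, lam=1.0):
    """labels: (H, W) int segment ids; S: (H, W, d) statistics S(x);
    edge: (H, W) edge strength e(x); lam: the non-negative constant lambda."""
    energy = 0.0
    for i in np.unique(labels):
        mask = labels == i
        S_i = S[mask].mean(axis=0)               # summary statistics of region i
        energy += ((S[mask] - S_i) ** 2).sum()   # data term
    # Boundary pixels: label differs from the right or lower 4-neighbour.
    boundary = np.zeros_like(labels, dtype=bool)
    boundary[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    boundary[:-1, :] |= labels[:-1, :] != labels[1:, :]
    energy -= lam * edge[boundary].sum()
    return energy
```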

We initialize our algorithm by using the SLIC algorithm [1] to compute the lowest level, h = 1, of our hierarchy. Essentially, SLIC performs an expectation-minimization of (8.1) for a fixed number \(n_{1}\) of segments. It uses color and position as statistics, without an edge term, that is, \(\lambda = 0\) in (8.1). More precisely, \(\vec{S}(x) = (l(x),a(x),b(x),x)\), where l, a, b are the channels of the Lab color opponent space and x denotes 2D spatial position.
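A minimal sketch of this initialization using the scikit-image SLIC implementation follows; the file name and parameter values are illustrative, not those used in the chapter.

```python
# Level h = 1 initialization with an off-the-shelf SLIC implementation, plus
# per-segment Lab colour and centroid statistics, matching S(x) = (l, a, b, x).
import numpy as np
from skimage import io
from skimage.color import rgb2lab
from skimage.segmentation import slic

image = io.imread("example.jpg")     # hypothetical input image
labels1 = slic(image, n_segments=600, compactness=10, start_label=1)

lab = rgb2lab(image)
ys, xs = np.mgrid[0:image.shape[0], 0:image.shape[1]]
stats1 = {}
for i in np.unique(labels1):
    m = labels1 == i
    # Mean Lab colour and centroid of superpixel i at level 1.
    stats1[i] = np.concatenate([lab[m].mean(axis=0), [xs[m].mean(), ys[m].mean()]])
```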

Next, we proceed to construct the hierarchy by grouping/merging segments which have similar image statistics. The statistics are extended to include texture, the shape of segments, and the variance of color (we do not use these statistics at the bottom level because the segments are too small to compute them reliably). More precisely, \(\vec{S}\) is given by the mean and the standard deviation of the Lab color components and of the first and second derivatives of the l channel, \((l,a,b,\nabla _{x}l,\nabla _{y}l,\nabla _{x}^{2}l,\nabla _{y}^{2}l)\), together with the centroid of the segment and the dimensions of its bounding box, \((c_{x},c_{y},d_{w},d_{h})\). When merging, we use an asymmetric criterion which compares the statistics of the union of two segments i and j, \(\vec{S}_{i\bigcup j}\), with the statistics of each segment \(\vec{S}_{i},\vec{S}_{j}\), that is, \(\vert \vert \vec{S}_{i\bigcup j} -\vec{ S}_{i}\vert \vert \) and \(\vert \vert \vec{S}_{i\bigcup j} -\vec{ S}_{j}\vert \vert \). This is because our segments are allowed to have different sizes and we want to discourage bigger segments from merging with smaller segments if doing so would greatly change the statistics of either of them. Intuitively, a big segment's statistics are likely to change little when it merges with a small one, but we want to ensure that the small one does not undergo a big change in its statistics. At each level of the hierarchy we allow the top-ranked 30 % of segments to merge with another segment (the ranking is based on the asymmetric criterion described above and prioritizes similar segments) but prevent merges where the asymmetric condition is violated. Merging is allowed between 1st and 2nd neighbors only. The precise details are described in [10].
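The asymmetric merge test can be sketched as follows; this is a simplification under the assumption that union statistics are size-weighted means, and the threshold tau is ours for illustration (the exact criterion is given in [10]).

```python
import numpy as np


def union_statistics(S_i, n_i, S_j, n_j):
    # Size-weighted mean statistics of the union (an assumption; the chapter
    # does not spell out how union statistics are formed for every feature).
    return (n_i * S_i + n_j * S_j) / (n_i + n_j)


def asymmetric_merge_score(S_i, n_i, S_j, n_j):
    # Larger of the two changes ||S_{i U j} - S_i|| and ||S_{i U j} - S_j||;
    # lower scores rank higher, so similar segments merge first.
    S_u = union_statistics(S_i, n_i, S_j, n_j)
    return max(np.linalg.norm(S_u - S_i), np.linalg.norm(S_u - S_j))


def allow_merge(S_i, n_i, S_j, n_j, tau):
    # Veto the merge if either segment's statistics would change by more than tau.
    return asymmetric_merge_score(S_i, n_i, S_j, n_j) <= tau
```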

The output is a hierarchical partition of the image. It is expressed as a set of segments \(\{\mathcal{D}_{i}^{h}\},\,1 \leq h \leq H,\,1 \leq i \leq n_{h}\), where h is the hierarchy level. At the highest level, \(n_{H} = 1\). Each image region \(\mathcal{D}_{i}^{h}\) has statistics \(\vec{S}_{i}^{h}\). Each level h gives a partition of the image \(\mathcal{D} =\bigcup _{ i=1}^{n_{h}}\mathcal{D}_{ i}^{h}\). Each segment is composed of a set of child segments, \(\mathcal{D}_{i}^{h} =\bigcup _{j\in Ch(\mathcal{D}_{i}^{h})}\mathcal{D}_{j}^{h-1}\). Each segment can also be associated with its descendant segments at the h = 1 level: \(\mathcal{D}_{i}^{h} =\bigcup _{j\in Des(\mathcal{D}_{i}^{h})}\mathcal{D}_{j}^{1}\). This hierarchical partition can be used directly for image segmentation but, in the spirit of this paper, we think of it as a representation that can be used to address several different visual tasks, as described in the next few sections.

3.2 Candidate Regions

This section shows how to use the hierarchical partition to obtain candidate regions, or proposals, for both foreground objects and background regions, or “stuff” (e.g., sky, water, grass). Proposing candidate regions enables algorithms to concentrate computational resources, e.g., deep networks, at a limited number of locations (and sizes) in images, instead of having to search for objects at all positions and all scales. It also relates to the study of salient objects [2, 37]: psychophysical studies show that humans have a tendency to look at salient objects [16]. Note, however, that salient objects do not predict human eye fixations well [12]; in a free-viewing task, fixations are better described by bottom-up saliency cues [30]. However, methods that combine bottom-up saliency cues with candidate region proposals do perform well both for predicting human eye fixations and for detecting salient objects [38].

We create candidate region proposals by the following strategy. Firstly, we select a subset of segments from the hierarchical partition of the image; these selected segments are chosen to be roughly homogeneous but as large as possible. Secondly, we make compositions of up to three selected segments to form a candidate region. These compositions obey simple geometric constraints (proximity and similarity of size). The intuition for our approach is that many foreground objects and background “stuff” can be roughly modeled by three segments or fewer, see Fig. 8.4. This intuition was validated in [10] using the extended labeling of Pascal VOC [40], which contains per-pixel labels for 57 object and “stuff” classes.

The selected segments are chosen by computing the entropy gain of the combination of two child segments into their parent segment. If the entropy gain is small, then we do not select the child segments because this is evidence that they are part of a larger entity. But if the entropy gain is large, then we add the child segments to our set of selected segments. More precisely, we establish a constant threshold G for the entropy gain g after merging two segments \(\mathcal{D}_{i}^{h},\mathcal{D}_{j}^{h}\) into their parent \(\mathcal{D}_{m}^{h+1} = \mathcal{D}_{i}^{h}\bigcup \mathcal{D}_{j}^{h}\). The entropy gain is defined to be:

$$\displaystyle{ g = \mathcal{H}(\mathcal{D}_{m}^{h+1}) -\left \{\mathcal{H}(\mathcal{D}_{ i}^{h}) + \mathcal{H}(\mathcal{D}_{ j}^{h})\right \}. }$$
(8.2)

Here \(\mathcal{H}(\mathcal{D}_{i}^{h})\) is the entropy of a segment i at level h, computed from the statistics \(\{\vec{S}_{k}^{1}\},\,k \in Des(\mathcal{D}_{i}^{h})\) of its descendant segments at level h = 1 (Fig. 8.3). The entropy is computed in a non-parametric manner [10] using the approximation proposed in [35]. See an example of triplets of selected segments in Fig. 8.4.
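The entropy-gain test (8.2) can be sketched as below; note that we substitute a Gaussian entropy approximation over the level-1 descendant statistics purely for illustration, whereas the chapter uses the non-parametric estimator of [35].

```python
# Sketch of the entropy-gain criterion (8.2) using a Gaussian stand-in for the
# non-parametric entropy estimate. Assumes each segment has >= 2 descendants.
import numpy as np


def gaussian_entropy(S_desc):
    """Differential entropy of a Gaussian fit to the rows of S_desc (n x d)."""
    d = S_desc.shape[1]
    cov = np.cov(S_desc, rowvar=False) + 1e-6 * np.eye(d)  # regularized covariance
    sign, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)


def entropy_gain(S_desc_i, S_desc_j):
    """g = H(parent) - (H(child i) + H(child j)); children whose merge gives a
    large gain are kept as selected segments."""
    parent = np.vstack([S_desc_i, S_desc_j])
    return gaussian_entropy(parent) - (gaussian_entropy(S_desc_i)
                                       + gaussian_entropy(S_desc_j))
```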

Fig. 8.3

Entropy gain (Sect. 8.3.2): When segments a and b are merged, the increase of entropy is not as big as if they were merged with c. Homogeneity criterion (Sect. 8.3.3.1): Segment c is homogeneous. It presents smooth variation due to shading and lighting. Segments a and b are not homogeneous. Both entropy and homogeneity are calculated from the small (first level \(\mathcal{D}_{i}^{1}\)) segments, illustrated with white contours

Fig. 8.4

Examples of candidate regions for foreground and background regions. Left-to-right and top-to-bottom: image, top three selected-segments for left car, right car, person, building, grass, ground, trees, and ground truth. Most objects are covered well by two to three selected-segments

3.3 Base-Detail Decomposition

This section analyzes the image intensities within the segments by decomposing the image into base and detail. The base B(x) component is the approximate color of the region, and is required to be spatially smooth. The detail R(x) is the residual \(R(x) = I(x) - B(x)\) and can contain general texture, such as the patterns of grass on a lawn, or structured detail such as the writing on the label of a wine bottle.

Base-detail relates to several well-studied phenomena. Firstly, it is similar to the task of preserving image contrast (i.e., the detail) performed by the early visual system when doing gain control. Secondly, it relates to the decomposition \(I(x) = a(x) \cdot \vec{ n}(x) \cdot \vec{ s}(x)\) of images into albedo, normals and illumination when computing intrinsic images or the 2.5D sketch. But this decomposition is higher-level, relying on concepts like geometry and lighting sources, while we are modeling at a lower level. We note that in some special situations the base and the detail of a segment may correspond to the shading and the albedo of an object. Thirdly, base-detail also relates to transparency – e.g., the viewing of images through a dirty window (the dirt is the detail) – or when there is partial occlusion like tree leaves in front of a building (the leaves are details). More generally, within image regions there is a base appearance which changes smoothly within segments and detail which changes in a more jagged manner. This differs from the base-detail separation [7] studied in the image processing literature, which is obtained by local smoothing methods and not in a segment-wise manner.

We address base-detail decomposition in two steps. Firstly, we seek a segmentation of the image into regions which are as homogeneous and as large as possible. This is done by selecting a subset of those hierarchy segments \(\{\mathcal{D}_{i}^{h}\}\) which are maximally large and homogeneous and form a partition of the image. Note that this includes segments at different levels h of the hierarchy. Secondly, within each segment we fit a low-order polynomial to the color intensities and define the best fit polynomial to be the base (see Sect. 8.3.3.2). We obtain the detail by computing the residual between the image and the base.

3.3.1 Finding Maximally Large Homogeneous Segments

Here we present a criterion for selecting non-overlapping segments from the hierarchy (while in Sect. 8.3.2 we presented a way to select overlapping segments from the hierarchy). We start from the segmentation hierarchy \(\{\mathcal{D}_{i}^{h}\}\) defined in Sect. 8.3.1. We define the heterogeneity of a segment \(\mathcal{D}_{i}^{h}\) by the maximum difference of the statistics of its neighboring descendant nodes at level h = 1. More precisely, we define the heterogeneity of segment \(\mathcal{D}_{i}^{h}\) to be:

$$\displaystyle{ \max _{j,k\in Des(\mathcal{D}_{i}^{h}):\ d_{G}(j,k)\leq 2}\vert \vert \vec{S}_{j}^{1} -\vec{ S}_{k}^{1}\vert \vert, }$$
(8.3)

where \(d_{G}(j,k)\) is the graph distance between j and k at level h = 1 (i.e., we evaluate only the 1st and 2nd neighbors). This criterion considers homogeneous those segments whose statistics at level h = 1 change smoothly across the segment. This typically happens in large segments such as sky, roads, or animals. Heterogeneous segments are those which have an abrupt change in their statistics.

We then fix a threshold \(t_{max}\) and generate an image partition

$$\displaystyle{ p_{t_{max}}(I(x)) \subset \{\mathcal{D}_{i}^{h}\}, }$$
(8.4)

containing the biggest segments whose heterogeneity is less than \(t_{max}\). This can be done by starting at the top level h = H, keeping any node whose heterogeneity is less than \(t_{max}\), proceeding to the child nodes otherwise, and continuing down the hierarchy until the heterogeneity threshold is met. Thus, the result is a set of non-overlapping segments covering the whole image space. Note that this is different from the entropy gain criterion used in Sect. 8.3.2, which allows overlapping segments to be selected, since interesting structures can occur at different levels (e.g., windows as subparts of a house).
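A sketch of this top-down selection follows, assuming segment nodes with `level` and `children` attributes (as in the hierarchy sketch of Sect. 8.3.1) and a `heterogeneity` function implementing (8.3).

```python
# Top-down selection of maximally large homogeneous segments (8.3)-(8.4):
# keep a node if its heterogeneity is below t_max, otherwise recurse into
# its children; level-1 segments are always kept.
def select_homogeneous(segment, t_max, heterogeneity):
    """Return the list of non-overlapping segments forming p_{t_max}(I)."""
    if segment.level == 1 or heterogeneity(segment) < t_max:
        return [segment]                 # keep the whole segment
    selected = []
    for child in segment.children:
        selected.extend(select_homogeneous(child, t_max, heterogeneity))
    return selected
```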

3.3.2 Base Modeling and Detail

We assume that the image can be expressed as \(I(x) = B(x) + R(x)\) where x is 2D position, B(x) is base and R(x) is detail (residual of the base). Both of them include all image channels. We assume that the base is spatially smooth within each maximally large homogeneous segment and, in particular, that its color intensity can be modeled by a low-order polynomial. We make no assumption about the spatial form of the detail. (Note that for intrinsic images it is typically assumed that the shadows are spatially smooth while the texture/albedo is more jagged.)

More precisely, we define the base color of a segment by a polynomial approximation \(b_{k}(\vec{x}_{i},\vec{\omega })\) of order k, where k ≤ 3. See examples in Fig. 8.5. We apply the polynomial approximation on each channel separately. The number of parameters \(\vec{\omega }\) depends on the order of the polynomial and we use model selection to decide the order for each segment (we must avoid fitting a high-order polynomial to a small segment). These polynomial approximations are of form:

$$\displaystyle\begin{array}{rcl} & & b_{k}(\vec{x},\vec{\omega }) =\vec{ x}^{T}\vec{\omega }, \\ & & k = 0:\:\vec{ x} = 1,\;\vec{\omega }=\omega _{0} \\ & & k = 1:\:\vec{ x} = [1,x_{1},x_{2}],\;\vec{\omega }= [\omega _{0},\omega _{1},\omega _{2}] \\ & & k = 2:\:\vec{ x} = [1,x_{1},x_{2},x_{1}^{2},x_{2}^{2},x_{1}x_{2}],\;\vec{\omega }= [\omega _{0},\cdots \,,\omega _{5}] \\ & & k = 3:\:\vec{ x} = [1,x_{1},x_{2},x_{1}^{2},x_{2}^{2},x_{1}x_{2},x_{1}^{3},x_{2}^{3},x_{1}x_{2}^{2},x_{2}x_{1}^{2}],\;\vec{\omega }= [\omega _{0},\cdots \,,\omega _{9}]{}\end{array}$$
(8.5)
Fig. 8.5

Examples of polynomial base approximations. Left: original. Center: 0-order approximation (i.e., mean). Right: 0-order to 3rd-order approximation

The estimation of the parameters \(\vec{\omega }\) of the polynomial is performed by linear least squares via QR factorization [24]. The order k is selected based on the error, with a regularization term biasing towards lower orders. See Fig. 8.6-right. The regularization is weighted by ζ, whose value is not critical (it is set to produce models of all orders k, and not only k = 3). In a given segment we have a set of pixels with 2D positions x and color intensity values \(I_{c}(x)\). For a given channel c of the segment \(\mathcal{D}_{i}^{h}\), we minimize:

$$\displaystyle{ \min _{\vec{\omega },k}\sum _{x\in \mathcal{D}_{i}^{h}}\left (I_{c}(x) - b_{k}(\vec{x},\vec{\omega })\right )^{2} +\zeta k. }$$
(8.6)
Fig. 8.6

Different segments can have different polynomial order k. Left: original. Center: polynomial base approximation. Right: order of the polynomial, where: dark-blue: k = 0; light-blue: k = 1; yellow: k = 2; red: k = 3

We estimate the base \(B_{c}(x)\) of each color channel c for the whole image by fitting the polynomial within each maximally large homogeneous segment. Then, we estimate the detail as the residual \(R_{c}(x) = I_{c}(x) - B_{c}(x)\).
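The per-segment fit of (8.5)–(8.6) can be sketched as follows; we use numpy's least-squares solver in place of an explicit QR factorization, and the value of ζ is illustrative.

```python
# Per-segment base fit: build the polynomial design matrix for order k, solve
# least squares, and keep the order minimising residual + zeta * k, as in (8.6).
import numpy as np


def design_matrix(x1, x2, k):
    """Polynomial basis of order k (k <= 3) at positions (x1, x2), as in (8.5)."""
    cols = [np.ones_like(x1)]
    if k >= 1:
        cols += [x1, x2]
    if k >= 2:
        cols += [x1**2, x2**2, x1 * x2]
    if k >= 3:
        cols += [x1**3, x2**3, x1 * x2**2, x2 * x1**2]
    return np.stack(cols, axis=1)


def fit_base(x1, x2, intensities, zeta=1.0):
    """Return the fitted base values for one channel of one segment."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    best_cost, best_base = np.inf, None
    for k in range(4):
        X = design_matrix(x1, x2, k)
        w, *_ = np.linalg.lstsq(X, intensities, rcond=None)
        base = X @ w
        cost = ((intensities - base) ** 2).sum() + zeta * k  # model selection (8.6)
        if cost < best_cost:
            best_cost, best_base = cost, base
    return best_base
```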

Our current method works well in most cases, see Fig. 8.7, but it is not appropriate for segments where the amount of detail is similar to the amount of base appearance. This happens, for example, for an image of a leafy tree with blue sky behind it. Such situations require a more complex model which has a prior on the details and allows the base to be fit by a more flexible function (but still smooth).

Fig. 8.7

Example of detail (right image) in front of different appearance segments: sky, road, and building

“Base-detail” provides a unified model for several visual tasks that are often modeled separately. These include: (I) Elementary tasks such as gain control, which converts the large dynamic range of luminances into a smaller range of intensities which can be encoded by neurons and transmitted to the visual cortex. A standard hypothesis is that it is performed by ganglion cells in the retina, using Difference of Gaussian or Laplacian of Gaussian filters [39] to preserve the contrast while removing the base. From our perspective, the contrast is the detail. (II) Decomposition of intensity into albedo and shading patterns as required by shape from shading algorithms [26, 27] when used to construct the 2.5D sketch [39] or intrinsic image [9]. The difference is that we do not estimate 3D geometry, noting that intrinsic image models make strong assumptions about images which are often invalid (e.g., smooth intensity patterns can be due to light sources at finite distance and not to the geometry of the viewed surface). (III) Separation of texture from background. Here the detail represents the texture patterns, e.g., the blades of grass, while the base is a smooth green intensity pattern. (IV) Decomposing images into frequency components. In this case, the detail is analogous to the high frequencies. But frequency analysis is based on linear analysis of images while our approach is inherently nonlinear because it involves segmentation. (V) Image compression. Base-detail suggests a strategy where the base efficiently encodes the rough appearance and the detail encodes the rest. It captures the natural intuition that regions which have a lot of detail are harder to compress.

3.4 Image Enhancement

We illustrate how base-detail decomposition can be used for image enhancement. In Fig. 8.8-bottom, we plot \(B(x) +\vartheta R(x)\) for different \(\vartheta\) values, where \(\vartheta\) is a parameter indicating the amount of enhancement. Another example is shown in Fig. 8.9. Our approach opens the door to segment-wise manipulation, which is useful in common situations such as when different segments have different illumination.
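A minimal sketch of the enhancement step, assuming the base B and detail R are float arrays in the [0, 1] intensity range:

```python
# Scale the detail by theta and add it back to the base, as in Figs. 8.8-8.9.
import numpy as np


def enhance(B, R, theta=2.0):
    """Return B(x) + theta * R(x), clipped to the valid intensity range."""
    return np.clip(B + theta * R, 0.0, 1.0)
```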

Fig. 8.8

Top: Original I(x), base B(x), detail R(x), and detail magnitude \(\vert \vert R(x)\vert \vert ^{2}\) (shown for better visualization). Bottom: Base + detail \(B(x) +\vartheta R(x)\), with different amounts of detail, \(\vartheta =\{ 0.5,1,2,4\}\)

Fig. 8.9

Example of enhanced image. Weak details can be multiplied to become more visible with respect to the base

Note that the widely used bilateral filter [44] is very local compared to our segmentation-based approach. In Fig. 8.10 we show an example of base-detail decomposition produced by bilateral filtering. For example, the top-right cloud in the image cannot be separated as a detail by the bilateral filter, but it is successfully separated as detail using our approach.

Fig. 8.10

Limitations of bilateral filter. From left to right: Bilaterally filtered (BLF) image; Residual (detail) of the bilateral filtering; Zoom-in of the residual; Zoom-in of the detail that our segmentwise base-detail decomposition produces

3.5 Saliency

Images of three-dimensional scenes contain structure at different scales and resolutions. Humans often need to foveate specific image locations to acquire a higher level of detail. For example, a small image blob might correspond to a person walking towards you and require further investigation. In this work we consider only bottom-up attention, where fixation saliency is used to predict the first few seconds (3 s) of free viewing of an image. The prediction is a probability map which does not take the order of fixations into account. (By contrast, in top-down attention humans actively search for specific objects or scene structures.)

Our saliency model takes as input the base-detail decomposition \(B_{c}(x), R_{c}(x)\), generated for a partition \(p_{t_{max}}(I(x))\), defined in (8.4), whose heterogeneity threshold is \(t_{max}\) (Sect. 8.3.3). Note that candidate regions are not used here. Each image pixel x is assigned to the segment i(x) which contains it and we define \(\vert \mathcal{D}_{i(x)}\vert = size(\mathcal{D}_{i})\) if \(x \in \mathcal{D}_{i}\), where \(\{\mathcal{D}_{i} \in p_{t_{max}}(I(x))\}\) are the segments of the partition. Similarly, we evaluate each segment's average detail and assign this value to all pixel positions of the segment support, obtaining \(A(x) = \frac{1} {size(\mathcal{D}_{i})}\sum _{z\in \mathcal{D}_{i}}R^{A}(z)\) if \(x \in \mathcal{D}_{i}\). Here, \(R^{A}(z)\) is the mean of the detail's \(n_{c} = 3\) color channels at position z, that is, \(R^{A}(z) = \frac{1} {n_{c}}\sum _{c=1}^{n_{c}}R_{c}(z)\). We use the segment sizes and the segmentwise average detail to weight the maximum-channel detail \(R^{M}(x) =\max _{ c=1}^{n_{c}}R_{c}(x)\) (see Fig. 8.11). The weight we propose is given by \(W(x) = \sqrt{A(x)/\vert \mathcal{D}_{i(x)}\vert }\). The fixation saliency is then:

$$\displaystyle{ saliency(x) = R^{M}(x)\left [\left (1-\gamma \right )W(x)+\gamma \right ]. }$$
(8.7)

Here, γ is a small number, γ = 0.15 in our experiments. It allows a fraction of \(R^{M}(x)\) to remain unweighted, which is useful for pixels whose weight is close to zero, W(x) ≈ 0. The γ parameter ensures that the detail R(x) is never completely ignored.

Intuitively, we relate the detail (Fig. 8.11-left) to bottom-up saliency. However, we penalize detail which belongs to large segments, without eliminating it completely (Fig. 8.11-right). An illustration of how an image would look with this kind of detail penalization is shown in Fig. 8.12. The use of segment size as an important saliency factor could be related to figure-ground pre-attentive mechanisms in V1. In terms of V1 neuron responses, very small regions tend to be highlighted against larger regions [50], but in this paper we do not address neurophysiology.
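Putting the pieces together, the saliency map (8.7) can be sketched as follows; how signed detail values are averaged is not specified in the text, so clamping negative segment averages is an assumption of this sketch.

```python
# Sketch of the fixation-saliency map (8.7), given a label map over the
# partition p_{t_max} and the per-channel detail R of shape (H, W, 3).
import numpy as np


def base_detail_saliency(labels, R, gamma=0.15):
    RA = R.mean(axis=2)                     # channel-average detail R^A(x)
    RM = R.max(axis=2)                      # maximum-channel detail R^M(x)
    A = np.zeros_like(RA)                   # segmentwise average detail A(x)
    size = np.zeros_like(RA)                # segment size |D_{i(x)}|
    for i in np.unique(labels):
        m = labels == i
        A[m] = RA[m].mean()
        size[m] = m.sum()
    # Clamp negative averages before the square root (assumption of this sketch).
    W = np.sqrt(np.maximum(A, 0.0) / size)  # weight factor W(x)
    return RM * ((1.0 - gamma) * W + gamma)
```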

Fig. 8.11

From left to right: Maximum-channel detail \(R^{M}(x)\); segmentwise average detail A(x); segment size \(\vert \mathcal{D}_{i(x)}\vert \); weight factor \(W(x) = \sqrt{A(x)/\vert \mathcal{D}_{i(x)}\vert }\); \(saliency(x) = R^{M}(x)[(1-\gamma )W(x)+\gamma ]\)

Fig. 8.12

An illustration of how our saliency model penalizes the detail in large roughly-homogeneous segments. The representation on the right is obtained by \(I_{c}^{{\prime}}(x) = B_{c}(x) + W(x)R_{c}(x)\), for each color channel c

Our hypothesis is that regions which cannot be described by a simple model require foveation. This is the case of small regions with a lot of detail. The segments that are less likely to require foveation are those which are fit well by a simple polynomial model (have little detail), as well as those which have detail but are large. In the latter case, the detail is likely to be due to a texture pattern, e.g., grass.

Classical models (e.g., [30]) use multiscale processing. Instead, our segment-based approach adapts to the local scales of the image. Also, unlike classical models, we do not explicitly use neural mechanisms such as center-surround receptive fields and lateral inhibition. But it can be argued that base-detail decomposition implicitly accomplishes similar functions.

The proposed fixation saliency method predicts human fixations. Note that this is different from salient object proposals. It is possible to link human fixation predictions and candidate regions by machine learning, as shown in [38], but in this work we do not address this issue.

4 Experiments

In this section, we present results for the candidate region proposals (Sect. 8.3.2) and for bottom-up saliency (Sect. 8.3.5) as a prediction of free-viewing human fixations. Both are based on the bottom-up segmentation we propose (Sect. 8.3.1). The fundamental theory behind the saliency method is the base-detail decomposition (Sect. 8.3.3). We do not evaluate base-detail decomposition and image enhancement because there is no natural way to evaluate them directly. We do not focus on image segmentation, so we do not include experiments on it.

4.1 Datasets

Many of the classic datasets are biased because they were collected with a specific purpose, i.e., for saliency experiments. They are mostly composed of iconic photographs, presenting a clearly salient and centered object over a simple background. But this is highly atypical of natural images, which typically include many objects with complex relations and partial occlusions (humans rarely see iconic images). Hence it is arguably more realistic to study saliency on natural image datasets such as Pascal, which has been one of the leading reference benchmarks in computer vision in recent years. Recently, Hou et al. [38] released the free-viewing fixations of 8 participants on a subset of 850 images of Pascal (first 3 s). In this subset there is an average of 5.18 foreground objects per image and an average of 2.93 background objects. An extreme case is the rightmost image in Fig. 8.13, which has 52 foreground objects, most of which are far from the center of the image. A representative case is the third image from the right in Fig. 8.13, with 6 foreground objects.

Fig. 8.13

(a) Examples of iconic images from ImgSal [36]; (b) Examples from a non-iconic dataset: Pascal VOC [17]

For our candidate regions experiments, we use a subset of 1,288 images of Pascal VOC, for comparison with [6], as detailed in [10]. For the bottom-up saliency experiments, we use the 850 images of Pascal-S which include human fixations. We also experiment on the 1,003 images of the standard dataset Judd [31], which can be considered non-iconic, although we have no statistics of the number of objects or their distribution in the images.

4.2 Candidate Regions

In this section, we evaluate the coverage of our candidate regions. We obtain an average of 116 selected segments per image, and from these we make an average of 721 combinations, which constitute the pool of candidate regions for each image. The evaluation metric is Intersection over Union (IoU): the number of pixels in the intersection between a candidate region and a ground-truth region divided by the number of pixels in their union.
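For reference, the region-based IoU is computed as in the following sketch, with candidate and ground-truth regions given as boolean masks.

```python
# Intersection over Union between two boolean (H, W) region masks.
import numpy as np


def region_iou(candidate_mask, gt_mask):
    intersection = np.logical_and(candidate_mask, gt_mask).sum()
    union = np.logical_or(candidate_mask, gt_mask).sum()
    return intersection / union if union > 0 else 0.0
```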

We evaluate the generated candidate regions with the 57-class ground truth, containing both foreground and “stuff” classes. We compare our Candidate Regions (CR) to three state-of-the-art methods. (I) The classical Constrained Parametric Min-Cuts (CPMC) [15] method is designed for foreground objects, which explains its better performance on the foreground classes. Its overall performance on the 57 classes is lower than ours. (II) In [6], the segment combinations are generated by taking combinations of the 150 segments (on average) that their hierarchical segmentation approach outputs for each image. Their method is more sophisticated than ours and we observe that they tend to get larger and less homogeneous segments than we do. Our performance is lower but comparable: 74 % IoU versus 77 % for [6]. But we achieve it with nearly half the number of combinations – 721 compared to 1,322 – and with a simpler and faster algorithm (4 s per image in its Matlab prototype). In Table 8.1 we refer to their segments as UCM-combs and to our candidate regions as CR-combs. (III) The Selective Search [46] method is competitive in terms of speed. Our method outperforms theirs on the region candidates task (74.0 % compared to 67.8 % IoU), with less than half the number of proposals. (Note, however, that [46] presents results for bounding boxes and not for regions.) See Fig. 8.14 for an example of the proposals generated by our CR-combs method, and Table 8.1 for the region-based IoU and recall results.

Fig. 8.14

Left-to-right and top-to-bottom: Original image, top three segments for bike, wall, snow, rock, and ground truth. Note that the segments are good even for object classes that perform poorly overall (e.g., bike)

Table 8.1 Region-based IoU (in %) comparison. CPMC [15], UCM [6], Sel. Search [46], and our CR – candidate regions. Boldface denotes the first and second best results

4.3 Saliency

The fixation saliency method that arises from our unified approach predicts free-viewing human fixations surprisingly well. Despite only accounting for saliency within segments and not taking inter-segment saliency into account, our method is among the best-performing on complex datasets such as Pascal-S [38] and Judd [31]. Pascal is a particularly interesting case because state-of-the-art fixation methods have low performance on it (perhaps because they were developed and tested on iconic images). On Pascal our method outperforms the state of the art. On the Judd dataset only AWS [21] outperforms our method.

In Fig. 8.15 we show a comparison of our Base-Detail Saliency (BDS) method, Adaptive Whitening Saliency (AWS) [21], Image Signature (SIG) [28], and L. Itti's original model [30]. In Fig. 8.16 we show some examples for a qualitative comparison between the results of the different algorithms.

Fig. 8.15

Bottom-up saliency performance. Left: Pascal-S dataset [38]. Right: Judd dataset [31]. Approaches compared: Our Base-Detail Saliency (BDS), Adaptive Whitening Saliency (AWS, [21]), Image Signature (SIG, [28]), L. Itti’s original model (Itti, [30])

Fig. 8.16

From left to right: Original; Human – fixations collected from 8 subjects in a free-viewing task, first 3 s [38]; Itti's original model [30]; Image Signature (SIG) [28]; AWS [21]; Our Base-Detail Saliency (BDS)

It is hard to determine the failure modes of our saliency algorithm. Our reliance on segmentation may seem problematic: segmentation is an ill-posed problem and no low-level segmentation algorithm can reliably detect the boundaries of objects without top-down assistance. But our approach is more robust because we rely only on a proto-segmentation. Still, errors in the segmentation can propagate to the base-detail decomposition and may cause our approach to fail.

5 Conclusions

We propose a unified approach addressing a set of early-vision bottom-up processes: segmentation, candidate regions, base-detail decomposition, image enhancement, and saliency for fixation prediction.

Our unified approach allows the segmentwise decomposition of the image into “base” and “detail”. This proves to be more versatile than a local smoothing of the image. It directly provides for image enhancement and for a novel model of fixation saliency, and it relates to other vision topics that are usually formalized as separate problems.

We show state-of-the-art results for our candidate regions and for our saliency model for free-viewing fixation prediction. For the latter we use the psychophysics data available for the Pascal VOC dataset, which is non-iconic and particularly difficult for state-of-the-art saliency algorithms.