1 Introduction

1.1 Texture Synthesis

Texture synthesis is the process of generating new texture images from a given image sample. Being able to perform such a synthesis is of practical interest both for computer graphics, where it is needed to give a realistic aspect to 3D models, and for image restoration, where it may help the reconstruction of missing parts through inpainting.

One of the most difficult aspects of texture synthesis is the ability to deal with the wide variety of texture images. In particular, it is relatively difficult to synthesize textures at all observation scales using a given model. Before being more specific in the next section, we can roughly split texture synthesis methods into two groups. On the one hand, non-parametric patch-based methods, initiated by the seminal works of Efros and Leung [10] and Wei and Levoy [39], have proven their efficiency for synthesizing highly structured textures, such as brick walls or pebbles seen at a close distance. On the other hand, methods relying on statistical constraints, initiated by the work of Heeger and Bergen [16], provide efficient means for the synthesis of textures with small-scale oscillations, or micro-textures, such as sand or stone.

In this work, we propose to rely on a variational approach to build a synthesis method that allows at the same time for the reconstruction of geometric textures, thanks to the use of sparsity in a patch-based dictionary, and for the reconstruction of small-scale oscillations, thanks to constraints on the spectrum of images. Before presenting the contributions of this paper in more detail, and in order to explain both its inspiration and its originality, we start with a short presentation of previous works in this field.

1.2 Previous Works

1.2.1 Patch-Based Approaches

The first methods yielding realistic texture synthesis rely on a Markovian hypothesis: the distribution of a pixel given the rest of the texture only depends on the values in a (small) neighborhood. These methods are mostly parametric, following the work in [7]. The law of a pixel given its neighborhood is chosen a priori and its parameters can be learned from an exemplar using a maximum-likelihood estimation.

Non-parametric approaches [10, 39] were introduced later to handle different kinds of textures without designing the law of each of them by hand. For a given pixel, its value is sampled directly from an exemplar image, by seeking pixels whose square neighborhood (called a patch) is similar to the one of the pixel to be synthesized. This approach was a clear breakthrough in the field of texture synthesis, producing results that are visually almost perfect on difficult and structured examples. It was followed by a large body of work and many extensions were proposed, for instance to perform real-time [23], dynamic, or volumetric texture synthesis. In particular, it is common to sample a small square area (a patch) at a time, instead of a single pixel [9]. The reader may refer to the state-of-the-art report [38] for an overview of these approaches applied in different contexts.

One common drawback of these approaches, however, is that the resulting texture is often a plain juxtaposition of small pieces from the exemplar, even when sampling pixels one at a time [1]. In this sense they have a limited innovation capacity. An efficient way to avoid this drawback is to perform the synthesis from a dictionary of patches that is learned from the exemplar. Sparse decomposition methods such as [11] are classically applied to image denoising or enhancement. They rely on the assumption that each patch of an image can be decomposed as a sum of a small number of elements from a dictionary. The dictionary is usually learned from the image to be restored. These approaches were introduced in [28] as an efficient coding of natural images, inspired by the human vision system. More efficient algorithms for dictionary learning were proposed later: for instance the method of optimal direction (MOD) [12], and the K-SVD algorithm [2]. Recently, one author of the present paper has shown that this framework is well suited to texture modeling [30]. In particular, sparse dictionary learning has the remarkable ability to adaptively extract basic elements, or textons, from a texture. In [30], it is shown that new texture patches can be created from a learned dictionary, simply by imposing a sparsity constraint on their use of the dictionary atoms.

1.2.2 Statistical Approaches

Another set of approaches is based on statistical constraints. A set of statistics is identified and the synthesis process consists in generating a new image whose selected statistics match those of the original texture. The basic principle of the approach is in agreement with early works on texture discrimination [18], that spotted statistical constraints as a fruitful tool to investigate the human perception of textures.

In 1995, Heeger and Bergen [16] also proposed to rely on statistical constraints to perform texture synthesis. Starting from an exemplar image, a new image is generated by imposing the marginal distributions of wavelet coefficients, separately at each scale and orientation, as well as the image color distribution. The resulting method, although limited to relatively weakly structured textures, has the ability to produce realistic results in a computationally efficient manner. Later, it was proposed in [33] to impose some second order statistics on wavelet coefficients. More precisely, the constraints are based on the correlations between neighboring coefficients, either at a given scale or across scales. This allows for a better preservation of structured textures. To the best of our knowledge, this relatively old method is still the one able to synthesize the largest class of textures without simply copy-pasting pieces from an exemplar.

In a different direction, one may synthesize a texture by imposing constraints on its Fourier transform. In [14], it is shown that an efficient synthesis method is obtained by imposing the spectrum of synthesized images, through random phase textures. The resulting synthesis is fast and reliable, but limited to non-structured textures with small-scale oscillations. Several works have extended this approach, either simplifying the reference from which the synthesis is performed [8, 15] or developing video-texture synthesis methods [41].

In the field of computer graphics, procedural noises are widely used to give a realistic aspect to materials. Contrary to other methods, a procedural noise is generated on-the-fly at any point of \(\mathbb {R}^n\), and thus has a very low memory requirement and can be evaluated at any scale and resolution. The Perlin noise [29] is the most classical example of such a noise. It is obtained as a linear combination of colored noises at different scales, resulting in a multi-scale noise whose spectrum is controlled by the weights of the combination. Sparse convolutive noise is another class of procedural noise, defined as the convolution of a compact kernel with a sparse convolution map. The Gabor noise is a particular case, using Gabor kernels: the distribution of scales and frequencies of the kernels can be either imposed by hand [22] or learned from a sample texture [15]. The state-of-the-art report [21] provides a complete review of procedural noise functions.

1.2.3 Variational Formulation for Texture Synthesis

In this work, we advocate a generic variational approach to texture synthesis, as presented in detail in Sect. 2. This is not the first such approach in the context of texture manipulations. Some variational approaches define a function to measure the similarity between the patches of two images. This similarity measure is used in [20] to optimize an image so that all its patches look similar to a texture given as an example. Other variational approaches like [34] aim at transferring the texture of an image onto the content of another image. In this case, the functional to be minimized takes into account both the fidelity to the texture image and the fidelity to the content image.

1.3 Contributions of the Paper

In this paper, we introduce a generic texture synthesis method that is suited both to highly structured textures and to textures exhibiting small-scale oscillations.

First, we rely on an adaptive sparse dictionary for the synthesis, following the ideas presented in [30]. As in this original approach, we impose a sparsity constraint on the decomposition of the patches into a dictionary. In addition, we also impose that atoms from the dictionary are used in the same proportions as in the original texture samples. This allows for a faithful reproduction of the structured parts (edges, corners, etc.) of the input texture. Second, we impose spectrum constraints on the synthesis, in the manner of [14]. This allows for the reproduction of high frequency components of the texture, but also, more surprisingly, for the low frequency regularity of the texture, as demonstrated in the experimental section. We also impose a constraint on the global color content of the image, which has a strong visual impact on the results. In order to combine the strength of the adaptive patch dictionary to reproduce geometry and the fidelity of frequency reproduction offered by spectrum constraints, we rely on a variational approach. Each constraint is enforced by controlling the distance of the synthesis to a set of compatible images. The minimization of the corresponding energy is performed using an averaged projections method. The approach is compared to the more classical alternating projections method proposed in [33], and the benefit of the proposed constraints is demonstrated. The resulting synthesis approach achieves, to the best of our knowledge, the best results obtained so far that truly generate a new texture without performing copy-pasting of the input. Another interesting asset of the proposed approach is that it only relies on first order statistical constraints between atoms from a dictionary. On the contrary, methods relying on a fixed dictionary necessitate second order statistics to synthesize structured textures [33]. This somehow accredits the second psychophysical theory proposed by B. Julesz [18], stating that texture discrimination is related to first order statistics of textons (in the presented work, textons may be identified with atoms from the dictionary). Observe nevertheless that the synthesis also relies on second order statistics between pixel values (through the power spectrum), therefore establishing a link between Julesz's first [17] and second [18] theories of texture discrimination.

The plan of the paper is as follows. We define in Sect. 2 the energy function whose minimization drives our synthesis method and present in detail the three constraints we impose for the synthesis: on the histogram, on the spectrum, and on the sparse decomposition of image patches. We then propose in Sect. 3 an algorithm to find a critical point of this energy function. Numerical results are shown, parameters are discussed, and comparisons with the state of the art are provided in Sect. 4. Preliminary results of our work were presented in the conference paper [36]. More results can be found at:

http://perso.enst.fr/~gousseau/jmiv2014synthesis/,

and the Matlab code used for the experiments is publicly available at:

https://bitbucket.org/gtartavel/variational_synthesis/.

1.4 Notations

A matrix \(A=(A_j^i)_{i,j}\) is made of columns \(A_j\) and rows \(A^i\); its \(2\)-norm is \(\left\| A\right\| _{2,2}\) and its Frobenius norm is \(\left\| A\right\| _F\). The \(\ell ^0\) pseudo-norm of a vector \(b\) is \(\left\| b\right\| _0 = \# \{i \,\,\big \backslash \,\,b_i\ne 0\}\) and its \(\ell ^p\) norm is \(\left\| b\right\| _p\) for \(p>0\).

The indicator function \(\iota _{{\fancyscript{C}_{\mathrm {}}}}\) of a set \({\fancyscript{C}_{\mathrm {}}}\) is equal to \(0\) on \({\fancyscript{C}_{\mathrm {}}}\) and \(+\infty \) outside. The distance of a point \(x\in \mathbb {R}^m\) to a compact set \({\fancyscript{C}_{\mathrm {}}}\) is \(\fancyscript{D}(x,{\fancyscript{C}_{\mathrm {}}}) = \min _{y\in {\fancyscript{C}_{\mathrm {}}}} \left\| x-y\right\| _2\) and a projection \(\fancyscript{P}_{\fancyscript{C}_{\mathrm {}}}(x)\) of \(x\) on \({\fancyscript{C}_{\mathrm {}}}\) is a minimizer of this distance.

The orthogonal discrete Fourier transform of an image \(u\) defined on a rectangular domain \(\varOmega \subset \mathbb {Z}^2\) of size \(M_1\times M_2\) is the image \(\hat{u}\) made of the coefficients

$$\begin{aligned} \hat{u}(m) = \frac{1}{\sqrt{M_1 M_2}} \sum _{x\in \varOmega } u(x) \; \exp \left\{ -2\mathrm {i}\pi \big (\frac{x_1 m_1}{M_1} + \frac{x_2 m_2}{M_2}\big ) \right\} \!.\nonumber \\ \end{aligned}$$
(1)

The patches of size \(\tau \times \tau \) in \(u\) are defined as

$$\begin{aligned} p_k = \big ( u(x_k+t) \big )_t \quad \text {for}\quad t\in \left\{ 0,\dots ,\tau -1\right\} ^2. \end{aligned}$$
(2)

Their locations \(x_k\) lie on a regular grid \(x_k\in \varDelta \mathbb {Z}^2\) of step \(\varDelta \in \mathbb {N}^*\). We denote by \(\varPi (u) = \big (p_k\big )_k \in \mathbb {R}^{L\times K}\) the matrix, made of the \(K\) patches of \(u\), where \(L=d\tau ^2\) is their dimension, with \(d=1\) for gray images and \(d=3\) for color. The adjoint operator of \(\varPi \) is denoted by \(\varPi ^*\).

2 Variational Formulation

As explained above, we define an energy functional \(E\) to assess the quality of a synthesis. The synthesis process then consists in finding local minima with low values of this functional.

The functional \(E\) is designed to account for three constraints: one on the first order statistics of the use of atoms from an adaptive, sparse dictionary; one on the spectrum of images, that is, on second order statistics of pixel values; and one on color histograms, that is, on first order statistics of pixel values. This is achieved through the distances to 3 sets: the set \({\fancyscript{C}_{\mathrm {p}}}\) of patches being sparse in a dictionary \(D_0\) learned from \(u_0\) and whose atom frequencies match those of the decomposition of \(u_0\) in this dictionary, the set \({\fancyscript{C}_{\mathrm {s}}}\) of images whose Fourier power spectrum is the same as that of \(u_0\), and the set \({\fancyscript{C}_{\mathrm {h}}}\) of images whose color histogram is the same as that of \(u_0\).

We therefore define \(E(u)\) as

$$\begin{aligned} \frac{\alpha _p}{2}\fancyscript{D}^2(\varPi (u),{\fancyscript{C}_{\mathrm {p}}}) + \frac{\alpha _s}{2}\fancyscript{D}^2(u,{\fancyscript{C}_{\mathrm {s}}}) + \frac{\alpha _h}{2}\fancyscript{D}^2(u,{\fancyscript{C}_{\mathrm {h}}}), \end{aligned}$$
(3)

where \((\alpha _p,\alpha _s,\alpha _h)\) are weighting terms.

The adjoint operator \(\varPi ^*\), involved in the optimization process, propagates the constraints on the patches back to the image. The sets \(({\fancyscript{C}_{\mathrm {p}}},{\fancyscript{C}_{\mathrm {s}}},{\fancyscript{C}_{\mathrm {h}}})\) are detailed hereafter. The projection operators \(\fancyscript{P}_{\fancyscript{C}_{\mathrm {}}}\) on these sets are also defined and presented in the following paragraphs since they are involved in the iterative optimization process detailed in Sect. 3. We start with the most straightforward constraint, on the histogram, then proceed with the Fourier constraint, and finally present the most involved, dictionary-based constraint.

2.1 Histogram Constraint

The constraint \({\fancyscript{C}_{\mathrm {h}}}\) imposes that the histogram match the one of the exemplar. In other words, we want to preserve the gray-level or color distribution of the texture. This requirement is common in texture synthesis: it is used in [16] or [30] to ensure that the synthesized image has a dynamic range and color distribution similar to those of the original image.

2.1.1 Definition

We define the set of images whose color histogram is the same as the one of \(u_0\) as:

$$\begin{aligned} {\fancyscript{C}_{\mathrm {h}}}= u_0 \big (\varSigma _{\left| \varOmega \right| }\big ) = \left\{ u_0\circ \sigma \,\,\big \backslash \,\,\sigma \in \varSigma _{{\left| \varOmega \right| }} \right\} , \end{aligned}$$
(4)

where \(\varSigma _{{\left| \varOmega \right| }}\) is the set of permutations of \(\varOmega \). Two images have the same histogram if they only differ by a permutation of their pixels.

2.1.2 Projection

The projection on \({\fancyscript{C}_{\mathrm {h}}}\) is called “histogram transfer” [32]. In the case of gray-level images (\(d=1\)), the projection \(u_0\circ \sigma ^*\) of an image \(u\) on the set \({\fancyscript{C}_{\mathrm {h}}}\) is given by

$$\begin{aligned} \sigma ^*=\sigma _{u_0}\circ \sigma _u^{-1} \end{aligned}$$
(5)

where we denote by \(\sigma _v\) the permutation sorting the pixel values of an image \(v\):

$$\begin{aligned} v_{\sigma _v(0)} \le \dots \le v_{\sigma _v(i)} \le v_{\sigma _v(i+1)} \le \dots \end{aligned}$$

When \(u\) and \(u_0\) have different numbers of pixels, the sorted sequence of values \(u_0\circ \sigma _{u_0}\) is oversampled using a nearest-neighbor approximation. Note that a linear or cubic interpolation is not suitable since it would create new gray level values. This may create artifacts if the histogram to be interpolated has a gap (as, for example, with a mixture of Gaussians).
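As an illustration, the gray-level projection (5) can be sketched with a sorting-based transfer. This is our own minimal sketch: the function name and the nearest-neighbor resampling details are ours, not the paper's code.

```python
import numpy as np

def histogram_transfer(u, u0):
    """Project u onto the set of images sharing u0's gray-level histogram.

    Sorting-based sketch of the projection u0 ∘ σ* with
    σ* = σ_{u0} ∘ σ_u^{-1}; the sorted values of u0 are nearest-neighbor
    resampled when the two images have different sizes.
    """
    u = np.asarray(u, dtype=float)
    sorted_u0 = np.sort(np.asarray(u0, dtype=float).ravel())
    # Nearest-neighbor oversampling of the sorted exemplar values.
    idx = np.round(np.linspace(0, len(sorted_u0) - 1, u.size)).astype(int)
    target = sorted_u0[idx]
    # Rank of each pixel of u (the permutation sigma_u^{-1}).
    ranks = np.empty(u.size, dtype=int)
    ranks[np.argsort(u.ravel(), kind="stable")] = np.arange(u.size)
    return target[ranks].reshape(u.shape)
```

The output keeps the pixel ordering of \(u\) but takes its values from \(u_0\), so its histogram is exactly that of the exemplar.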

The color case is more involved: algorithms computing the exact optimal assignment (such as the Hungarian method of [19]) have roughly cubic complexity, and thus cannot be used for our problem. The authors of [5] propose to replace the multi-dimensional Wasserstein distance by a sum of 1D Wasserstein distances between projections of the distributions. We follow this approach but with only 3 fixed projection directions. For \(i=1\dots 3\), let \(c^{(i)}\in \mathbb {R}^3\) be the color directions obtained by a PCA on the RGB components of \(u_0\).

We denote respectively by \(u_0^{(i)}\) and \(u^{(i)}\) the components of \(u_0\) and \(u\) in this color basis. We replace the set \({\fancyscript{C}_{\mathrm {h}}}\) by \({\fancyscript{C}_{\mathrm {h}}}'\) by allowing a different permutation for each component:

$$\begin{aligned} {\fancyscript{C}_{\mathrm {h}}}' = \left\{ u \,\,\big \backslash \,\,\forall i, \exists \sigma _i\in \varSigma _{{\left| \varOmega \right| }}: u^{(i)} = u_0^{(i)}\circ \sigma _i\right\} . \end{aligned}$$
(6)

The projection is thus obtained by solving 3 independent histogram transfers between each \(u_0^{(i)}\) and \(u^{(i)}\).
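The whole color projection can be sketched as follows; this is our own illustration (function name and details assumed), combining the PCA change of basis with three independent 1D sorting-based transfers, i.e. a projection on the relaxed set \({\fancyscript{C}_{\mathrm {h}}}'\).

```python
import numpy as np

def pca_color_transfer(u, u0):
    """Approximate color histogram transfer via 3 fixed PCA axes of u0.

    u, u0: arrays of shape (H, W, 3). A 1D sorting-based transfer is
    applied independently along each principal component of u0's colors.
    """
    X0 = u0.reshape(-1, 3).astype(float)
    X = u.reshape(-1, 3).astype(float)
    mean0 = X0.mean(axis=0)
    # Principal axes of the exemplar's RGB point cloud.
    _, _, Vt = np.linalg.svd(X0 - mean0, full_matrices=False)
    Y0, Y = (X0 - mean0) @ Vt.T, (X - mean0) @ Vt.T
    for i in range(3):  # independent 1D histogram transfer per axis
        src = np.sort(Y0[:, i])
        idx = np.round(np.linspace(0, len(src) - 1, len(Y))).astype(int)
        ranks = np.empty(len(Y), dtype=int)
        ranks[np.argsort(Y[:, i], kind="stable")] = np.arange(len(Y))
        Y[:, i] = src[idx][ranks]
    return (Y @ Vt + mean0).reshape(u.shape)
```

After the transfer, each PCA component of the result has exactly the marginal distribution of the corresponding component of \(u_0\).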

2.2 Fourier Spectrum Constraint

The Fourier transform of the empirical covariance of the image \(u_0\) is \({\left| \hat{u}_0\right| }^2\), where \(\hat{u}_0\) denotes the discrete Fourier transform of \(u_0\), defined in (1). The Fourier spectrum thus encodes the correlations between the pixels in the image. As explained in the introduction, these correlations are an important characteristic of textures [14, 18], which motivates their inclusion in the functional (3).

2.2.1 Definition

The set of images whose spectrum is the same as \(u_0\) is

$$\begin{aligned} {\fancyscript{C}_{\mathrm {s}}}= \left\{ u\in \mathbb {R}^{\varOmega \times d} \,\,\big \backslash \,\,\forall m, \exists \varphi (m): \hat{u}(m) = e^{\mathrm {i}\varphi (m)} \hat{u}_0(m) \right\} \nonumber \\ \end{aligned}$$
(7)

where the multiplication by \(e^{\mathrm {i}\varphi (m)}\) preserves the amplitude but changes by \(\varphi (m)\) the phase of the coefficient \(\hat{u}_0(m)\in \mathbb {C}^d\). Note that \(\varphi \) must be antisymmetric modulo \(2\pi \) since \(u\) and \(u_0\) are real images. In the gray-level case (\(d=1\)), the condition can be rewritten as \({\left| \hat{u}(m)\right| } = {\left| \hat{u}_0(m)\right| }\). In the color case (\(d=3\)), it is important for color coherency that the multiplication of \(\hat{u}_0(m) \in \mathbb {C}^3\) by \(e^{i\varphi (m)}\) preserves the phase differences between the R, G, and B channels [14]. Therefore, texture synthesis is performed in [14] by preserving the spectrum of \(u_0\) while generating the phases \(\varphi (m)\) randomly, which is equivalent to drawing an element of \({\fancyscript{C}_{\mathrm {s}}}\) at random.

To reduce the boundary artifacts caused by the inherent periodicity assumption of the discrete Fourier transform, we use the periodic plus smooth decomposition of [27], so that the image \(u_0\) is decomposed as the sum of a smooth image and a (circularly) periodic image. We drop the smooth component and only keep the periodic one, which is well suited to the discrete Fourier transform.

2.2.2 Projection

The projection of an image \(u\) on the set \({\fancyscript{C}_{\mathrm {s}}}\) consists in combining the modulus of \(\hat{u}_0\) with the phases of \(\hat{u}\). The solution, as detailed in the Appendix, is the image \(u_s\) defined by

$$\begin{aligned} \hat{u}_s(m)=\frac{\hat{u}(m)\cdot \hat{u}_0(m)}{{\left| \hat{u}(m)\cdot \hat{u}_0(m)\right| }}\hat{u}_0(m) \quad \forall m \end{aligned}$$
(8)

where \(x\cdot y=xy^*\) denotes the hermitian product of \(\mathbb {C}^d\) (\(d=1\) for gray level images and \(d=3\) for color images). The case where \(u\) and \(u_0\) do not have the same dimensions is handled by windowing and padding \(u_0\), as presented in [14].
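A minimal gray-level sketch of this projection (8), using NumPy's FFT; the function name is ours, and the two images are assumed to have the same size.

```python
import numpy as np

def spectrum_projection(u, u0):
    """Project u onto the set of images having the Fourier spectrum of u0.

    Gray-level sketch of Eq. (8): each Fourier coefficient of the output
    keeps the modulus of u0_hat and the phase of u_hat.
    """
    U, U0 = np.fft.fft2(u), np.fft.fft2(u0)
    dot = U * np.conj(U0)                      # hermitian product u_hat · u0_hat
    mod = np.abs(dot)
    # Unit-modulus phase factor; arbitrary (=1) where the product vanishes.
    phase = np.where(mod > 0, dot / np.maximum(mod, 1e-12), 1.0)
    # phase * U0 is conjugate-symmetric, so the inverse FFT is real.
    return np.real(np.fft.ifft2(phase * U0))
```

By construction the output has the modulus of \(\hat{u}_0\) and the phases of \(\hat{u}\) at every frequency where the hermitian product is non-zero.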

Remark

As shown in the experimental section (Sect. 4), the spectrum constraint \({\fancyscript{C}_{\mathrm {s}}}\) compensates for a drawback of the patch sparsity constraint \({\fancyscript{C}_{\mathrm {p}}}\) described in the following paragraph: patches cannot handle the low frequencies of the image because they are too small to capture them. Patches also attenuate high frequencies because the sparse decomposition, based on the \(\ell ^2\) norm, promotes smooth approximations. The spectrum constraint is thus a good candidate to go along with the patch constraint.

2.3 Sparse Decomposition of the Patches

Decomposing the patches \(\varPi (u)\) of an image \(u\) as a sparse linear combination \(D_0 W\) of elements in an adaptive dictionary \(D_0\) was first proposed in [30] in the context of texture synthesis. As explained in the introduction, this framework enables us to efficiently synthesize highly structured aspects of textures, such as edges or regular patterns.

The set of patches \(\varPi (u_0)\) of \(u_0\) is first factorized into a dictionary \(D_0\) and a sparse matrix \(W_0\) of coefficients. In this first stage, both the dictionary and the coefficients are unknown: the K-SVD algorithm [2] computes the two matrices \(D_0\) and \(W_0\). Then, during the synthesis, any set of patches \(P\) is decomposed into this fixed dictionary \(D_0\), leading to a weight matrix \(W\). The weights of this decomposition must satisfy two constraints: each patch can use only a small number of atoms from \(D_0\), and each atom of \(D_0\) must not be used more often in \(W\) than it is (proportionally) in \(W_0\).

2.3.1 Learning Stage

The learning stage consists in building an over-complete dictionary \(D_0\) from the exemplar \(u_0\). This stage is performed only once before the synthesis process.

A dictionary \(D_0\) of \(N\) atoms is obtained by minimizing

$$\begin{aligned} (D_0,W_0)&= \mathop {{{\mathrm{arg\,min}}}}\limits _{D,W} \left\| \varPi (u_0) - DW\right\| _F^2 \nonumber \\&\quad \text {s.t.} \quad {\left\{ \begin{array}{ll} \left\| D_n\right\| _2 \le 1 &{} \forall n=1\dots N, \\ \left\| W_k\right\| _0 \le S &{} \forall k. \end{array}\right. } \end{aligned}$$
(9)

This non-convex combinatorial problem is NP-hard [37]. An approximate solution can be computed using the K-SVD algorithm [2], for instance, or the MOD algorithm [12].

The number \(N\) of elements of the dictionary must be chosen so that \(D_0\) has some redundancy. We choose, for instance, \(N\) to be approximately equal to \(2\tau ^2\) to have a redundancy factor of 2.

The sparsity constraint \(\left\| W_k\right\| _0\le S\) imposes that each patch \(p_k\) be approximated using at most \(S\) atoms from the dictionary. This constraint forces the dictionary to represent the patterns and features of the texture, as shown in [30].
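A toy version of this learning stage can be sketched with the MOD algorithm [12], which alternates a greedy sparse coding pass with a least-squares dictionary update; the paper uses K-SVD, which differs only in the update step. All names and sizes below are our own illustration.

```python
import numpy as np

def mod_learn(P, N, S, n_iter=10, seed=0):
    """Toy dictionary learning for problem (9) via MOD:
    alternate OMP-like sparse coding and the update D = P W^+.

    P: (L, K) patch matrix; N: number of atoms; S: sparsity.
    """
    rng = np.random.default_rng(seed)
    # Initialize the dictionary with random patches, normalized.
    D = P[:, rng.choice(P.shape[1], N, replace=False)].astype(float)
    D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    W = np.zeros((N, P.shape[1]))
    for _ in range(n_iter):
        W = np.zeros((N, P.shape[1]))
        for k in range(P.shape[1]):        # sparse coding, patch by patch
            r, idx = P[:, k].astype(float), []
            for _ in range(S):
                idx.append(int(np.argmax(np.abs(D.T @ r))))
                w, *_ = np.linalg.lstsq(D[:, idx], P[:, k], rcond=None)
                r = P[:, k] - D[:, idx] @ w
            W[idx, k] = w
        D = P @ np.linalg.pinv(W)          # MOD update: D = P W^+
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D, W                            # note: W predates the last D update
```

With \(N\approx 2\tau^2\) atoms, as suggested in the text, the learned dictionary has a redundancy factor of about 2.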

2.3.2 Definition of the Constraint

Once the dictionary \(D_0\) is learned, we control both the sparsity of the decomposition and the number of occurrences of the atoms. As explained in the introduction, this is an algorithmic interpretation of the second Julesz principle, stating that first order statistics of textons (in the present context, of atoms) are important for the visual discrimination of textures. Precisely, the second constraint is that each atom \(d_n\) should be used at most \(F^n\) times. The bound

$$\begin{aligned} F^n = \frac{K}{K_0}\Vert W_0^n\Vert _0 \end{aligned}$$
(10)

is learned from the decomposition \(W_0\) of the exemplar \(u_0\) and is normalized according to the numbers \(K\) and \(K_0\) of patches in \(u\) and \(u_0\) respectively (the exemplar and the synthesis may have different dimensions).

The resulting set of image patches satisfying these constraints is

$$\begin{aligned} {\fancyscript{C}_{\mathrm {p}}}= \left\{ D_0 W \,\,\big \backslash \,\,\left\| W_k\right\| _0 \le S \text { and } \left\| W^n\right\| _0 \le F^n \quad \forall k,n\right\} .\nonumber \\ \end{aligned}$$
(11)

Observe that this constraint results in a constraint on the number of non-zero coefficients both on the rows and the columns of the weight matrix \(W\).

2.3.3 Projection

Computing the projection \(D_0 W\) on \({\fancyscript{C}_{\mathrm {p}}}\) is a combinatorial problem quite similar to (9), which is known to be NP-hard [37]. We approximate a solution of this problem using the greedy Algorithm 1, inspired by the Matching Pursuit (MP) algorithm [26]. The result is then improved using a back-projection step, as detailed hereafter.

(Algorithm 1: greedy projection on \({\fancyscript{C}_{\mathrm {p}}}\).)

2.3.4 Algorithm Details

The algorithm is iterative. The coefficients \(W_k^n\) are updated one by one until the constraints of \({\fancyscript{C}_{\mathrm {p}}}\) detailed in (11) are saturated. At each step, the choice of the patch/atom pair \((k,n)\) to be updated is optimal. The non-zero coefficients of the resulting weight matrix \(W\) are then refined during the back-projection step.

Greedy algorithm. We denote by \(E_{n,k}\) the elementary matrix whose only non-zero coefficient is \(1\) at position \((n,k)\). At step \(\ell \) of Algorithm 1, the current estimation of \(W\) is denoted by \(W^{\left( \ell \right) }\). Both a patch index \(k^*\) and an atom index \(n^*\) are chosen according to

$$\begin{aligned} (k^*,n^*,w^*) = \mathop {{{\mathrm{arg\,min}}}}\limits _{k,n,w} \left\| P-D_0 \big (W^{\left( \ell \right) }+ w E_{n,k}\big )\right\| ^2 \end{aligned}$$
(12)

under the constraint

$$\begin{aligned} W^{\left( \ell \right) }+ wE_{n,k} \in {\fancyscript{C}_{\mathrm {p}}}. \end{aligned}$$
(13)

The coefficient \((k^*,n^*)\) of \(W^{\left( \ell \right) }\) is updated while the others are left unchanged:

$$\begin{aligned} W^{\left( \ell +1\right) }= W^{\left( \ell \right) }+ w^* E_{n^*,k^*}. \end{aligned}$$
(14)

As shown in the Appendix, the solution of (12) is

$$\begin{aligned}&(k^*,n^*) =\mathop {{{\mathrm{arg\,max}}}}\limits _{(k,n)\in \fancyscript{I}_{W^{\left( \ell \right) }}} {\left| \langle R_k^{\left( \ell \right) },D_n \rangle \right| }\end{aligned}$$
(15)
$$\begin{aligned}&w^* = \langle R_{k^*}^{\left( \ell \right) },D_{n^*} \rangle \end{aligned}$$
(16)

where \(R_k^{\left( \ell \right) }\) is the \(k^\text {th}\) column of the residual \(R^{\left( \ell \right) }\) defined at step \(\ell \) by

$$\begin{aligned} R^{\left( \ell \right) }= P - D_0 W^{\left( \ell \right) }, \end{aligned}$$
(17)

and the set \(\fancyscript{I}_{W^{\left( \ell \right) }}\) of available indices is

$$\begin{aligned} \fancyscript{I}_W=\left\{ (k,n)\,\,\big \backslash \,\,\left\| W_k^{\sim n}\right\| _0 <S \quad \text {and}\quad \left\| W_{\sim k}^n\right\| _0 <F^n \right\} ,\nonumber \\ \end{aligned}$$
(18)

where we denote \(W_k^{\sim n} = \big (W_k^{n'}\big )_{n'\ne n}\) and \(W_{\sim k}^n = \big (W_{k'}^n\big )_{k'\ne k}\). In the very particular case where \(D_0\) is orthogonal, this algorithm converges in at most \(KS\) iterations because the selected pairs \((k^*,n^*)\) are different at each iteration: the \(K\) constraints \(\left\| W_k\right\| _0 \le S\) are saturated after \(KS\) iterations. In general, the algorithm does not converge in a finite number of iterations; we therefore stop after \(\lambda KS\) iterations, with \(\lambda =1.5\). The reader may refer to [26] for further convergence properties.

Efficient computation. To save computation time, residuals \(R^{\left( \ell \right) }\) and inner products \(\varPhi ^{\left( \ell \right) }= \big ( \langle R_k^{\left( \ell \right) },D_n \rangle \big )_{n,k}\) can be pre-computed and updated by

$$\begin{aligned} R^{\left( \ell +1\right) }&= R^{\left( \ell \right) }- w^* D_0 E_{n^*,k^*} \end{aligned}$$
(19)
$$\begin{aligned} \varPhi ^{\left( \ell +1\right) }&= \varPhi ^{\left( \ell \right) }- w^* D_0^T D_0 E_{n^*,k^*}. \end{aligned}$$
(20)

Using a max-heap search for (15), precomputing \(D_0^T D_0\), and assuming that \(S\ll L\le N\ll K\), the time complexity of this algorithm is \(\fancyscript{O}\big (KN(L+\lambda S\log K)\big )\), as shown in the Appendix. Note as a reference that the computation of all the inner products \(\langle P_k,D_n \rangle \) already requires \(\varTheta (KNL)\) operations.
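The greedy projection can be sketched as follows. This is our own simplified illustration: no max-heap, the correlations of one patch are recomputed after each update, and the saturation test is slightly coarser than (18) (a coefficient is only considered while the total counts of its row and column are below their bounds).

```python
import numpy as np

def constrained_mp(P, D, S, F, n_steps=None):
    """Matching-pursuit-like greedy projection on C_p: at most S atoms
    per patch and at most F[n] uses of atom n.

    P: (L, K) patches; D: (L, N) dictionary; returns weights W (N, K).
    """
    K = P.shape[1]
    N = D.shape[1]
    W = np.zeros((N, K))
    R = P.astype(float).copy()           # residual R = P - D W
    G = np.abs(D.T @ R)                  # correlations |<R_k, D_n>|
    steps = n_steps or int(1.5 * K * S)  # lambda = 1.5 as in the text
    for _ in range(steps):
        # Mask out saturated patches and atoms (cf. the index set of Eq. 18).
        ok = ((np.count_nonzero(W, axis=0) < S)[None, :]
              & (np.count_nonzero(W, axis=1) < F)[:, None])
        if not ok.any():
            break
        masked = np.where(ok, G, -1.0)
        n, k = np.unravel_index(np.argmax(masked), masked.shape)
        if masked[n, k] <= 0:
            break
        w = float(D[:, n] @ R[:, k])     # optimal coefficient (Eq. 16)
        W[n, k] += w
        R[:, k] -= w * D[:, n]
        G[:, k] = np.abs(D.T @ R[:, k])  # refresh the affected column
    return W
```

For an orthogonal dictionary this stops after at most \(KS\) updates, in line with the convergence remark above.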

Back-projection. The iterative greedy algorithm described above provides weights \(\tilde{W}\) decomposing \(P\) as \(D_0 \tilde{W}\in {\fancyscript{C}_{\mathrm {p}}}\). A drawback of the greedy approach is that the weights are estimated one by one, so the approximation error may be quite large. The back-projection step, as introduced in [26], improves this solution by refining all the non-zero coefficients of \(\tilde{W}\) while still satisfying the constraints of \({\fancyscript{C}_{\mathrm {p}}}\).

The support of \(\tilde{W}_k\) is denoted by

$$\begin{aligned} I_k=\left\{ n \,\,\big \backslash \,\,\tilde{W}_k^n\ne 0\right\} . \end{aligned}$$
(21)

Its cardinality satisfies \(\#(I_k)\le S\) since \(D_0\tilde{W}\in {\fancyscript{C}_{\mathrm {p}}}\). The back-projection step consists in computing the projection of \(P_k\) on \({{\mathrm{Span}}}\big ((D_n)_{n\in I_k}\big )\). This is done by solving a linear system of dimension at most \(S\): the back-projected weights \(W_k\) are

$$\begin{aligned} W_k^{I_k}=\mathop {{{\mathrm{arg\,min}}}}\limits _w \left\| P_k - D_{I_k} w\right\| _F^2 \end{aligned}$$
(22)

and \(0\) elsewhere, where \(W_k^I = \big (W_k^n\big )_{n\in I}\) and \(D_I = \big (D_n\big )_{n\in I}\).
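Per patch, the back-projection (22) reduces to a small least-squares solve on the support; a sketch with assumed names:

```python
import numpy as np

def back_projection(P, D, W):
    """Refine the non-zero weights of each column of W by least squares
    on its support (Eq. 22); supports, and hence the constraints of C_p,
    are left unchanged (up to coefficients that happen to vanish)."""
    W = W.copy()
    for k in range(P.shape[1]):
        I = np.flatnonzero(W[:, k])          # support I_k of patch k
        if len(I):
            # Projection of P_k on Span((D_n)_{n in I_k}).
            W[I, k], *_ = np.linalg.lstsq(D[:, I], P[:, k], rcond=None)
    return W
```

Since each column is re-fitted optimally over its own support, the residual \(\Vert P - D W\Vert\) can only decrease.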

Patch extraction and image reconstruction. The patch extracting operator \(\varPi \) involved in the function (3) extracts the patches \(P=\varPi (u)\) from any image \(u\).

Its adjoint operator \(\varPi ^*\) is involved in the optimization process described in Sect. 3. The effect of \(\varPi ^*\) on a set of patches \(P\) is to merge all of them into an image \(\tilde{u}=\varPi ^*(P)\) by summing the overlapping parts of the patches.

In our previous work [36], we replaced this operator by a non-linear operator \(\varPi _\text {NL}\), because the linear operator \(\varPi ^*\) introduced some blur when summing the overlapping parts of the patches. In the current paper, the spectrum constraint \({\fancyscript{C}_{\mathrm {s}}}\) ensures that the synthesized image is sharp enough. As a result, we no longer need the non-linear operator. During the optimization process in Sect. 3, we use the true adjoint operator \(\varPi ^*\) involved in the gradient of the functional (3).
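The pair \(\varPi\) / \(\varPi^*\) can be sketched for gray-level images as follows; the naming is ours and the loop-based version is not optimized.

```python
import numpy as np

def extract_patches(u, tau, delta):
    """Patch operator Pi: extract tau x tau patches of a gray image u on
    a grid of step delta; returns an (L, K) matrix with L = tau*tau."""
    H, Wd = u.shape
    cols = []
    for x in range(0, H - tau + 1, delta):
        for y in range(0, Wd - tau + 1, delta):
            cols.append(u[x:x + tau, y:y + tau].ravel())
    return np.stack(cols, axis=1)

def merge_patches(P, shape, tau, delta):
    """Adjoint operator Pi*: paste each patch back at its location,
    summing the overlapping parts."""
    u = np.zeros(shape)
    H, Wd = shape
    k = 0
    for x in range(0, H - tau + 1, delta):
        for y in range(0, Wd - tau + 1, delta):
            u[x:x + tau, y:y + tau] += P[:, k].reshape(tau, tau)
            k += 1
    return u
```

The adjoint relation \(\langle \varPi(u), P\rangle = \langle u, \varPi^*(P)\rangle\) holds for any \(u\) and \(P\), which is what the gradient computation of Sect. 3 relies on.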

3 Optimization

In the previous section, we defined and justified the variational framework we propose for texture synthesis. The corresponding energy is expressed as a weighted sum of distances to constraint sets. For each of these sets, a projection operator has been defined. In this section, we present a general optimization procedure suited to such an energy, resulting in the texture synthesis Algorithm 2 and its multi-scale version Algorithm 3. Before proceeding, let us observe that the presented algorithm only yields local and non-unique solutions to the proposed non-convex variational problem. From an optimization point of view, this is a drawback. But from a synthesis point of view, this is good news: several (local) minimizers correspond to as many possible texture syntheses.

3.1 General Formulation

The penalty function \(E\) defined by (3) can be rewritten as

$$\begin{aligned} E(u) = \frac{1}{2} \sum _i \alpha _i \fancyscript{D}^2(A_i u,{{{\fancyscript{C}_{\mathrm {}}}}_i}) \end{aligned}$$
(23)

where the indices \(i\) of the sum are \((\mathrm {p},\mathrm {s},\mathrm {h})\), the linear operators \(A_i\) are \((\varPi ,{{\mathrm{Id}}},{{\mathrm{Id}}})\), and \({\fancyscript{C}_{\mathrm {}}}_i\) and \(\alpha _i\) are respectively the constraints and the weighting terms defined in Sect. 2.

This function measures how close some linear measurements \(A_i u\) are to the subsets \({{{\fancyscript{C}_{\mathrm {}}}}_i}\). We want these distances to be as small as possible and therefore look for a local minimizer of the energy \(E\).

3.2 Minimization

We use a gradient descent scheme to minimize the energy \(E\). As explained in the following paragraph, there is no theoretical guarantee about the convergence of this algorithm. However, local properties of the gradient make it favorable for the cluster points of the algorithm to be critical points of the functional \(E\), as desired.

Let \(\fancyscript{K}_i\) be the cut locus of \({{{\fancyscript{C}_{\mathrm {}}}}_i}\), that is, the set of points at which the distance \(\fancyscript{D}_{{{\fancyscript{C}_{\mathrm {}}}}_i}\) is not differentiable. Outside this set, the projection \(\fancyscript{P}_{{{\fancyscript{C}_{\mathrm {}}}}_i}\) is uniquely defined. Let \(\fancyscript{K}= \bigcup _i A_i^{-1}\big (\fancyscript{K}_i\big ) \) be the union of their reciprocal images. At any \(u\notin \fancyscript{K}\), the functional \(E\) is \(\fancyscript{C}^1\). Its gradient is

$$\begin{aligned} \nabla E(u) = \sum _i \alpha _i A_i^* \big (A_i u - \fancyscript{P}_{{{\fancyscript{C}_{\mathrm {}}}}_i}(A_i u)\big ), \end{aligned}$$
(24)

where the projectors \(\fancyscript{P}_{{{\fancyscript{C}_{\mathrm {}}}}_i}\) are given in Sect. 2. By contrast, for \(u\in \fancyscript{K}\), the projections and thus the gradient are not uniquely defined; however, any projection provides a descent direction for such \(u\). Observe also that \(\fancyscript{K}\) is a measure-zero set.

In order to find a local minimum of (23), we perform a gradient descent from a random point \(u^{(0)}\). The resulting sequence is defined by

$$\begin{aligned} u^{\left( \ell +1\right) }= u^{\left( \ell \right) }- \tau \nabla E(u^{\left( \ell \right) }) \end{aligned}$$
(25)

where \(\nabla E\) is the gradient (24) of \(E\), resulting in Algorithm 2. This gradient is neither globally nor locally Lipschitz, mainly because of the spectrum constraint \({\fancyscript{C}_{\mathrm {s}}}\). The same problem is encountered in phase retrieval problems [4]. In order for the gradient descent to converge, the step size should decrease to \(0\) or be chosen adaptively by line-search. In practice, we choose a constant step size \(\tau = 1/c\) with \(c = \sum _i \alpha _i \left\| A_i^* A_i\right\| _{2,2}\) and motivate this choice in the next paragraph. Even if there is no theoretical convergence guarantee, numerical experiments show that the gradient descent converges in practice to a stationary point of \(E\) (see Sect. 4).
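A single step of the descent (25) can be sketched as follows. This is a minimal illustration, not the actual implementation: the tuples passed in are placeholders for the \((\alpha _i, A_i, A_i^*, \fancyscript{P}_{{{\fancyscript{C}_{\mathrm {}}}}_i})\) of Sect. 2.

```python
import numpy as np

def gradient_step(u, operators, tau_step):
    """One step u <- u - tau * grad E(u), with grad E given by (24).

    `operators` is a list of tuples (alpha_i, A_i, A_i_adjoint, proj_i),
    where proj_i is the projection on the constraint set C_i.
    """
    grad = np.zeros_like(u)
    for alpha, A, A_adj, proj in operators:
        v = A(u)
        grad += alpha * A_adj(v - proj(v))  # alpha_i A_i^*(A_i u - P_{C_i}(A_i u))
    return u - tau_step * grad
```

With the constant step \(\tau = 1/c\) above, such a step reduces to the averaged projections of Sect. 3.3 when \(\varPi ^*\varPi = Z\cdot {{\mathrm{Id}}}\).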


3.3 Averaged Projections

This paragraph points out that the averaged projection algorithm is a particular case of the proposed gradient descent scheme.

We consider the case of a perfect tiling, i.e. when every pixel belongs to exactly \(Z\) patches, where \(Z \triangleq \lceil \tau /\varDelta \rceil ^2 = \left\| \varPi ^*\varPi \right\| _{2,2}\); this holds when the ratio \(\tau /\varDelta \in \mathbb {N}^*\) of patch size over patch spacing is an integer. In this case, \(\varPi ^*\varPi = Z\cdot {{\mathrm{Id}}}\), and the gradient descent step (25) becomes

$$\begin{aligned} u^{\left( \ell +1\right) }= c^{-1} \sum _i \alpha _i A_i^* \fancyscript{P}_{{{\fancyscript{C}_{\mathrm {}}}}_i}(A_i u^{\left( \ell \right) }), \end{aligned}$$
(26)

where the constant \(c\) previously defined is simply \(c = \tilde{\alpha }_p+ \alpha _s+ \alpha _h\) with \(\tilde{\alpha }_p=\alpha _pZ\). Relation (26) is a step of averaged projections: the transforms \(A_i u^{\left( \ell \right) }\) are projected on the respective sets \({{{\fancyscript{C}_{\mathrm {}}}}_i}\) and are then averaged together.

A cluster point \(\tilde{u}\) of the iterates (26) is a stationary point of the energy (23) and satisfies

$$\begin{aligned} \tilde{u} = c^{-1} \sum _i \alpha _i A_i^* \fancyscript{P}_{{{\fancyscript{C}_{\mathrm {}}}}_i}(A_i \tilde{u}). \end{aligned}$$
(27)

This provides a geometrical interpretation of the solutions: each solution is the barycenter of the projections of its transforms on the sets \({{{\fancyscript{C}_{\mathrm {}}}}_i}\) weighted by \(\alpha _i\).

Alternated projections. Instead of using averaged projections (26), it is possible to use alternated projections, which gives the following iterations:

$$\begin{aligned} u^{(3\ell +1)}&= Z^{-1} \varPi ^* \fancyscript{P}_{\fancyscript{C}_{\mathrm {p}}}(\varPi u^{(3\ell )}), \\ u^{(3\ell +2)}&= \fancyscript{P}_{\fancyscript{C}_{\mathrm {s}}}(u^{(3\ell +1)}), \\ u^{(3\ell +3)}&= \fancyscript{P}_{\fancyscript{C}_{\mathrm {h}}}(u^{(3\ell +2)}), \end{aligned}$$

in the case of \(\tilde{\alpha }_p=\alpha _s=\alpha _h=1\). The convergence properties of both averaged and alternated non-convex projections on smooth manifolds are analyzed in [24].
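One round of these alternated projections may be sketched as follows, with placeholder projectors: `extract` and `merge` stand for \(\varPi \) and \(\varPi ^*\), `Z` for the overlap count, and the three `proj_*` arguments for \(\fancyscript{P}_{\fancyscript{C}_{\mathrm {p}}}\), \(\fancyscript{P}_{\fancyscript{C}_{\mathrm {s}}}\), and \(\fancyscript{P}_{\fancyscript{C}_{\mathrm {h}}}\).

```python
import numpy as np

def alternated_round(u, extract, merge, Z, proj_patch, proj_spectrum, proj_hist):
    """One round of alternated projections (case where all weights equal 1)."""
    u = merge(proj_patch(extract(u))) / Z  # project the patches, re-merge the image
    u = proj_spectrum(u)                   # impose the Fourier modulus
    u = proj_hist(u)                       # impose the gray-level histogram
    return u
```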

Observe however that this scheme cannot converge, since the sets \({{{\fancyscript{C}_{\mathrm {}}}}_i}\) are distinct in general. It leads to \(3\) cluster points, namely one point in each set \({{{\fancyscript{C}_{\mathrm {}}}}_i}\). Recall that averaged projections lead to a single cluster point \(\tilde{u}\) (27), which is a compromise between the \(3\) constraints \({{{\fancyscript{C}_{\mathrm {}}}}_i}\).

The experimental results in the next section (Fig. 11 in particular) show that the alternated projections algorithm is more likely to produce artifacts than the averaged projections algorithm. Note that several texture synthesis methods, like [16] and [33], are implicitly based on an alternated projections scheme. As explained in [16], it is better not to iterate alternated projections too many times because the results may suffer from artifacts.

3.4 Multi-scale Procedure

Since the energy (3) is non-convex, the choice of the initialization \(u^{(0)}\) has a strong influence on the result. In order to avoid visually unsatisfying local minima, we perform the synthesis through \(J\) scales using a classical multi-scale scheme summarized in Algorithm 3.

At scale \(j\in \left\{ J-1,\dots ,0\right\} \), the original image \(u_0\) is decimated by a factor \(2^j\) to give an image \(u_j\). The synthesis is performed using \(u_j\) as an exemplar and a dictionary \(D_j\) learned on it. The result \(u^{(\infty )}_j\) is oversampled by a factor \(2\) and used as initialization \(u^{(0)}_{j-1}\) for the next scale. The roughest scale is initialized with a uniform white noise \(u^{(0)}_{J-1}\). The resulting image is the synthesis \(u^{(\infty )}_0\) obtained at the finest scale.

Decimation and oversampling operations are performed as follows. The image to be decimated is first filtered by a \(2\times 2\) box kernel and then decimated by a factor \(2\). The oversampling step is performed with a bi-cubic interpolation. The shift of half a pixel is obtained using the cubic convolution kernel \(h=(-1,9,9,-1) / 16\).
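These two resampling steps admit a direct implementation. The following NumPy sketch (our naming, with periodic boundary handling as in Sect. 3.5) applies the \(2\times 2\) box filter before decimation and uses the kernel \(h=(-1,9,9,-1) / 16\) for the half-pixel samples of the oversampling.

```python
import numpy as np

def decimate(u):
    """Filter by a 2x2 box kernel (periodic), then keep one pixel out of two."""
    v = (u + np.roll(u, -1, axis=0) + np.roll(u, -1, axis=1)
         + np.roll(np.roll(u, -1, axis=0), -1, axis=1)) / 4.0
    return v[::2, ::2]

def upsample(u):
    """Oversample by 2; the half-pixel samples use h = (-1, 9, 9, -1) / 16 (periodic)."""
    h = np.array([-1.0, 9.0, 9.0, -1.0]) / 16.0

    def half_pixel(x, axis):
        # value between samples k and k+1: h applied to (x[k-1], x[k], x[k+1], x[k+2])
        shifted = [np.roll(x, 1 - i, axis=axis) for i in range(4)]
        return sum(c * s for c, s in zip(h, shifted))

    out = np.zeros((2 * u.shape[0], u.shape[1]))
    out[0::2] = u
    out[1::2] = half_pixel(u, 0)
    out2 = np.zeros((2 * u.shape[0], 2 * u.shape[1]))
    out2[:, 0::2] = out
    out2[:, 1::2] = half_pixel(out, 1)
    return out2
```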


3.5 Implementation Details

Periodicity of the synthesis. During the synthesis process, because of the spectrum constraint, the image is assumed to be periodic. The histogram constraint and the white noise initialization are consistent with this assumption since all pixels are considered independent. The oversampling step is made periodic by using a circular convolution. The patch processing is made consistent by defining the patch extracting operator \(\varPi :u\mapsto \varPi (u)\) on the periodic image. Some patches thus contain pixels both on the very left side and on the very right side of the image. As a result, the synthesized image is also periodic. It can be tiled or shifted circularly.

Spacing between patches. As defined in the notation section (Sect. 1.4), the patches are located on a grid of step \(\varDelta \). To avoid mosaic artifacts, the value of \(\varDelta \) must be as small as possible, \(1\) or \(2\) pixels for instance. But the smaller \(\varDelta \), the slower the algorithm. A higher value, for instance \(4\) pixels, provides a trade-off between synthesis quality and computation time.

Note that the offset of the grid of patches can be freely chosen (recall that the image is assumed periodic). Choosing a random offset for each iteration reduces the mosaic artifact caused by a large \(\varDelta \). In practice, a step size of \(\varDelta =4\) pixels with patches of size \(\tau =12\) pixels does not produce mosaic artifacts with this technique and increases the speed of the algorithm (a few minutes for a \(256\times 256\) px image).

Post processing. To increase the grain of the resulting texture, we add an extra projection step at the very end of the multi-scale synthesis process, resulting in the final Algorithm 3. The image \(u^{(\infty )}_0\) synthesized at the finest step is projected on the spectrum constraint and then on the histogram constraint. The final image is \(u^{(\infty )}= \fancyscript{P}_{\fancyscript{C}_{\mathrm {h}}}\circ \fancyscript{P}_{\fancyscript{C}_{\mathrm {s}}}\big ( u^{(\infty )}_0 \big )\).
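The two projections used in this post-processing are the standard ones for these constraints: \(\fancyscript{P}_{\fancyscript{C}_{\mathrm {s}}}\) replaces the Fourier modulus of the image by that of the exemplar while keeping the phase, and \(\fancyscript{P}_{\fancyscript{C}_{\mathrm {h}}}\) imposes the exemplar histogram by monotone rearrangement. A single-channel NumPy sketch (our naming, assuming the image and the exemplar have the same size):

```python
import numpy as np

def project_spectrum(u, u0):
    """P_{C_s}: impose the Fourier modulus of the exemplar u0, keep the phase of u."""
    U, U0 = np.fft.fft2(u), np.fft.fft2(u0)
    # unit-modulus phase of U (a safe default of 1.0 where the modulus vanishes)
    phase = np.where(np.abs(U) > 0, U / np.maximum(np.abs(U), 1e-12), 1.0)
    return np.real(np.fft.ifft2(np.abs(U0) * phase))

def project_histogram(u, u0):
    """P_{C_h}: impose the gray-level histogram of u0 by monotone rearrangement."""
    order = np.argsort(u, axis=None)   # rank of each pixel of u
    values = np.sort(u0, axis=None)    # sorted values of the exemplar
    out = np.empty(u.size)
    out[order] = values                # smallest pixel receives smallest value, etc.
    return out.reshape(u.shape)
```

The final projection of Algorithm 3 is then the composition `project_histogram(project_spectrum(u, u0), u0)`.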

4 Numerical Results

This section provides numerical results of our synthesis algorithm. We illustrate the influence of each term of the energy function (3) and of the parameters. We compare our results to the most classical synthesis methods relying on statistical constraints. We also compare the innovation capacity of the approach to that of exemplar-based methods.

When not explicitly specified, the extracted patches are of size \(\tau =12\) and are centered on a grid with a step size of \(\varDelta =4\) pixels. The synthesis is performed over \(J=3\) scales and the terms of the functional are given the same weight \(\alpha _h=\alpha _s=\tilde{\alpha }_p=1\). The dictionary is made of \(N=384\) atoms and the sparsity is set to \(S=4\) coefficients per patch.

Under these settings with images of size \(256\times 256\) using a quad-core CPU at \(2.3\) GHz, the preliminary learning stage performed by the SPAMS toolbox [25] takes roughly a minute, and the synthesis using our unoptimized Matlab code also takes about a minute. As a comparison, exhaustive patch-searching algorithms such as the one from [10] take several minutes, while fast improvements like those from [23] or [3] run in a couple of seconds.

Figure 1 shows several results of our algorithm for textures from the VisTex [31] database. The sand and water textures are well reproduced thanks to the spectrum constraints. The regularity of quasi-periodic textures is ensured by both the spectrum constraint and the multi-scale scheme. The sharpness of the edges and the geometrical patterns are handled by the sparse decomposition of the patches into the adaptive dictionary.

The last row of Fig. 1 shows difficult examples that illustrate the limitations and failure cases of our algorithm. Repetitions of small sharp objects, like the pills texture, cannot be represented by our approach: the sparsity of the patches cannot handle several objects in a patch, and the spectrum constraint is not able to create sharp objects. Mixtures of large-scale patterns and small details, like the tiles or the pumpkins, are also difficult to generate because of the patch constraint: small patches cannot deal with large structures, whereas large patches cannot handle small details because of the sparsity constraint.

Fig. 1

Result of our algorithm (big images) for several exemplars (small images). The last examples are the most difficult because they simultaneously exhibit many small sharp details and large geometrical structures

4.1 Comparison

4.1.1 Statistical Methods

We first show a comparison with other statistical methods that we now recall.

The approach of Heeger and Bergen [16] imposes the histogram of the image and also the distributions of the coefficients of a multi-scale decomposition [35], scale by scale. An alternated projection method is used: the constraints are imposed in turn. As explained in [16], this should not be iterated more than 4 or 5 times because of convergence issues. The decomposition being based on Gabor-like filters, this method is not adapted to synthesize edges or structures. This method may be tested online using the IPOL demo from [6].

The method from Portilla and Simoncelli [33] is the state of the art among statistical methods, to the best of our knowledge. Several constraints on the coefficients of the multi-scale decomposition [35] are imposed: mean, variance, and other moments. The main difference with [16] is that correlations are imposed between neighbor coefficients, whereas [16] only considers marginals at each scale and orientation. Some edges and sharp structures can be synthesized thanks to these dependencies. Let us emphasize that, in contrast, the method presented in this paper does not impose dependency between coefficients corresponding to different atoms in the dictionary. However, we use an adaptive dictionary learned on the exemplar, whereas [16, 33] use a non-adaptive pyramid transform.

In [14], synthesis is performed by preserving the amplitude of the Fourier spectrum and by randomly shuffling the phases of the Fourier Transform. This is strictly equivalent to generating a white noise and projecting it on our spectrum constraint \({\fancyscript{C}_{\mathrm {s}}}\). This approach is fast and is well adapted to smooth textures such as sand or a cloudy sky, but cannot produce edges. This method may be tested online using the IPOL demo from [13].

The algorithm of [30] is the first to apply sparse decompositions of patches to texture synthesis. A redundant dictionary is learned beforehand on an exemplar of the texture. The synthesis is then performed from a white noise image by alternating histogram transfer and sparse coding into the dictionary. This is equivalent to alternated projections on our sets \({\fancyscript{C}_{\mathrm {h}}}\) and \({\fancyscript{C}_{\mathrm {p}}}'\), where \({\fancyscript{C}_{\mathrm {p}}}'\) has no constraint on the number of uses \(F^n\) of the elements of the dictionary.

In Fig. 2, we show several synthesis examples using our approach and the statistical methods recalled in this section. As we can see, the proposed approach has the best ability to reproduce both structures and fine details of textures.

Fig. 2

We present synthesis results using our approach and other statistical methods. From left to right: our approach penalizes a deviation of histogram, spectrum, and sparse approximation of patches; [33] imposes statistics including correlations in a wavelet frame; [16] imposes the image histogram and wavelet coefficient distributions (scale by scale); [14] preserves only the spectrum modulus; [30] imposes the image histogram and patch sparsity

4.1.2 Copy-Paste Methods

Figure 3 shows two synthesis results obtained with the Efros and Leung algorithm [10], using the IPOL accelerated implementation [1]. The synthesis is performed by choosing and copying pixels one by one from the exemplar. The key idea of this algorithm may be encountered in a large number of more recent algorithms detailed in the state-of-the-art review [38], including [20] illustrated in Fig. 4.

Fig. 3

Results from the pixel-by-pixel copy-paste algorithm [10] (middle); the coordinate maps (bottom) show that the results are tilings from the exemplars (top), even if the synthesis is performed pixel after pixel

Fig. 4

Results from the more recent copy-paste algorithm [20] (center); as in Fig. 3, the coordinate map (right) shows that the result is a tiling from the exemplar (left)

Figure 3 also displays on the bottom the map of coordinates provided by the IPOL implementation [1]. Figure 4 shows a map of coordinates computed a posteriori from a result of [20]. This map represents the location in the exemplar of the synthesized pixels: pure black (resp. red, blue, white) means that the synthesized pixel comes from the top-left (resp. top-right, bottom-left, bottom-right) corner of the exemplar image. The structure of the synthesized image is clearly visible on these maps: the synthesis is made of a tiling from the exemplar even if pixels are copied one by one. The results of such copy-paste methods are often visually better than those obtained using statistical methods, but their innovation capacity is much lower. In contrast, results from our approach (see Fig. 1) include no repetition from the input image.

The next subsection provides experiments about the innovation properties of synthesis algorithms.

4.1.3 Innovation Capacity

We carried out the following experiment to demonstrate the innovation capacity of several texture synthesis algorithms including the proposed one.

Given an input image \(u_0\) and a corresponding synthesis \(u\) using any algorithm, we find the closest neighbor in \(u_0\) of each patch of \(u\). Figure 5 shows, from left to right: the synthesized images using different algorithms, the locations of the nearest neighbor of each \(3\times 3\) patch (using the color map described above), a zoom on the central part of this map, and the histogram of the distances from each patch to its nearest neighbor. The synthesized images are obtained using our algorithm, the random phase algorithm from [14], the statistical pyramid-based approach from [33], and the pixel-based approach from [10]. The random phase algorithm from [14] is a reference in terms of innovation capacity since it generates a new image at random from a Gaussian model.

The location maps (middle columns in Fig. 5) show that our algorithm and the one from [33] produce an image which is locally different from the input image, in the same way as the random phase texture obtained using [14]. On the contrary, the location map corresponding to [10] has many homogeneous regions (see the map and its zoom in Fig. 5): these regions show that the synthesized image is actually a tiling made from the input image. The distance histograms (right column in Fig. 5) show that the image synthesized using our approach has no \(3\times 3\) regions which are exactly the same as in the input image, and neither have the results using the algorithm of [33] or the random phase from [14]. By contrast, one third of the pixels synthesized by the method from [10] have exactly the same \(3\times 3\) neighborhood as in the input image, as shown by the first bin of the histogram: this bin has been made wider and shorter to fit on the page, and its area represents the proportion of distances that are almost zero.

The location maps and the distance histograms show that our algorithm and the one from [33] perform a genuine synthesis, while the pixel-based approach of [10] generates a texture which is a piecewise copy of the exemplar image.
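The analysis above can be reproduced with a brute-force nearest-neighbor search over patches. The following sketch (our naming, quadratic cost, for illustration only) returns the distance of each \(p\times p\) patch of the synthesis to its nearest patch in the exemplar.

```python
import numpy as np

def nn_patch_distances(u, u0, p=3):
    """For each p x p patch of the synthesis u, the distance to its nearest patch in u0."""
    def patches(x):
        n = x.shape[0] - p + 1
        return np.array([x[i:i+p, j:j+p].ravel()
                         for i in range(n) for j in range(n)])
    P, P0 = patches(u), patches(u0)
    # all pairwise squared distances, via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = np.sum(P**2, 1)[:, None] - 2 * P @ P0.T + np.sum(P0**2, 1)[None, :]
    return np.sqrt(np.maximum(d2.min(axis=1), 0.0))
```

A verbatim copy-paste synthesis yields a large spike at zero distance, i.e. the first histogram bin of Fig. 5, while a genuinely novel synthesis does not.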

Fig. 5

Analysis of the innovation capacity of several algorithms. For each patch of the synthesized image, we find its nearest patch in the input image. From left to right: synthesized image obtained from different algorithms, map of location of the nearest patch, \(4\times \) zoom on its central part, and histogram of the distances to the nearest patch. From top to bottom: input image, synthesized images using [10], using [33], using [14], and using our algorithm. Both the location maps and the distance histograms show that the pixel-based synthesis algorithm from [10] mostly creates a tiling, whereas our algorithm and the one from [33] are able to generate novel images. Note that the wide black bar on the top-right histogram corresponds to the zero bin and has been made wider and shorter to fit on the page

4.2 Detailed Analysis of the Synthesis Functional

The influence of each of the three terms of the function (3) is illustrated in Fig. 6. The effect of each constraint is demonstrated by suppressing one of them while preserving the two others. This experiment yields the following observations.

The patch decomposition term relies on the adaptive dictionary, which is good at representing geometrical features. The syntheses produced without this constraint have little geometrical content: edges and sharp details are completely lost.

On the other hand, a sparse decomposition cannot represent the granularity of textures like sand or rocks. This is due to the noisy aspect of such textures, which cannot be sparsely represented. It is approximated by an almost-constant value, and the generated texture is too smooth when using only the histogram and patch sparsity terms.

The spectrum term acts as a complement of the patch sparsity term. It is well suited to representing granularity, as may be seen on the sand example. However, the spectrum cannot produce any edge, as illustrated by the synthesis with only the spectrum and histogram terms. Preserving only the spectrum is proposed and illustrated in [14].

We also remark that the low frequencies imposed by the spectrum enhance the regularity of the texture. Without this constraint, the results have some brighter or darker regions, whereas the results using the spectrum are more stationary.

The histogram term deals with the contrast and the color faithfulness of the result. Note that the patch decomposition alone has no control over the dynamic range of the patches.

Fig. 6

Synthesis results when dropping one of the three terms. The histogram term prevents a loss of contrast. The spectrum term spatially regularizes the synthesis and generates granularity. The patch decomposition term handles sharp edges. Note that the post-processing is not performed, to allow a fair comparison of each term

4.3 Influence of the Parameters

This section illustrates the influence of the parameters of our synthesis algorithm: the size \(\tau \times \tau \) of the patches, the number \(J\) of scales, the weights \(\alpha \) of each term of the function, and the sparsity factor \(S\).

4.3.1 Patch Size and Multi-scale

Figure 7 illustrates the influence of the multi-scale process and of the size of the patches.

Fig. 7

Synthesis results with different numbers of scales (top) and different sizes of patches (bottom). Multi-scale ensures spatial coherency of the texture. The patches must roughly have the size of the patterns of the texture

The effect of the multi-scale is clearly visible on all the results: it increases the spatial coherence of the texture. In the case of quasi-periodic textures with small patterns, it increases the regularity of the texture: the patterns are more regularly distributed. In the case of textures with large patterns, it produces large structures such as long and continuous edges. This is not possible with the single scale synthesis.

The size of the patches must be roughly the size of the elements in the texture. A patch size of \(12\times 12\) pixels is a good compromise on images of size \(128\times 128\) pixels from the VisTex database.

If the patches are too small compared to the patterns of the texture, the patterns cannot be well synthesized and suffer from geometrical distortions. On the contrary, using too large patches makes the sparse approximation in \(D_0\) coarser, and the small details are lost. Moreover, bigger patches imply more coefficients, a bigger dictionary, and less sparsity: the sparse decomposition algorithm becomes far slower and less accurate (we recall that we use a heuristic since the problem is NP-hard).

Figure 8 shows the synthesis of a higher resolution image of \(512\times 512\) pixels, which illustrates the limitations of our method. The size of the patches is set to \(18\times 18\) and \(5\) scales are used. The algorithm is able to synthesize an image with an aspect similar to the lower resolution case (\(128\times 128\) pixels). However, the zoom in Fig. 8 shows that the result is less sharp, mainly because the patches are bigger. Synthesizing very large geometric structures as well as small details is challenging. A multi-scale scheme with a varying patch size may be more appropriate for such a case.

Fig. 8

Synthesis of a \(512\times 512\) image using \(5\) scales and patches of size \(18\times 18\). The synthesized image has the same appearance as in the \(128\times 128\) image. We see on the zoomed result that fine details are lost since bigger patches are used: a more evolved multi-scale scheme may be necessary to handle both very large and very thin patterns.

4.3.2 Weights of the Function

Figure 9 shows results obtained when varying the weighting of the different terms in the functional (3) instead of choosing \(\alpha _h= \alpha _s= \tilde{\alpha }_p= 1\).

This default choice provides reasonable results for most of the textures, but the weights can be tuned for each texture to be synthesized. We did not find an easy way to automatically compute a good set of weights for each texture, although this would be useful in practice.

The syntheses shown in Fig. 9 are obtained, from left to right, with the following settings of \(\alpha _s/\tilde{\alpha }_p\) : \(5/1\), \(3/1\), \(1/1\), \(1/3\), and \(1/5\), always with \(\alpha _h= (\alpha _s+\tilde{\alpha }_p)/2\). Textures with large structures, long edges, or geometrical elements are better synthesized with a higher weight for the patch sparsity term. On the contrary, a texture with neither sharp nor structured elements but with a granular aspect is better reproduced with more weight on the spectrum term. In the extreme case of a texture without any geometrical element, the patch term can be removed: see the examples of the sand texture in Fig. 2 with [14] (spectrum only) or in Fig. 6 without the patch sparsity term (spectrum and histogram).

The intermediate cases (a granular texture with some sharp patterns) are well synthesized with a balanced weighting \(\alpha _s= \tilde{\alpha }_p\).

Fig. 9

Synthesis results with different weightings: more spectrum on the left, more patch sparsity on the right. Textures with little geometrical content (top) are better reproduced with more spectrum; textures with large structures (bottom) need more patch sparsity

4.3.3 Sparsity

Figure 10 illustrates the effect of the sparsity parameter \(S\).

A larger parameter \(S\) means that the patches are a linear combination of more elements from the dictionary \(D_0\). In the case of texture synthesis, a large value of \(S\) allows the superposition of several atoms and creates artifacts, particularly visible on the example in Fig. 10.

Fig. 10

Synthesis results with different sparsity \(S=2,4,6,8\). This example illustrates the artifacts caused by the superposition of several atoms for larger \(S\).

On the contrary, the smallest value \(S=1\) constrains each patch to be proportional to a single atom from \(D_0\). Imposing \(S=1\) together with binary weights \(W\in \left\{ 0,1\right\} ^{N\times K}\) is an interesting alternative: it forces each patch to be equal to an atom of the dictionary. The normalization constraint \(\left\| d_n\right\| _2\le 1\) is no longer necessary and should be removed in this case.

Within this setting, the learning stage (9) becomes a \(K\)-means algorithm. The decomposition Algorithm 1 becomes a nearest-neighbor classification, with a constraint on the number of uses \(F^n\) of each atom. The dictionary is in this case a resampled and weighted version of the set of all the patches of the original image. It is more compact because it contains far fewer patches than the image. The nearest-neighbor search is thus far faster than an exhaustive search. Observe that in the case of an exhaustive patch search, the synthesis method is similar to the “Texture Optimization” algorithm [20] and to the video inpainting approach of Wexler et al. [40].
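In this \(S=1\), binary-weight setting, the sparse coding step reduces to a nearest-centroid assignment, as sketched below (the usage-count constraint \(F^n\) is omitted for simplicity; the names are ours).

```python
import numpy as np

def decompose_s1(P, D):
    """Sparse coding with S = 1 and binary weights: each patch is assigned to
    its nearest atom (nearest-centroid classification).

    P: (K, d) patches, D: (N, d) dictionary atoms.
    Returns the chosen atom index and the reconstruction of each patch.
    """
    # all pairwise squared distances between patches and atoms
    d2 = np.sum(P**2, 1)[:, None] - 2 * P @ D.T + np.sum(D**2, 1)[None, :]
    idx = np.argmin(d2, axis=1)
    return idx, D[idx]
```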

4.3.4 Other Settings

Figure 11 shows the effect of several options of our method.

The set of patches \(\varPi u\) can be subsampled to improve the speed of the algorithm as explained in Sect. 3.5.

We only consider the patches lying on a grid of step \(\varDelta \) with \(\varDelta >1\). This leads to a mosaic artifact, which appears because the borders of the patches become visible. To avoid this artifact, we use random translations of the grid at each iteration: the borders of the patches are not always at the same relative position, and the mosaic artifact disappears. Figure 11 shows that using \(\varDelta =4\) with random offsets gives a result similar to the one obtained with \(\varDelta =1\), while being \(16\) times faster.

Alternated projections on the sets \({\fancyscript{C}_{\mathrm {h}}}\), \({\fancyscript{C}_{\mathrm {s}}}\), and \({\fancyscript{C}_{\mathrm {p}}}\) (Sect. 3.3) lead to the images in Fig. 11 (right). The result is quite similar to ours (\(2^\text {nd}\) image from the right) but has some artifacts: some small edges appear, and some grain is missing. More generally, the results obtained using this alternated projections algorithm are sometimes similar to ours, and sometimes suffer from several kinds of artifacts, which points to the poor convergence of that method.

The post-processing step, that is, the final projection on \({\fancyscript{C}_{\mathrm {s}}}\) and then on \({\fancyscript{C}_{\mathrm {h}}}\), removes the residual smoothness of the texture caused by the patch sparsity constraint. This is visible in Fig. 11 (\(3^\text {rd}\) and \(4^\text {th}\) images from the left).

Fig. 11

Synthesis results with different settings. Top right: synthesis using all the patches (\(\varDelta =1\)) without the post-processing step (the final projection on \({\fancyscript{C}_{\mathrm {s}}}\) and then \({\fancyscript{C}_{\mathrm {h}}}\)); middle left: using only the patches on a grid of step \(\varDelta =4\); middle right: adding random translations of the grid at each iteration; bottom right: adding the post-processing step; bottom left: comparison with alternated projections instead of our gradient descent scheme

5 Conclusion

In this paper, we proposed a texture synthesis framework which puts together both a Fourier spectrum constraint and a sparsity constraint on the patches. Thanks to these two constraints, our method is able to synthesize a wide range of textures without a patch-based copy-paste approach. The spectrum controls the amount of oscillations and the grain of the synthesized image, while the sparsity of the patches handles geometrical information such as edges and sharp patterns.

We propose a variational framework to take both constraints into account. This framework is based on a non-convex energy function defined as the sum of weighted distances between the image (or some linear transform of it) and these constraints. The synthesis consists in finding local minima of this function. The proposed framework is generic and can take other constraints into account: in addition to the sparsity and spectrum constraints, we use a histogram constraint to control the color distribution.

A numerical exploration of the synthesis results shows that, contrary to copy-based approaches, our method produces a high degree of innovation and does not copy verbatim whole parts of the input exemplar.