1 Introduction

Image decomposition is a key step in image analysis that extracts two semantically meaningful components from an image. Given an input image f, the decomposition problem can be written as \(f = u_{C} + v_{T}\), where the unknown \(u_{C}\) is the structural component modeling the well-structured homogeneous regions and the unknown \(v_{T}\) is the textural component capturing oscillating patterns such as texture and/or noise. Image denoising and structure–texture image decomposition are two representative decomposition problems, each hinging on how the two parts are formulated with different priors. In this paper, we focus on structure–texture decomposition, whose difficulty lies in two aspects. Firstly, the problem is ill-posed since the number of unknowns is twice the number of known variables, so choosing appropriate constraints that yield a unique solution is challenging. Secondly, as there is no clear distinction between structures and textures across different images, or even within a single image, defining the scales of both parts and extracting structures and textures consistent with the Human Visual System (HVS) is another difficult problem.

Numerous techniques have been developed to implement decomposition models. For example, various variational techniques separate image structures and textures by employing different regularizations. Early methods typically applied total variation (TV) [50] and its extensions to extract the piecewise constant structural part and characterized the textural part with different functional spaces or norms, such as TV-G [42], TV-div(\(L^{p}\)) [57], TV-\(H^{-1}\) [46], TV-\(H^{-s}\) [32], TV-div(BMO) [30], TV-\(L^{1}\) [72], RTV [66] and JCAS [19]. However, these methods often over-smooth the structural component. Filtering-based methods were proposed to filter textures out of the structural image according to edge strength [12, 13, 29, 31, 51, 70, 71] or spatial scale [17, 24, 67, 75, 77]. Such methods can commonly remove weak or smaller-scale edges and preserve strong or larger-scale ones. However, they still cannot capture the distinction between image structure and texture, especially when weak edges belong to the image structure while strong ones correspond to the texture. Recently, deep neural networks (DNNs) have been applied to image decomposition [8, 14, 15, 28, 65]. Since there is no ground truth for image structure and texture, these methods depend on man-made datasets built from sparse annotations or large numbers of external samples and perform unstably on real-world data. In addition, STD [76] proposed a self-example, unsupervised and online learning model, but it still loses too many details of the structural component due to its TV regularization, especially for examples whose structures and textures are hard to distinguish.

In this paper, we take advantage of both the variational framework and sparse representation for structure–texture image decomposition. In particular, we propose non-convex total generalized variation (NTGV) as the regularization term of the image structure to capture more structural details, and we extract fine-scale image texture with convolutional sparse coding (CSC). To better distinguish textural edges from structural ones, both a structure-aware and a texture-aware measure are incorporated. We develop an alternating minimization scheme based on the alternating direction method of multipliers (ADMM) and show the effectiveness of our method on several applications both quantitatively and qualitatively.

The rest of the paper is organized as follows. In Sect. 2, we review related works on structure–texture image decomposition. We present the proposed model in Sect. 3 and introduce the algorithm details in Sect. 4. A comprehensive analysis is provided in Sect. 5. In Sect. 6, we present the numerical experiments on several applications and make comparisons with the corresponding state-of-the-art methods.

2 Related work

Generally, image decomposition is also known as image layer separation, including reflectance (albedo)–illumination (shading) decomposition and structure (cartoon)–texture decomposition. The former mainly involves intrinsic image decomposition and Retinex-based decomposition designed for image enhancement, surface re-texturing, object insertion, scene relighting, etc., while the latter aims to extract the main information of images and has been successfully applied to HDR image tone mapping, detail enhancement, semantic image segmentation, non-photorealistic abstraction, etc. We explored total generalized variation regularization for Retinex-based decomposition in our previous work [58] and consider the non-convex TGV to separate a single image into structural and textural components in this paper. Therefore, we give a brief overview of related work on structure (cartoon)–texture decomposition in this section. Over the past few decades, many implementations and improvements of structure–texture image decomposition have been studied, and we mainly discuss the three classes of methodologies most relevant to ours: adaptive neighborhood filters, variational frameworks and deep learning-based methods.

Adaptive neighborhood filter Image structure can be obtained easily by filtering out image details, and well-known adaptive filters include the bilateral filter [54], guided filter [22] and non-local filter [6]. To better preserve image edges, extensions appeared subsequently, e.g., the fast bilateral filter [13], gradient domain-guided filter [22] and hashed non-local filter [6]. To suppress halo artifacts around structures, the weighted least squares (WLS) filter [17] was proposed by introducing gradient-based smoothness weights and has been applied successfully to HDR tone mapping and detail manipulation. Inspired by WLS, Li et al. [31] proposed an edge-aware weighting strategy in the paradigm of guided filtering, the tree filter [3] was designed around spatial, range and tree distances to achieve strong image smoothing, and side window filtering (SWF) [70] was introduced to adaptively select the support domain of the filter. In addition, the anisotropic diffusion equation [48] and zero crossings of the derivative of the histogram [26] were proposed in different mathematical forms, and a pyramid of Laplacian filters [47] was presented to characterize edges as large-scale edges or small-scale details. Since the aforementioned methods are not designed to explicitly describe both image structures and textures, their performance on image decomposition is not satisfactory. Recently, more and more works [24, 67, 75, 77] have focused on the spatial scale of regions to distinguish image structures from textures. The rolling guidance filter (RGF) [75] employed scale space theory [33] to remove small-scale edges and iteratively smoothed large-scale edges with a rolling guidance scheme. Jeon et al. [24] used patch-based statistics to find an optimal per-pixel smoothing scale. Zhou et al. [77] measured the scale with a scanline along the gradient direction through iterative global optimization (IGO). Xu et al. [67] adaptively adjusted the window size by the distances from pixels to structures. Neighborhood filters based on spatial scale are intuitive and easy to implement, but they still cannot reliably distinguish image structures from textures, especially for complex structures and textures at multiple scales.

Variational framework Variational models have been widely applied to image decomposition during the last two decades, aiming to find appropriate functional spaces to describe structures and textures, respectively. One of the most notable models is the TV approach [50], and early studies discussed TV-based models characterizing textures by different functional spaces or norms [2], including TV-G [42], TV-div(\(L^{p}\)) [57], TV-\(H^{-1}\) [46], TV-\(H^{-s}\) [32], TV-div(BMO) [30], TV-\(L^{1}\) [72], etc. To model the textural component better, sparse coding [19, 52, 74] and low rank [16] priors were introduced into TV-based models. However, these models often over-smooth the structural component, an inherent property of TV. Beyond TV [50], many other priors have also been proposed. Xu et al. [64] put forward \(L^{0}\) gradient minimization to remove weak edges and retain strong edges. Similarly, Ono [45] formulated the \(L^{0}\)-norm of gradients as a hard constraint solved by an \(L^{0}\) projection algorithm. The relative total variation (RTV) model [66] used a pixel-wise windowed TV normalized by a windowed inherent variation, which is regarded as a major advance in describing image textures. Many scale-aware measures in image decomposition models [21, 24, 67, 77] essentially originate from RTV. Although different scale-aware measures were discussed, the resulting structure usually looks piecewise constant, so structural edges are mistaken for textural details. As a successful modification of TV, total generalized variation (TGV) [4] has been applied to image decomposition to preserve structural edges. Following adaptive TV [34, 37], Xu et al. [63] proposed an adaptive TGV to preserve key features such as object boundaries, and Liu [35] developed a weighted TGV-Gabor model. Following Meyer's TV-G model [42], Xu et al. [62] replaced the TV term in TV-G with a TGV regularization term, aiming to capture the structural component with more details, and Lu et al. [38] used a TGV regularization term to model the structural component and its dual TGV\(^{*}\) for the textural component. Although all of these works employed TGV to reduce the staircase effect and preserve edges, they failed to recover the repetitive or oscillatory nature of textures.

Deep learning-based method Deep learning-based methods have recently been developed for image decomposition. Deep edge-aware filters (DEF) [65] were proposed to approximate various filters with a deep convolutional neural network trained in the gradient domain. Fan et al. [14] addressed the problem by constructing two cascaded sub-networks, namely an edge prediction network and an image reconstruction network. Chen et al. [8] used a fully convolutional network to approximate a wide variety of variational models including TV [50], RTV [66] and \(L^{0}\) [64]. Kim et al. [28] learned the structure prior through a context aggregation network (CAN) and generated training data by adding Gaussian noise to clean image patches. With a similar idea, Lu et al. [39] trained two prediction networks to highlight textural and structural parts on synthetic samples. However, all the above data-driven approaches rely heavily on large amounts of man-made training data. To overcome these difficulties, Fan et al. [15] employed a modified CAN on external samples and Zhou et al. [76] adopted a modified U-net (MU-net) on self-example samples. Although both methods are unsupervised, the learned variational priors still fail to distinguish image structures and textures well, especially in complex cases.

3 NTGV + CSC decomposition model

In this section, we first review TGV and NTGV regularization in Sect. 3.1 and CSC in Sect. 3.2, and then formulate our optimization for structure–texture image decomposition in Sect. 3.3.

3.1 Background of TGV and NTGV

TGV was first proposed by Bredies et al. [4] and has been widely applied in image processing [20, 55, 58] and geometry processing [36], written as follows:

$$\begin{aligned} \begin{aligned} \mathrm {TGV}_{\varvec{\alpha }}^{k}(u) =&\sup _{\nu } \Bigg \{ \int _{\varOmega }u\mathrm {div}^{k}(\nu )\mathrm {d}x \ | \ \nu \in \mathcal {C}_{c}^{k}(\varOmega ,\mathrm {Sym}^{k}(\mathbb {R}^{d})),\\&\Vert \mathrm {div}^{j}(\nu )\Vert _{\infty } \le \alpha _{j},\ \ j = 0,\ldots ,k-1 \Bigg \}, \end{aligned} \end{aligned}$$
(1)

where \(\varvec{\alpha } =(\alpha _{0},\alpha _{1},\ldots ,\alpha _{k-1})\) is a fixed positive vector, \(\mathrm {Sym}^{k}(\mathbb {R}^{d})\) denotes the space of symmetric k-tensors on \(\mathbb {R}^{d}\), i.e., \(\mathrm {Sym}^{k}(\mathbb {R}^{d}) = \{ \xi :\mathbb {R}^{d} \times \cdots \times \mathbb {R}^{d} \rightarrow \mathbb {R} \mid \xi \ \text {is } k\text {-linear and symmetric}\},\) and \(\mathcal {C}_{c}^{k}(\varOmega ,\mathrm {Sym}^{k}(\mathbb {R}^{d}))\) is the space of compactly supported symmetric tensor fields.

Meanwhile, \(\mathrm {TGV}_{\varvec{\alpha }}^{k}\) can be interpreted as a k-fold infimal convolution by employing the Fenchel–Rockafellar duality formula, as follows:

$$\begin{aligned} \mathrm {TGV}_{\varvec{\alpha }}^{k}(u) = \inf _{u_{j}} \sum _{j=1}^{k}\varvec{\alpha }_{k-j}\int _{\varOmega }|\mathcal {E}(u_{j-1})-u_{j}|\mathrm {d}x, \end{aligned}$$
(2)

where \(u_{0}=u\), \(u_{k}=0\), \(u_{j}\in \mathcal {C}_{c}^{k-j}(\varOmega ,\mathrm {Sym}^{j}(\mathbb {R}^{d}))\) for \(j=1,\ldots ,k\), and \(\mathcal {E}\) represents the distributional symmetrized derivative, i.e., \(\mathcal {E}(u_{j-1})=\frac{\nabla u_{j-1} + (\nabla u_{j-1})^{T}}{2}\). Thus, \(\mathrm {TGV}_{\varvec{\alpha }}^{k}\) automatically balances the first to the k-th derivatives of u among themselves and reduces the staircasing effects of the TV.
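For the second-order case \(k=2\), which is the one used throughout this paper, (2) specializes to

$$\begin{aligned} \mathrm {TGV}_{\varvec{\alpha }}^{2}(u) = \inf _{\varvec{p}}\ \alpha _{1}\int _{\varOmega }|\nabla u - \varvec{p}|\mathrm {d}x + \alpha _{0}\int _{\varOmega }|\mathcal {E}(\varvec{p})|\mathrm {d}x, \end{aligned}$$

which is the convex counterpart of the non-convex regularizer (7) employed in our model.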

Recently, studies [25, 43, 44] have demonstrated that the non-convex variant of TGV performs better than the convex one for image restoration by preserving edges; it can be written as follows:

$$\begin{aligned} \mathrm {NTGV}_{\varvec{\alpha }}^{k}(u) = \inf _{u_{j}} \sum _{j=1}^{k}\varvec{\alpha }_{k-j}\int _{\varOmega }\phi _{j}(|\mathcal {E}(u_{j-1})-u_{j}|)\mathrm {d}x, \end{aligned}$$
(3)

where \(\{\phi _{j}\}\) are non-convex prototype functions, such as the regularized \(\ell _{p}\)-norm \(\phi _{j}(s) = (s+\varepsilon _{j})^{p}\) with \(0<p<1\) and \(\varepsilon _{j} >0\), or the \(\log \)-functions \(\phi _{j}(s) = \frac{1}{\mu _{j}}\log (1+\mu _{j} s)\) and \(\phi _{j}(s) = \frac{1}{\mu _{j}}\log (1+\mu _{j} s^2)\) with \(\mu _{j}>0\).

Minimization problems involving the non-convex \(\ell _{p}\)-norm regularizer are hard to solve because finding a limiting supergradient of \(\Vert \cdot \Vert _{p}\) at zero is difficult. Moreover, according to the studies in [25, 44, 56], the \(\log \)-function is well suited to reconstructing images that are piecewise smooth with some sharp jump discontinuities. Structural images are mostly piecewise smooth; thus, we utilize the second-order \(\mathrm {NTGV}\), denoted by \(\mathrm {NTGV}_{\varvec{\alpha }}^{2}\), in this work.
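For concreteness, the log prototype function and its derivative, which later supplies the reweighting factors of the IRLA scheme in Sect. 4, can be sketched as follows (a minimal sketch; `rho` plays the role of \(\mu _{j}\) above):

```python
import numpy as np

def phi_log(s, rho):
    """Non-convex log prototype: phi(s) = log(1 + rho*s) / rho."""
    return np.log1p(rho * s) / rho

def phi_log_grad(s, rho):
    """Derivative phi'(s) = 1 / (1 + rho*s); this is exactly the IRLA
    reweighting factor used in Sect. 4 (cf. d_tilde = 1/(rho1*|d| + 1))."""
    return 1.0 / (1.0 + rho * s)

# The penalty grows sub-linearly, so large jumps (edges) are penalized less:
s = np.linspace(0.0, 5.0, 6)
print(phi_log(s, rho=2.0))       # concave, increasing
print(phi_log_grad(s, rho=2.0))  # weights decay for large |s|
```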

3.2 Background of CSC

Sparse representation encodes a signal vector \(\varvec{y}\) as the linear combination of a few atoms in a dictionary \(\varvec{D}\) by solving the following inverse problem:

$$\begin{aligned} \arg \min _{\varvec{x}}\frac{1}{2}\Vert \varvec{D}\varvec{x} - \varvec{y}\Vert ^{2}_{2} + \lambda \Vert \varvec{x}\Vert _{1}. \end{aligned}$$
(4)

Over the past two decades, this optimization has been solved independently on sets of overlapping image patches covering the image [9, 53] and has achieved state-of-the-art results in various computer vision tasks [41, 49, 61, 68]. As a category of sparse representation, CSC was first proposed by Zeiler et al. [73] to decompose the input image into N sparse feature maps with N filters, instead of sparsely representing a vector by a linear combination of dictionary atoms, replacing (4) with

$$\begin{aligned} \arg \min _{\{\varvec{x}_{m}\}}\frac{1}{2}\Vert \sum _{m}\varvec{d}_{m}*\varvec{x}_{m} - \varvec{y}\Vert ^{2}_{2} + \lambda \sum _{m}\Vert \varvec{x}_{m}\Vert _{1}, \end{aligned}$$
(5)

where \(\{\varvec{d}_{m}\}\) is a set of N dictionary filters, \(*\) denotes convolution, and \(\{\varvec{x}_{m}\}\) is a set of coefficient maps, each of the same size as \(\varvec{y}\).

The convolutional decomposition avoids dividing the whole image into overlapping patches and naturally exploits the consistency prior in the decomposition procedure. It has also seen substantial progress in numerical optimization. Zeiler et al. [73] adopted an alternating minimization over two splitting subproblems, one solving a large linear system with an iterative method and the other reducing to a simple shrinkage. Other algorithms operating in the spatial domain have been proposed, including coordinate descent [27] and a proximal gradient method [7]. Bristow et al. [5] proposed a fast CSC algorithm by exploiting the block circulant with circulant blocks (BCCB) structure of the system matrix in the Fourier domain, and Wohlberg [60] further improved this algorithm by introducing a method for solving the linear systems whose cost is linear, instead of cubic, in the number of filters.
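To make the Fourier-domain idea concrete, the following minimal sketch solves (5) with plain ISTA, evaluating all convolutions via FFTs; the circular boundary handling and the step size bound are our simplifying assumptions, not the algorithms of [5] or [60]:

```python
import numpy as np

def csc_ista(y, filters, lam, n_iter=100):
    """Minimal FFT-domain ISTA for (5): min 0.5*||sum_m d_m * x_m - y||^2
    + lam * sum_m ||x_m||_1, assuming circular convolution so that each
    coefficient map x_m has the size of y."""
    H, W = y.shape
    D = np.stack([np.fft.fft2(d, s=(H, W)) for d in filters])  # filter DFTs, (M, H, W)
    Y = np.fft.fft2(y)
    x = np.zeros((len(filters), H, W))
    # Upper bound on the Lipschitz constant of the data-term gradient:
    # the largest value over frequencies of sum_m |D_m|^2.
    tau = 1.0 / np.max(np.sum(np.abs(D) ** 2, axis=0))
    for _ in range(n_iter):
        X = np.fft.fft2(x)                                     # DFT over the last two axes
        R = np.sum(D * X, axis=0) - Y                          # residual in the Fourier domain
        grad = np.real(np.fft.ifft2(np.conj(D) * R))           # adjoint convolution (correlation)
        x = x - tau * grad                                     # gradient step
        x = np.sign(x) * np.maximum(np.abs(x) - tau * lam, 0)  # soft thresholding
    return x
```

Here `filters` is a list of small 2D kernels; the BCCB structure mentioned above is what makes each convolution a pointwise product after the FFT.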

3.3 Proposed model

Here, we present our NTGV + CSC model for structure–texture decomposition, recovering the structural component with more details by NTGV and the textural component with its repetition property by CSC. For an \(m \times n\) input image \(\varvec{y}\), we characterize its structural part \(\varvec{u}\) with \(\mathrm {NTGV}_{\varvec{\alpha }}^{2}(\varvec{u})\) and approximate its textural part \(\varvec{v}\) by the sum of N convolutions of \(s\times s\) filters \(\varvec{f}_{i}\) with sparse feature maps \(\varvec{Z}_{i}\) of size \((m+s-1)\times (n+s-1)\). The image decomposition is achieved by solving the following objective function:

$$\begin{aligned} \begin{aligned}&\min _{\varvec{u},\varvec{f},\varvec{Z}}\frac{1}{2}\Vert \varvec{y}\!-\!\varvec{u}\!-\!\sum _{i=1}^{N}\varvec{f}_{i}*\varvec{Z}_{i}\Vert _{F}^{2}\!+\!\omega _{s}\mathrm {NTGV_{\varvec{\alpha }}^{2}}(\varvec{u}) \!+ \!\lambda \sum _{i=1}^{N}\Vert \varvec{Z}_{i}\Vert _{1},\\&\ \ {\mathrm{s.t.}} \ \ \Vert \varvec{f}_{i}\Vert _{F}^{2}\le 1 \ \ for \ \ i=1,\ldots ,N \end{aligned}\nonumber \\ \end{aligned}$$
(6)

where \(\lambda \) is a positive parameter controlling the \(L_{1}\) penalty, \(\omega _{s}\) is a weight function defined on each pixel to distinguish structural edges from textural edges, and the inequality constraints on the filters \(\varvec{f}_{i}\) prevent the dictionary from absorbing all of the system's energy. The term \(\frac{1}{2}\Vert \varvec{y}-\varvec{u}-\sum _{i=1}^{N}\varvec{f}_{i}*\varvec{Z}_{i}\Vert _{F}^{2}\) enforces fidelity, and \(\mathrm {NTGV_{\varvec{\alpha }}^{2}}(\varvec{u})\) is minimized over all deformation fields \(\varvec{p}=(p_{1},p_{2})\) on the image domain \(\varOmega \), which reads

$$\begin{aligned} \inf _{\varvec{p} \in \mathcal {C}_{c}^{2}(\varOmega ,\mathbb {R}^{2})}\alpha _{1}\int _{\varOmega }\phi _{1}(|\nabla {\varvec{u}} - \varvec{p}|)\;{\mathrm{d}}x + \alpha _{0}\int _{\varOmega }\phi _{2}(|\mathcal {E}(\varvec{p})|)\;{\mathrm{d}}x, \end{aligned}$$
(7)

where \(\alpha _{0}\) and \(\alpha _{1}\) are positive parameters balancing between the first- and second-order derivative of \(\varvec{u}\), \(\phi _{i}(s)=\frac{1}{\rho _{i}}\log (1+\rho _{i}s)\) with the parameters \(\rho _{i}\!>\!0\) controlling the non-convexity of the first- and second-order regularization terms.

The Setting of \(\omega _{s}\) The weight function \(\omega _{s}\) plays an important role in highlighting pixels on structural or textural edges and guiding a good image decomposition. Following [76], we introduce two measures: a structure-aware measure and a texture-aware measure. The former reveals the main structural information of an image, and the latter describes the repetition property of textures.

Taking advantage of the anisotropy property, structural gradients have a dominant orientation, represented by one of the eigenvectors of the following positive semi-definite matrix \(\mathbf {J}\) [1]:

$$\begin{aligned} \mathbf {J}(i) = \begin{bmatrix} \mathbf {g}_{x}^{T}(i)\mathbf {g}_{x}(i) & \mathbf {g}_{x}^{T}(i)\mathbf {g}_{y}(i)\\ \mathbf {g}_{y}^{T}(i)\mathbf {g}_{x}(i) & \mathbf {g}_{y}^{T}(i)\mathbf {g}_{y}(i)\end{bmatrix}, \end{aligned}$$

where i indicates the location of the patch, and \(\mathbf {g}_{x}\) and \(\mathbf {g}_{y}\) are the vectors collecting the gradients of each pixel in the patch along the abscissa and the ordinate, respectively. The matrix \(\mathbf {J}\) has two non-negative eigenvalues, denoted \(\lambda _{1}\) and \(\lambda _{2}\), and its dominant direction is given by the eigenvector corresponding to the smaller one. For any patch, its anisotropy degree A can be calculated by [23]

$$\begin{aligned} A_{\mathbf {J}} = \frac{\max {(\lambda _{1},\lambda _{2})}-\min {(\lambda _{1},\lambda _{2})}}{\lambda _{1}+\lambda _{2}}, \end{aligned}$$
(8)

where \(\max (\cdot ,\cdot )\) and \(\min (\cdot ,\cdot )\) return the maximum and minimum of their two arguments, respectively. From Eq. (8), \(A_{\mathbf {J}}\) ranges from 0 to 1, and a larger \(A_{\mathbf {J}}\) indicates that the patch is more likely to belong to the image structure. Taking edge strength into consideration, the structure-aware measure \(M_{s}\) is designed as

$$\begin{aligned} M_{s}(i) = A_{\mathbf {J}}(i)\frac{\Vert \nabla {f}(i)\Vert _{2}}{R}, \end{aligned}$$
(9)

where f(i) is the pixel intensity at pixel i, \(\Vert \nabla {f}(i)\Vert _{2}\) denotes the gradient magnitude in the \(L_{2}\) norm, and R is fixed to 255 for 8-bit images.
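A per-patch sketch of the structure-aware measure (8)–(9) might look as follows; the gradient filter and patch traversal are our own simplifications:

```python
import numpy as np

def structure_measure(f, patch=3, R=255.0):
    """Sketch of M_s in (9): anisotropy A_J of the patch structure tensor,
    scaled by the gradient magnitude at the center pixel."""
    gy, gx = np.gradient(f.astype(float))   # per-pixel gradients (rows = y, cols = x)
    H, W = f.shape
    r = patch // 2
    Ms = np.zeros((H, W))
    for i in range(r, H - r):
        for j in range(r, W - r):
            px = gx[i - r:i + r + 1, j - r:j + r + 1].ravel()
            py = gy[i - r:i + r + 1, j - r:j + r + 1].ravel()
            J = np.array([[px @ px, px @ py],
                          [py @ px, py @ py]])               # structure tensor of the patch
            lam = np.linalg.eigvalsh(J)                      # two non-negative eigenvalues
            A = (lam.max() - lam.min()) / max(lam.sum(), 1e-12)   # Eq. (8)
            Ms[i, j] = A * np.hypot(gx[i, j], gy[i, j]) / R       # Eq. (9)
    return Ms
```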

To define the texture-aware measure, we employ the histogram of oriented gradients (HOG) [10] for feature extraction, as in [76]. Specifically, given a pixel i, we detect repetitive patterns by going through all its nearby pixels and calculate the texture-aware measure \(M_{t}\) as follows:

$$\begin{aligned} M_{t}(i) = \frac{1}{\# (C_{N(i)})}\sum _{j\in C_{N(i)}}\cos \theta _{ij}\exp (-D_{KL}(\mathbf {h}_{j}||\mathbf {h}_{i})), \end{aligned}$$
(10)

where \(\theta _{ij}\) is the angle between the edge direction \(\mathbf {v}_{ij}\) and the gradient direction, \(D_{KL}(\cdot ||\cdot )\) denotes the Kullback–Leibler (KL) divergence, \(\mathbf {h}_{i}\) and \(\mathbf {h}_{j}\) are the HOG features of pixels i and j, respectively, and \(\# (C_{N(i)})\) counts the elements of the set consisting of the pixels with noticeable gradients in the neighborhood of pixel i.
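A heavily simplified sketch of (10) is given below; we stand in an 8-bin orientation histogram for the full HOG descriptor, take \(\theta _{ij}\) as the angle between the displacement from i to j and the gradient direction at i, and threshold gradient magnitudes to form \(C_{N(i)}\). All of these choices are our reading of [76], not a verbatim reimplementation:

```python
import numpy as np

def orientation_hist(gx, gy, i, j, r=2, bins=8):
    """8-bin histogram of gradient orientations in a (2r+1)x(2r+1) patch
    (a lightweight stand-in for the full HOG descriptor)."""
    ang = np.arctan2(gy[i - r:i + r + 1, j - r:j + r + 1],
                     gx[i - r:i + r + 1, j - r:j + r + 1]).ravel()
    h, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi))
    return h / max(h.sum(), 1)

def texture_measure_at(f, i, j, radius=6, tau=1.0, eps=1e-6):
    """Sketch of M_t(i) in (10) at pixel (i, j)."""
    gy, gx = np.gradient(f.astype(float))
    H, W = f.shape
    hi = orientation_hist(gx, gy, i, j) + eps
    vals = []
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            ii, jj = i + di, j + dj
            if (di, dj) == (0, 0) or not (2 <= ii < H - 2 and 2 <= jj < W - 2):
                continue
            if np.hypot(gx[ii, jj], gy[ii, jj]) < tau:
                continue                        # C_N(i): neighbors with noticeable gradients
            hj = orientation_hist(gx, gy, ii, jj) + eps
            kl = np.sum(hj * np.log(hj / hi))   # D_KL(h_j || h_i)
            v = np.array([dj, di]) / np.hypot(di, dj)      # assumed direction i -> j
            g = np.array([gx[i, j], gy[i, j]])
            cos_t = abs(v @ g) / max(np.hypot(*g), eps)    # assumed cos(theta_ij)
            vals.append(cos_t * np.exp(-kl))
    return float(np.mean(vals)) if vals else 0.0
```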

Based on the above, it is straightforward to combine the structure-aware and texture-aware measures to guide our decomposition instead of using the same weight for all pixels. For an edge pixel on a structural edge, i.e., \(M_{s}\) is high and \(M_{t}\) is low, we want the edge to be preserved as sharply as in the input by reducing the impact of \(\mathrm {NTGV}_{\varvec{\alpha }}^{2}(\varvec{u})\). For an edge pixel on a textural edge, i.e., \(M_{s}\) is low and \(M_{t}\) is high, we want the edge to be removed by increasing the impact of \(\mathrm {NTGV}_{\varvec{\alpha }}^{2}(\varvec{u})\). So the weight function \(\omega _{s}\) at pixel i is defined as

$$\begin{aligned} \omega _{s}(i) = (1-t)(1-M_{s}(i)) + tM_{t}(i), \ \ 0<t<1. \end{aligned}$$
(11)

4 Optimization algorithm for image decomposition

In this section, we develop an efficient implementation for solving problem (6). First of all, we define linear operators \(\varvec{F}_{i}\) such that \(\varvec{F}_{i}\varvec{z}_{i}=\varvec{f}_{i}*\varvec{Z}_{i}\), where \(\varvec{z}_{i}\) is the vectorization of the feature map \(\varvec{Z}_{i}\). Based on this, problem (6) is equivalent to

$$\begin{aligned} \begin{aligned}&\min _{\varvec{u},\varvec{f},\varvec{z}}\frac{1}{2}\Vert \varvec{y}-\varvec{u}-\sum _{i=1}^{N}\varvec{F}_{i}\varvec{z}_{i}\Vert _{F}^{2}+\omega _{s}\mathrm {NTGV_{\varvec{\alpha }}^{2}}(\varvec{u}) + \lambda \sum _{i=1}^{N}\Vert \varvec{z}_{i}\Vert _{1},\\&\ \ {\mathrm{s.t.}} \ \ \Vert \varvec{f}_{i}\Vert _{F}^{2}\le 1 \ \ for \ \ i=1,\ldots ,N \end{aligned} \end{aligned}$$
(12)

where \(\varvec{F}_{i}\) is the BCCB matrix corresponding to filter \(\varvec{f}_{i}\). Since the three unknowns \(\varvec{u}\), \(\varvec{f}\) and \(\varvec{z}\) are coupled in (12), our idea is to update the three variables alternately, and the details of each subproblem are summarized in Algorithm 1.

[Algorithm 1: alternating minimization scheme for solving problem (12)]
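The outer loop of Algorithm 1 can be sketched as follows; `update_u`, `update_z`, `update_f`, `init_pca_filters` and `reconstruct` are hypothetical names standing for the subproblem solvers detailed below:

```python
import numpy as np

def decompose(y, n_filters=7, filter_size=5, lam=1e-2, max_iter=10):
    """Sketch of Algorithm 1: alternately update u (structure), z (feature
    maps) and f (filters). update_u / update_z / update_f solve (13), (21)
    and (22); init_pca_filters builds the PCA dictionary of the residual and
    reconstruct forms v = sum_i f_i * Z_i (all placeholders for the steps
    described in this section)."""
    v = np.zeros_like(y)                 # textural part, initialized to zero
    u = update_u(y - v)                  # NTGV subproblem (Algorithm 2)
    f = init_pca_filters(y - u, n_filters, filter_size)
    z = update_z(y - u, f, lam)          # convolutional sparse coding (21)
    for _ in range(max_iter):
        v = reconstruct(f, z)
        u = update_u(y - v)
        z = update_z(y - u, f, lam)
        f = update_f(y - u, z)           # constrained least squares (22)
    return u, reconstruct(f, z)
```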

Updating \(\varvec{u}\) Fixing \(\{\varvec{f}_{i}\}_{i=1,\ldots ,N}\) and \(\{\varvec{z}_{i}\}_{i=1,\ldots ,N}\), we solve the \(\varvec{u}\) subproblem by optimizing the following problem:

$$\begin{aligned} \min _{\varvec{u}}\frac{1}{2}\Vert \varvec{y}-\varvec{u}-\sum _{i=1}^{N}\varvec{F}_{i}\varvec{z}_{i}\Vert _{F}^{2}+\omega _{s}\mathrm {NTGV_{\varvec{\alpha }}^{2}}(\varvec{u}), \end{aligned}$$
(13)

where \(\mathrm {NTGV_{\varvec{\alpha }}^{2}}(\varvec{u})\) is introduced in Sect. 3.1. Denoting \(\varvec{x} = \varvec{y}-\sum _{i=1}^{N}\varvec{F}_{i}\varvec{z}_{i}\), we adopt the iteratively reweighted \(l_{1}\) algorithm (IRLA) [43]. Introducing auxiliary variables \(\varvec{d}\) and \(\varvec{w}\), we reformulate the unconstrained problem (13) into the following constrained problem:

$$\begin{aligned} \begin{aligned}&\min _{\varvec{u},\varvec{p},\varvec{d},\varvec{w}}\frac{1}{2}\Vert \varvec{x}-\varvec{u}\Vert _{F}^{2}+\omega _{s}\alpha _{1}\phi _{1}(|\varvec{d}|) + \omega _{s}\alpha _{0}\phi _{2}(|\varvec{w}|)\\&\ \ {\mathrm{s.t.}} \ \ \varvec{d} = \nabla \varvec{u}-\varvec{p}, \varvec{w}= \mathcal {E}(\varvec{p}). \end{aligned} \end{aligned}$$
(14)

Hence, the IRLA can be applied to (14), and a convex minimization problem to update \((\varvec{u}^{k+1}, \varvec{p}^{k+1}, \varvec{d}^{k+1},\) \(\varvec{w}^{k+1})\) is given by

$$\begin{aligned} \begin{aligned}&\min _{\varvec{u},\varvec{p},\varvec{d},\varvec{w}}\left\{ \frac{1}{2}\Vert \varvec{x}-\varvec{u}\Vert _{F}^{2}+\omega _{s}\alpha _{1}\langle \tilde{\varvec{d}^{k}},|\varvec{d}|\rangle +\omega _{s}\alpha _{0}\langle \tilde{\varvec{w}^{k}},|\varvec{w}|\rangle \right\} \\&\ \ {\mathrm{s.t.}} \ \ \varvec{d} = \nabla \varvec{u}-\varvec{p}, \varvec{w}= \mathcal {E}(\varvec{p}). \end{aligned} \end{aligned}$$
(15)

where \(\tilde{\varvec{d}^{k}} = \frac{1}{\rho _{1}|\varvec{d}^{k}|+1}\) and \(\tilde{\varvec{w}^{k}} = \frac{1}{\rho _{2}|\varvec{w}^{k}|+1}\).

Problem (15) can be solved by the ADMM algorithm. With multipliers \(\lambda _{1},\lambda _{2}\), penalty parameter \(\mu \) and relaxation parameter \(\gamma \), one ADMM sweep yields

$$\begin{aligned} \left\{ \begin{aligned} \varvec{d}^{k+1}&= \arg \min _{\varvec{d}}\ \omega _{s}\alpha _{1}\langle \tilde{\varvec{d}^{k}},|\varvec{d}|\rangle + \frac{\mu }{2}\Big \Vert \varvec{d}-\Big (\nabla \varvec{u}^{k}-\varvec{p}^{k}+\frac{\lambda _{1}^{k}}{\mu }\Big )\Big \Vert _{2}^{2},\\ \varvec{w}^{k+1}&= \arg \min _{\varvec{w}}\ \omega _{s}\alpha _{0}\langle \tilde{\varvec{w}^{k}},|\varvec{w}|\rangle + \frac{\mu }{2}\Big \Vert \varvec{w}-\Big (\mathcal {E}(\varvec{p}^{k})+\frac{\lambda _{2}^{k}}{\mu }\Big )\Big \Vert _{2}^{2},\\ (\varvec{u}^{k+1},\varvec{p}^{k+1})&= \arg \min _{\varvec{u},\varvec{p}}\ \frac{1}{2}\Vert \varvec{x}-\varvec{u}\Vert _{F}^{2} + \frac{\mu }{2}\Big \Vert \nabla \varvec{u}-\varvec{p}-\varvec{d}^{k+1}+\frac{\lambda _{1}^{k}}{\mu }\Big \Vert _{2}^{2} + \frac{\mu }{2}\Big \Vert \mathcal {E}(\varvec{p})-\varvec{w}^{k+1}+\frac{\lambda _{2}^{k}}{\mu }\Big \Vert _{2}^{2},\\ \lambda _{1}^{k+1}&= \lambda _{1}^{k} + \gamma \mu \big (\nabla \varvec{u}^{k+1}-\varvec{p}^{k+1}-\varvec{d}^{k+1}\big ),\\ \lambda _{2}^{k+1}&= \lambda _{2}^{k} + \gamma \mu \big (\mathcal {E}(\varvec{p}^{k+1})-\varvec{w}^{k+1}\big ), \end{aligned} \right. \end{aligned}$$
(16)

where the penalty parameter \(\mu \) is a fixed constant, and the relaxation parameter \(\gamma \) is required for the convergence of the ADMM algorithm.

To solve the above problem, we first utilize forward differences to approximate the gradient \(\nabla \) and the symmetrized derivative \(\mathcal {E}\) in the following, respectively:

$$\begin{aligned} \nabla \varvec{u} = \begin{pmatrix} D_{x}\varvec{u}\\ D_{y}\varvec{u}\end{pmatrix}, \quad \mathcal {E}(\varvec{p}) = \begin{pmatrix} D_{x}p_{1} & \frac{1}{2}(D_{y}p_{1}+D_{x}p_{2})\\ \frac{1}{2}(D_{y}p_{1}+D_{x}p_{2}) & D_{y}p_{2}\end{pmatrix}, \end{aligned}$$

where \(D_{x}\) and \(D_{y}\) are the circulant matrices corresponding to the forward finite difference operators with periodic boundary conditions along the x-axis and y-axis, respectively, and \(D = (D_{x}^{\intercal },D_{y}^{\intercal })^{\intercal }\).

The subproblems for \(\varvec{d}\) and \(\varvec{w}\) in (16) have closed-form solutions given by the shrink operator as follows:

$$\begin{aligned}&\varvec{d}^{k+1} = \mathrm{shrink}\left( \frac{\lambda ^{k}_{1}}{\mu } + \nabla {\varvec{u}^{k}} - \varvec{p}^{k},\frac{\omega _{s}\alpha _{1}\tilde{\varvec{d}^{k}}}{\mu }\right) , \end{aligned}$$
(17)
$$\begin{aligned}&\varvec{w}^{k+1} = \mathrm{shrink}\left( \frac{\lambda ^{k}_{2}}{\mu } + \mathcal {E}(\varvec{p}^{k}),\frac{\omega _{s}\alpha _{0}\tilde{\varvec{w}^{k}}}{\mu }\right) , \end{aligned}$$
(18)

where the shrink operator is defined as \(\mathrm{shrink}(\varvec{a},b) = \frac{\varvec{a}}{|\varvec{a}|}\max (|\varvec{a}|-b,0)\).
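In code, the (isotropic, vector-valued) shrink operator reads, under the convention above:

```python
import numpy as np

def shrink(a, b):
    """Soft shrinkage shrink(a, b) = a/|a| * max(|a| - b, 0), applied
    pixel-wise: `a` stacks the vector components in its last axis, and
    `b` is a scalar or per-pixel threshold (e.g., omega_s*alpha1*d_tilde/mu)."""
    b = np.asarray(b)
    mag = np.sqrt(np.sum(a ** 2, axis=-1, keepdims=True))   # per-pixel magnitude |a|
    scale = np.maximum(mag - b[..., None], 0.0) / np.maximum(mag, 1e-12)
    return a * scale
```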

The subproblems for \(\varvec{u}\) and \(\varvec{p}\) in (16) are both least-squares problems, and their corresponding normal equations are as follows:

$$\begin{aligned}&(\mu D^{\intercal }D + I)\varvec{u}^{k+1} = \varvec{x}-D^{\intercal }\lambda ^{k}_{1}\nonumber \\&\quad + \mu D^{\intercal }(\varvec{d}^{k+1} + \varvec{p}^{k+1}), \end{aligned}$$
(19)
$$\begin{aligned}&(\mathcal {E}^{*}\mathcal {E}+I)\varvec{p}^{k+1} = \mathcal {E}^{*}(\varvec{w}^{k+1}-\frac{\lambda ^{k}_{2}}{\mu })+\frac{\lambda ^{k}_{1}}{\mu }\nonumber \\&\quad +(D\varvec{u}^{k+1}-\varvec{d}^{k+1}). \end{aligned}$$
(20)

where \(\mathcal {E}^{*}\) denotes the adjoint operator of \(\mathcal {E}\).

The block matrices \(\mu D^{\intercal }D + I\) and \(\mathcal {E}^{*}\mathcal {E}+I\) can be diagonalized by the Fourier transform under the periodic boundary condition. Thus, the solutions \(\varvec{u}^{k+1}\) and \(\varvec{p}^{k+1}\) in our algorithm can be obtained explicitly and easily using the Fourier transform and the block matrix inversion formula.
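For instance, the diagonalization of the \(\mu D^{\intercal }D + I\) block might be coded as follows; the full coupled \((\varvec{u},\varvec{p})\) system is handled analogously via the block inversion formula, which this sketch omits:

```python
import numpy as np

def solve_u_block(rhs, mu):
    """Solve (mu * D^T D + I) u = rhs under periodic boundaries via FFT.
    D stacks forward differences along x and y, so D^T D is diagonalized
    by the DFT with eigenvalues |Dx_hat|^2 + |Dy_hat|^2."""
    H, W = rhs.shape
    # Forward-difference kernels u(x+1) - u(x), placed on the periodic grid.
    dx = np.zeros((H, W)); dx[0, 0], dx[0, -1] = -1.0, 1.0
    dy = np.zeros((H, W)); dy[0, 0], dy[-1, 0] = -1.0, 1.0
    denom = mu * (np.abs(np.fft.fft2(dx)) ** 2 +
                  np.abs(np.fft.fft2(dy)) ** 2) + 1.0
    return np.real(np.fft.ifft2(np.fft.fft2(rhs) / denom))
```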

Finally, the IRLA with the ADMM algorithm for the \(\varvec{u}\)-step is summarized in Algorithm 2.

[Algorithm 2: IRLA with ADMM for the \(\varvec{u}\) subproblem (13)]

Updating \(\varvec{z}\) Fixing the structural part \(\varvec{u}\) and the filters \(\{\varvec{f}_{i}\}_{i=1,\ldots ,N}\), we solve the following subproblem to obtain \(\varvec{z}\):

$$\begin{aligned} \min _{\{\varvec{z}_{i}\}}\frac{1}{2}\Vert \varvec{y}-\varvec{u}-\sum _{i=1}^{N}\varvec{F}_{i}\varvec{z}_{i}\Vert _{F}^{2} + \lambda \sum _{i=1}^{N}\Vert \varvec{z}_{i}\Vert _{1}. \end{aligned}$$
(21)

The optimization in (21) is a standard convolutional sparse coding problem, which can be solved by the ADMM algorithm in the Fourier domain to substantially reduce the computational cost [60].
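In practice, (21) can also be handed to an off-the-shelf solver; one possibility (our choice of tooling, not necessarily the authors') is the SPORCO implementation of Wohlberg's Fourier-domain ADMM [60]:

```python
import numpy as np
from sporco.admm import cbpdn

# Stand-in data: a residual image y - u and 7 random 5x5 filters
# (in our pipeline these would be the PCA-initialized dictionary).
y_minus_u = np.random.randn(128, 128)
D = np.random.randn(5, 5, 7)

opt = cbpdn.ConvBPDN.Options({'MaxMainIter': 200, 'RelStopTol': 1e-4})
solver = cbpdn.ConvBPDN(D, y_minus_u, lmbda=1e-2, opt=opt)
Z = solver.solve()                   # sparse coefficient maps, one per filter
v = solver.reconstruct().squeeze()   # textural estimate sum_i f_i * Z_i
```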

Updating \(\varvec{f}\) To recover \(\varvec{f}\), we first apply an efficient variable reordering \(\sum _{i=1}^{N}\varvec{F}_{i}\varvec{z}_{i} = \hat{\varvec{Z}}\varvec{f}\), where \(\varvec{f}\) is the vectorization of all the filters \(\varvec{f}_{i}, {i=1,\ldots ,N}\), \(\hat{\varvec{Z}} = [\hat{\varvec{Z}_{1}},\ldots ,\hat{\varvec{Z}_{i}},\ldots ,\hat{\varvec{Z}_{N}}]\), and \(\hat{\varvec{Z}_{i}}\) is generated by collecting the patches in \(\varvec{Z}_{i}\). The \(\varvec{f}\) subproblem can then be re-written in the following equivalent form:

$$\begin{aligned} \begin{aligned}&\min _{\varvec{f}}\frac{1}{2}\Vert \varvec{y}-\varvec{u}-\hat{\varvec{Z}}\varvec{f}\Vert _{F}^{2},\\&\ \ {\mathrm{s.t.}} \ \ \Vert \varvec{f}_{i}\Vert _{F}^{2}\le 1 \ \ for \ \ i=1,\ldots ,N \end{aligned} \end{aligned}$$
(22)

Following the optimization in JCAS [19], we adopt a proximal gradient descent scheme to solve (22), namely

$$\begin{aligned} \varvec{f}^{t+1} = \mathrm{Prox}_{\Vert \cdot \Vert \le 1}\left( \varvec{f}^{t}-\tau \hat{\varvec{Z}}^{\intercal }(\hat{\varvec{Z}}\varvec{f}^{t}-\varvec{y}+\varvec{u})\right) . \end{aligned}$$
(23)
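The gradient step and the projection in (23) might be implemented as below; expressing the data term through correlations with the current residual rather than forming \(\hat{\varvec{Z}}\) explicitly is an implementation choice on our part:

```python
import numpy as np
from scipy.signal import fftconvolve

def update_filters(r, Z, f, tau=1e-3, n_iter=50):
    """Sketch of (23): proximal gradient on the filters with projection onto
    the unit Frobenius ball. r = y - u, Z is a list of coefficient maps, f a
    list of s x s filters; 'valid' convolution makes fftconvolve(Z_i, f_i)
    match the size of r. tau must be small enough (<= 1/||Z_hat||^2)."""
    for _ in range(n_iter):
        e = r - sum(fftconvolve(Zi, fi, mode='valid') for Zi, fi in zip(Z, f))
        new_f = []
        for Zi, fi in zip(Z, f):
            # Gradient of 0.5*||e||^2 w.r.t. f_i is minus the correlation of
            # Z_i with the residual e, restricted to the filter support.
            g = fftconvolve(Zi[::-1, ::-1], e, mode='valid')
            fi = fi + tau * g                                # gradient descent step
            new_f.append(fi / max(np.linalg.norm(fi), 1.0))  # project onto ||f_i||_F <= 1
        f = new_f
    return f
```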

Since our proposed model is non-convex, it is important to set an appropriate initialization and update order of the variables to ensure the convergence of the algorithm. In our implementation, we initialize the textural part as an all-zero matrix and solve the \(\varvec{u}^{1}\) subproblem first. Then, we initialize the filters \(\{\varvec{f}^{1}_{i}\}_{i=1,\ldots ,N}\) as the PCA dictionary of the residual image \(\varvec{y}-\varvec{u}^{1}\). Based on \(\varvec{u}^{1}\) and \(\varvec{f}^{1}\), we recover the feature maps \(\varvec{z}^{1}\) by solving the convolutional sparse coding problem (21). These steps guarantee that texture details are estimated gradually from the residual image while the details of the structural part are kept.
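The PCA initialization of the filters might be sketched as follows; the patch stride and mean removal are our assumptions:

```python
import numpy as np

def init_pca_filters(residual, n_filters=7, s=5):
    """Initialize filters as the top principal components of s x s patches
    of the residual image y - u^1 (a sketch of the strategy above)."""
    res = residual.astype(float)
    H, W = res.shape
    patches = np.stack([res[i:i + s, j:j + s].ravel()
                        for i in range(0, H - s + 1, 2)
                        for j in range(0, W - s + 1, 2)])
    patches -= patches.mean(axis=0)                  # center the data
    _, _, Vt = np.linalg.svd(patches, full_matrices=False)
    # Principal components are unit-norm, so ||f_i||_F <= 1 holds at init.
    return [Vt[k].reshape(s, s) for k in range(n_filters)]
```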

5 Analysis

5.1 Implementation details

The proposed method is implemented in MATLAB R2018b on a Windows 10 platform with an Intel Core i7 at 3.2 GHz and 8 GB RAM. We fix the regularization parameter \(\lambda =10^{-2}\) in the objective function, as discussed in Sect. 5.4. For the parameters in Algorithm 2, the default setting is as follows: \(\alpha _{0} = 0.1\), \(\alpha _{1} = 0.2\), \(\mu = 10\), \(\gamma = 1.618\), \(\rho _{1} = 0.005\), \(\rho _{2} = 0.2\) and \(\epsilon _{3} = 5\times 10^{-4}\). For \(\{\varvec{f}_{i}\}_{i=1,\ldots ,N}\), we fix the size and number of filters to \(5\times 5\) and 7, respectively. To calculate \(\omega _{s}\), we set the balance parameter t to 0.6, the patch size of \(M_{s}\) to 3 and the neighborhood size of \(M_{t}\) to 6. Since our idea is to remove textural edges from the structural image gradually, we update \(M_{s}\) during the iterations and gradually reduce the impact of the second term by decreasing t in \(\omega _{s}\) of (11). In our implementation, \(M_{t}\) is calculated once from the input, while \(M_{s}\) is re-estimated from the updated structural result of the previous step.

5.2 The detail-preserving property of our model

It is worth noting how our model removes textural edges while keeping the sharp features of the structural component. Firstly, the weight function \(\omega _{s}\) plays an important role in locating structural and textural edges, providing good guidance for our model to distinguish image structures from textures. Secondly, the non-convex TGV constructs smooth regions without staircasing effects while preserving edges and details, especially for images with rich structures and strong edges or complex textures. Based on the above, our model can better preserve details under the guidance of \(\omega _{s}\). Here, we compare our proposed method with the two most related works. Lu et al. [38] adopted the convex TGV to capture structural edges, and Gu et al. [19] employed synthesis sparse representation to capture textural edges; each method focuses on recovering either structural or textural edges, but cannot pursue both. We take both advantages into account and additionally use the weight function \(\omega _{s}\) to guide a good image decomposition. A visual example is given in Fig. 1, where the regions in red boxes are highlighted. The input images are given in Fig. 1a, and Fig. 1b–e show the image structures from different methods. From Fig. 1b, we can see that TGV-TGV\(^{*}\) cannot remove the textural edges completely from the structural component because it ignores the repetition property of textures. For JCAS in Fig. 1c, small-scale structures are blurred or even removed because of the staircase effects of TV, as shown in the highlighted parts. Figure 1d shows the results of our decomposition model without \(\omega _{s}\): NTGV greatly improves the detail-preserving property compared with JCAS. The best results are achieved by adding the weight function \(\omega _{s}\), shown in Fig. 1e, from which we can observe that the structural edges are much sharper and the textural edges are removed cleanly.

Fig. 1

Illustration on the detail-preserving property of our model. a input images; b image structures obtained by TGV–TGV\(^{*}\) [38] where TGV regularization and its dual TGV\(^{*}\) were adopted to model structural and textural component, respectively; c image structures obtained by JCAS [19] where ASR and CSC priors were integrated to separate two image layers, respectively; d image structures obtained by our decomposition model without \(\omega _{s}\); e image structures obtained by our proposed model. Our proposed model achieves more accurate and sharper structures and removes multiple-scale or extremely varying textures

5.3 Convergence analysis

Since the three unknowns \(\varvec{u},\varvec{f},\varvec{z}\) are coupled in (6), it is effective to apply an inner alternating minimization scheme that decreases the energy monotonically. However, it should be mentioned that such an alternating scheme does not guarantee that the whole sequence converges to the minimizer of the original problem. For fixed \(\varvec{f},\varvec{z}\), problem (14) is non-convex in \(\varvec{u}\), and we can obtain a partial convergence of the IRLA with the ADMM algorithm in (15) (see [25, 43, 44] for more details). That is, the sequence \(\{\varvec{u}^{k}, \varvec{p}^{k}, \varvec{d}^{k}, \varvec{w}^{k}\}\) generated by (15) is bounded and has at least one accumulation point. However, since the log-function in (14) is non-convex, we cannot ensure the global convergence of the algorithm in (15). For fixed \(\varvec{u},\varvec{f}\), the problem is convex in \(\varvec{z}\), and for fixed \(\varvec{u},\varvec{z}\), the problem is convex in \(\varvec{f}\). Since our objective function is bounded below by 0, the \(\varvec{f}\) and \(\varvec{z}\) subproblems are guaranteed to converge to their respective minima.

Fig. 2

Illustration on the convergence of Algorithm 1. Our algorithm gradually separates textures from the input image and converges to a fixed point within a few iterations

Here, we compute the average per-pixel intensity differences during the iterations. Figure 2a shows how the differences \(\Vert \varvec{u}^{k+1} - \varvec{u}^{k}\Vert ^{2}\) and \(\Vert \varvec{y}-\varvec{u}^{k}-\varvec{v}^{k}\Vert ^{2}\) change over the iterations. From the figure, we can see that our algorithm converges to a fixed point within a few iterations, and both differences are less than \(1.0\times 10^{-5}\) after 10 iterations. Meanwhile, Fig. 2b–d shows that textural edges are gradually removed from the structural component during the iterations.

5.4 Parameters impact

The regularization parameter \(\lambda \), which controls the sparsity of repetitive textures, is important for distinguishing structural edges from textural edges. In other words, a larger value of \(\lambda \) produces sparser textures and weaker smoothing effects (and vice versa). We show the resulting structural image \(\varvec{u}\) for various choices of \(\lambda \) in Fig. 3. The number of iterations K is fixed to 12 and the feature maps are initialized with zeros in all cases. As can be observed in Fig. 3, textures are removed from the structural part gradually as \(\lambda \) decreases. Unlike JCAS [19], our method avoids over-smoothing the structural part thanks to the weight function \(\omega _{s}\) and achieves almost the same results even when \(\lambda \) is less than \(10^{-2}\), as shown in Fig. 3e, f. Therefore, we fix \(\lambda \) to \(10^{-2}\) for almost all examples. Furthermore, compared with STD [76] in Fig. 3b, both results in Fig. 3e, f contain sharper details and overcome the staircase effect of TV regularization.

Fig. 3

Impact of parameter \(\lambda \). a Input image; b structural image obtained by STD [76]; c–f structural images obtained by our method with various choices of \(\lambda \). A smaller value of \(\lambda \) removes more textures, while almost identical results are achieved when \(\lambda \) is less than or equal to \(10^{-2}\)

Table 1 Average running time of different methods on 80 images of size \(350\times 350\)

5.5 Runtime

Our algorithm is an alternating minimization scheme updating \(\varvec{u}\), \(\varvec{f}\) and \(\varvec{z}\), and the most time-consuming part is the \(\varvec{u}\) update, which consists of four subproblems. We compare the runtime of our method with those of several decomposition methods, including RTV [66], JCAS [19], IUL [15], SWF [70] and STD [76], and measure the runtime of the proposed method using the gputimeit function built into MATLAB. We select 80 images of the same size \(350\times 350\) and record the average running time over these images on the same computing machine in Table 1. Among the methods, IUL is the fastest thanks to its single forward propagation, and RTV is second since it only optimizes the structure. The efficiency of STD depends on the initialization of the network: the scheme using standard initialization is time-consuming, while the one using a pre-trained MU-net is comparable. Similar to traditional filters, SWF needs to perform calculations over multiple windows, so its computational cost is higher. JCAS and our method are both time-consuming, mainly because both estimate the structural and textural images via an alternating minimization scheme. However, both methods converge quickly within several iterations, so we set the maximum number of iterations to 10. With code optimization and a C implementation, the computational speed could be improved significantly.

6 Applications

In this section, we demonstrate the effectiveness of our approach on several image decomposition applications, including texture removal, HDR tone mapping, detail manipulation and non-photorealistic abstraction. Meanwhile, we compare the proposed method with several state-of-the-art methods in each application. We implemented the code of LowRank [16] ourselves, while the results of the other competing methods are obtained from the source codes or already-trained models released by the original authors, with parameters set to be optimal according to the original papers.

6.1 Texture removal

Texture removal is a direct application of image decomposition, extensively used in image analysis, recognition and composition. To demonstrate the performance of our proposed model on texture removal, we compare it with seven popular methods. Among these competitors, RTV, LowRank and JCAS are variational models, RGF and SWF are filtering-based methods, and IUL and STD are two unsupervised DNN-based methods. We use test images containing different kinds of textures in Fig. 4, and the comparison results of the different methods are presented in Figs. 5 and 6.

Fig. 4

Test images for the texture removal

Fig. 5

Texture removal results from different methods in Fig. 4a. For each method, the first row is the result of the texture removal and the second row is the result of the corresponding texture. The corresponding zoomed parts are shown with enlargements in red rectangles. Our method effectively preserves structural edges (first row), removes textural edges (second row) and does not suffer from staircasing artifacts

Fig. 6

More comparisons of the texture removal in Fig. 4b–c. The corresponding zoomed parts are shown with enlargements in red rectangles. Our method removes textural edges from image structures clearly and keeps much sharper structural edges

Fig. 7

Comparison of several HDR tone-mapping methods. The corresponding zoomed parts are shown with enlargements in red rectangles. Our method achieves the best balance between detail-preserving and naturalness

In Fig. 5, we present both the structural and textural components to visualize where textures are removed. The corresponding zoomed parts are shown with enlargements in red rectangles. One can observe that, for most competitors, some structural edges are treated as textural edges and removed into the textural components. For example, the abdomen region of the man and the eye region of the horse are highlighted in Fig. 5. Visually, both parts belong to structures, but they are removed from the structures, leading to the smooth and blurred results in Fig. 5a–g. In Fig. 5h, our method removes only the square patterns and effectively preserves these structural edges.

For the other examples, we only exhibit the texture removal results in Fig. 6 to save space. From Fig. 6a, we observe that RTV always fails to distinguish small-scale structures from textures and blurs the structural parts. The results of the LowRank method always contain textures at different scales and blocking artifacts, indicating that low-rank theory alone is insufficient to extract complex textures, as illustrated in Fig. 6b. Suffering from the staircase effect of TV, the JCAS and STD methods always remove small-scale structures and produce piecewise constant results, as shown in Fig. 6c, e. The drawback of the SWF method is that some complex textures surrounding structural edges are incorrectly preserved, as shown in Fig. 6d. Our results are exhibited in Fig. 6f, from which we can observe that image textures are removed cleanly from the image structures and structural edges are separated clearly and sharply.

6.2 HDR tone mapping

Tone mapping aims to compress the intensity of a high dynamic range (HDR) image while preserving its details and colors. To apply our method to the HDR tone mapping problem, we first compute the luminance image as \(\mathbf {l}^{in} = 0.299 \mathbf {r}^{in} + 0.587 \mathbf {g}^{in} + 0.114\mathbf {b}^{in}\); then, we apply our model to decompose the logarithm of the luminance \(\mathbf {l}^{in}\) into base and detail components. The output luminance \(\mathbf {l}^{out}\) is obtained as the detail component plus the base component compressed with a factor of 0.4. Finally, we adopt the color restoration of [18] to reproduce the chrominance information.
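This pipeline can be summarized by the following sketch; `decompose` stands for the NTGV + CSC solver above, and the simple exponent-based color restoration with parameter `sat` is a common stand-in for [18], both assumptions on our part:

```python
import numpy as np

def tone_map(rgb, decompose, compress=0.4, sat=0.6, eps=1e-6):
    """HDR tone mapping via structure-texture decomposition of log-luminance."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    lum = 0.299 * r + 0.587 * g + 0.114 * b
    base, detail = decompose(np.log(lum + eps))  # base = structure, detail = texture
    l_out = np.exp(compress * base + detail)     # compress only the base layer
    # Simple color restoration: scale chrominance ratios (stand-in for [18]).
    ratio = (rgb / (lum[..., None] + eps)) ** sat
    return np.clip(ratio * l_out[..., None], 0.0, 1.0)
```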

We compare our method with state-of-the-art tone mapping methods, including filtering-based methods [13, 17], an optimization-based method [19] and a DNN-based method [76]. With a better image decomposition result, tone-mapped images can better preserve structural details while remaining natural-looking. Visual examples of tone mapping results are provided in Fig. 7. As can be seen, the results of Durand and WLS are over-enhanced and look unnatural, as shown in Fig. 7b, c. By contrast, the results from JCAS and STD do not suffer from this problem, but they always lose structural details in relatively dark regions, as shown in Fig. 7d, e. In comparison with the competitors, the results from our method achieve the best balance between detail preservation and naturalness, as shown in Fig. 7f. For example, the cracking bark and the leaves with white dots in the red rectangles are recovered well. In addition, we use the tone-mapped image quality index (TMQI) [69] to compare the different methods quantitatively on a subject-rated database of 15 HDR images. From Table 2, one can see that our method achieves the highest TMQI values on most images (10 out of 15).

Table 2 TMQI values of different methods on a subject-rated image database [69]
Fig. 8

Results of detail enhancement from different methods. The corresponding zoomed parts are shown with enlargements in red rectangles. Our method can enhance the details in a better way while preserving the structural edges

Fig. 9

Results of non-photorealistic abstraction from different methods. The corresponding zoomed parts are shown with enlargements in red rectangles. Our result fits non-photorealistic abstraction with simultaneous edge highlighting and texture suppression

6.3 Detail enhancement

Given an image decomposition result, detail enhancement can be achieved by superimposing the enhanced textural component on the structural component. Following the implementation of SWF [70] for detail enhancement, we obtain the enhanced image by \(\varvec{y}_{e} = \varvec{u} + (\alpha +1)\varvec{v}\), where \(\alpha \) is an amplification parameter fixed to 5 in all experiments of this application. Here, the compared methods include RTV [66], JCAS [19], EGIF [40] and SWF [70]. If the structural and textural parts are separated in a satisfactory way, the edges of image details become much sharper, while structural edges are protected from over-sharpening or overshooting. On the contrary, if structural edges are extracted into the textural part, they are enhanced like textural edges and halo artifacts appear.
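Given a decomposition \((\varvec{u},\varvec{v})\), the boosting step is a one-liner; the clipping to an 8-bit display range is our addition:

```python
import numpy as np

def enhance_details(u, v, alpha=5.0):
    """Detail enhancement y_e = u + (alpha + 1) * v, following the SWF setup."""
    return np.clip(u + (alpha + 1.0) * v, 0.0, 255.0)
```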

Visual examples of detail enhancement are provided in Fig. 8. As can be observed, in the results from RTV, JCAS and EGIF shown in Fig. 8b, c and d, respectively, the structural edges of the leaves are over-enhanced because they are incorrectly separated into the textural component. SWF does not suffer from this problem, but its detail edges in Fig. 8e are under-enhanced because the details that should be enhanced are retained in the structural component. In comparison with the competitors, our method enhances the details in a better way while preserving the structural edges, as shown in Fig. 8f.

6.4 Non-photorealistic abstraction

Non-photorealistic abstraction aims to decrease the complexity of the scene while protecting important structures. Following the traditional methods [11, 59], we perform edge extraction with the difference-of-Gaussians (DoG) on the structural component and highlight the extracted edges to achieve a non-photorealistic abstraction effect. If structural edges are incorrectly removed into the textural component, the structural component becomes too over-smoothed for edges to be detected in the abstraction results. Meanwhile, if textural edges are incorrectly preserved in the structural component, redundant edge maps appear in the abstraction results.

Figure 9 shows visual examples of non-photorealistic abstraction results. It can be seen that, for JCAS and IUL, shown in Fig. 9b, c, respectively, the colors and edges are over-smoothed. For SWF, the textural edges retained in the structural component lead to messy edges in the abstraction results, as shown in Fig. 9d. For STD, the structural edges are sharper than the above, but several structural details are still lost, such as the white petals shown in Fig. 9e. In contrast, our method removes the repetitive textures from the structural component without smoothing the structural edges, such as the colorful leaves shown in Fig. 9f. Furthermore, we recover the white petals and blue flower cores in a better way.

7 Conclusion

In this paper, we have presented an efficient structure–texture image decomposition model that takes advantage of both NTGV and CSC to separate complex textures from a richly detailed structural component. Meanwhile, a weight function is incorporated to better distinguish structural from textural edges. We solve the model with an alternating minimization algorithm based on ADMM and give a comprehensive analysis of the proposed algorithm. The experimental results demonstrate the effectiveness of our approach on several applications, including texture removal, HDR image tone mapping, detail enhancement and non-photorealistic abstraction.

In the future, we hope to increase the efficiency of our algorithm by GPU programming or DNNs. Extending these techniques to decompose digital images of wall paintings affected by various forms of deterioration will also be future work.