1 Introduction

Single image super-resolution (SR) is an active research topic with several important applications. It aims to enhance the resolution of a given image by adding missing high-frequency information. Super-resolution is therefore a fundamentally ill-posed problem. In fact, for a given low-resolution (LR) image, there exist infinitely many compatible high-resolution (HR) predictions. This poses severe challenges when designing deep learning based super-resolution approaches.

Initial deep learning approaches  [11, 12, 19, 21, 23] employ feed-forward architectures trained using standard \(L_2\) or \(L_1\) reconstruction losses. While these methods achieve impressive PSNR, they tend to generate blurry predictions. This shortcoming stems from discarding the ill-posed nature of the SR problem. The employed \(L_2\) and \(L_1\) reconstruction losses favor the prediction of an average over the plausible HR solutions, leading to the significant reduction of high-frequency details. To address this problem, more recent approaches  [2, 15, 22, 38, 46, 53] integrate adversarial training and perceptual loss functions. While achieving sharper images with better perceptual quality, such methods only predict a single SR output, which does not fully account for the ill-posed nature of the SR problem.

We address the limitations of the aforementioned approaches by learning the conditional distribution of plausible HR images given the input LR image. To this end, we design a conditional normalizing flow [10, 37] architecture for image super-resolution. Thanks to the exact log-likelihood training enabled by the flow formulation, our approach can model expressive distributions over the HR image space. This allows our network to learn the generation of photo-realistic SR images that are consistent with the input LR image, without any additional constraints or losses. Given an LR image, our approach can sample multiple diverse SR images from the learned distribution. In contrast to conventional methods, our network can thus explore the space of SR images (see Fig. 1).

Fig. 1. While prior work trains a deterministic mapping, SRFlow learns the distribution of photo-realistic HR images for a given LR image. This allows us to explicitly account for the ill-posed nature of the SR problem, and to sample diverse images. (\(8\times \) upscaling)

Compared to standard Generative Adversarial Network (GAN) based SR approaches  [22, 46], the proposed flow-based solution exhibits a few key advantages. First, our method naturally learns to generate diverse SR samples without suffering from mode-collapse, which is particularly problematic in the conditional GAN setting [17, 29]. Second, while GAN-based SR networks require multiple losses with careful parameter tuning, our network is stably trained with a single loss: the negative log-likelihood. Third, the flow network employs a fully invertible encoder, capable of mapping any input HR image to the latent flow-space and ensuring exact reconstruction. This allows us to develop powerful image manipulation techniques for editing the predicted SR or any existing HR image.

Contributions: We propose SRFlow, a flow-based super-resolution network capable of accurately learning the distribution of realistic HR images corresponding to the input LR image. In particular, the main contributions of this work are as follows: (i) We are the first to design a conditional normalizing flow architecture that achieves state-of-the-art super-resolution quality. (ii) We harness the strong HR distribution learned by SRFlow to develop novel techniques for controlled image manipulation and editing. (iii) Although only trained for super-resolution, we show that SRFlow is capable of image denoising and restoration. (iv) Comprehensive experiments for face and general image super-resolution show that our approach outperforms state-of-the-art GAN-based methods for both perceptual and reconstruction-based metrics.

2 Related Work

Single Image SR: Super-resolution has long been a fundamental challenge in computer vision due to its ill-posed nature. Early learning-based methods mainly employed sparse coding based techniques  [8, 41, 51, 52] or local linear regression  [43, 45, 49]. The effectiveness of example-based deep learning for super-resolution was first demonstrated by SRCNN  [11], which further led to the development of more effective network architectures  [12, 19, 21, 23]. However, these methods do not reproduce the sharp details present in natural images due to their reliance on \(L_2\) and \(L_1\) reconstruction losses. This was addressed in URDGN  [53], SRGAN  [22] and more recent approaches  [2, 15, 38, 46] by adopting a conditional GAN based architecture and training strategy. While these works aim to predict one example, we undertake the more ambitious goal of learning the distribution of all plausible reconstructions from the natural image manifold.

Stochastic SR: The problem of generating diverse super-resolutions has received relatively little attention. This is partly due to the challenging nature of the problem. While GANs provide a method for learning a distribution over data [14], conditional GANs are known to be extremely susceptible to mode collapse since they easily learn to ignore the stochastic input signal [17, 29]. Therefore, most conditional GAN based approaches for super-resolution and image-to-image translation resort to purely deterministic mappings [22, 35, 46]. A few recent works [4, 7, 30] address GAN-based stochastic SR by exploring techniques to avoid mode collapse and explicitly enforcing low-resolution consistency. In contrast to those works, we design a flow-based architecture trained using the negative log-likelihood loss. This allows us to learn the conditional distribution of HR images, without any additional constraints, losses, or post-processing techniques to enforce low-resolution consistency. A different line of research [6, 39, 40] exploits the internal patch recurrence by only training the network on the input image itself. Recently, [39] employed this strategy to learn a GAN capable of stochastic SR generation. While this is an interesting direction, our goal is to exploit large image datasets to learn a general distribution over the image space.

Normalizing Flow: Generative modelling of natural images poses major challenges due to the high dimensionality and complex structure of the underlying data distribution. While GANs [14] have been explored for several vision tasks, Normalizing Flow based models [9, 10, 20, 37] have received much less attention. These approaches parametrize a complex distribution \(p_\mathbf {y}(\mathbf {y}|{\varvec{\theta }})\) using an invertible neural network \(f_{\varvec{\theta }}\), which maps samples drawn from a simple (e.g. Gaussian) distribution \(p_{\mathbf {z}}(\mathbf {z})\) as \(\mathbf {y}=f^{-1}_{\varvec{\theta }}(\mathbf {z})\). This allows the exact negative log-likelihood \(-\log p_\mathbf {y}(\mathbf {y}|{\varvec{\theta }})\) to be computed by applying the change-of-variable formula. The network can thus be trained by directly minimizing the negative log-likelihood using standard SGD-based techniques. Recent works have investigated conditional flow models for point cloud generation [36, 50] as well as class [24] and image [3, 48] conditional generation of images. The latter works [3, 48] adapt the widely successful Glow architecture [20] to conditional image generation by concatenating the encoded conditioning variable in the affine coupling layers [9, 10]. The concurrent work [48] considers the SR task as an example application, but only addresses \(2\times \) magnification and does not compare with state-of-the-art GAN-based methods. While we also employ the conditional flow paradigm for its theoretically appealing properties, our work differs from these previous approaches in several aspects. First, our work is the first to develop a conditional flow architecture for SR that provides favorable or superior results compared to state-of-the-art GAN-based methods. Second, we develop powerful flow-based image manipulation techniques, applicable for guided SR and to editing existing HR images. Third, we introduce new training and architectural considerations. Lastly, we demonstrate the generality and strength of our learned image posterior by applying SRFlow to image restoration tasks, unseen during training.

3 Proposed Method: SRFlow

We formulate super-resolution as the problem of learning a conditional probability distribution over high-resolution images, given an input low-resolution image. This approach explicitly addresses the ill-posed nature of the SR problem by aiming to capture the full diversity of possible SR images from the natural image manifold. To this end, we design a conditional normalizing flow architecture, allowing us to learn rich distributions using exact log-likelihood based training.

3.1 Conditional Normalizing Flows for Super-Resolution

The goal of super-resolution is to predict higher-resolution versions \(\mathbf {y}\) of a given low-resolution image \(\mathbf {x}\) by generating the absent high-frequency details. While most current approaches learn a deterministic mapping \(\mathbf {x}\mapsto \mathbf {y}\), we aim to capture the full conditional distribution \(p_{\mathbf {y}|\mathbf {x}}(\mathbf {y}| \mathbf {x}, {\varvec{\theta }})\) of natural HR images \(\mathbf {y}\) corresponding to the LR image \(\mathbf {x}\). This constitutes a more challenging task, since the model must span a variety of possible HR images, instead of just predicting a single SR output. Our intention is to train the parameters \({\varvec{\theta }}\) of the distribution in a purely data-driven manner, given a large set of LR-HR training pairs \(\{(\mathbf {x}_i, \mathbf {y}_i)\}_{i=1}^M\).

The core idea of normalizing flow [9, 37] is to parametrize the distribution \(p_{\mathbf {y}|\mathbf {x}}\) using an invertible neural network \(f_{{\varvec{\theta }}}\). In the conditional setting, \(f_{{\varvec{\theta }}}\) maps an HR-LR image pair to a latent variable \(\mathbf {z}= f_{{\varvec{\theta }}}(\mathbf {y}; \mathbf {x})\). We require this function to be invertible w.r.t. the first argument \(\mathbf {y}\) for any LR image \(\mathbf {x}\). That is, the HR image \(\mathbf {y}\) can always be exactly reconstructed from the latent encoding \(\mathbf {z}\) as \(\mathbf {y}= f_{{\varvec{\theta }}}^{-1}(\mathbf {z}; \mathbf {x})\). By postulating a simple distribution \(p_{\mathbf {z}}(\mathbf {z})\) (e.g. a Gaussian) in the latent space \(\mathbf {z}\), the conditional distribution \(p_{\mathbf {y}|\mathbf {x}}(\mathbf {y}| \mathbf {x}, {\varvec{\theta }})\) is implicitly defined by the mapping \(\mathbf {y}= f_{{\varvec{\theta }}}^{-1}(\mathbf {z}; \mathbf {x})\) of samples \(\mathbf {z}\sim p_{\mathbf {z}}\). The key aspect of normalizing flows is that the probability density \(p_{\mathbf {y}|\mathbf {x}}\) can be explicitly computed as,

$$\begin{aligned} p_{\mathbf {y}|\mathbf {x}}(\mathbf {y}| \mathbf {x}, {\varvec{\theta }}) = p_{\mathbf {z}}\big (f_{{\varvec{\theta }}}(\mathbf {y}; \mathbf {x})\big ) \left| \det \frac{\partial f_{{\varvec{\theta }}}}{\partial \mathbf {y}}(\mathbf {y}; \mathbf {x}) \right| \,. \end{aligned}$$
(1)

It is derived by applying the change-of-variables formula for densities, where the second factor is the resulting volume scaling given by the determinant of the Jacobian \(\frac{\partial f_{{\varvec{\theta }}}}{\partial \mathbf {y}}\). The expression (1) allows us to train the network by minimizing the negative log-likelihood (NLL) for training sample pairs \((\mathbf {x}, \mathbf {y})\),

$$\begin{aligned} \mathcal {L}({\varvec{\theta }}; \mathbf {x}, \mathbf {y}) = - \log p_{\mathbf {y}|\mathbf {x}}(\mathbf {y}| \mathbf {x}, {\varvec{\theta }}) = -\log p_{\mathbf {z}}\big (f_{{\varvec{\theta }}}(\mathbf {y}; \mathbf {x})\big ) - \log \left| \det \frac{\partial f_{{\varvec{\theta }}}}{\partial \mathbf {y}}(\mathbf {y}; \mathbf {x}) \right| \,. \end{aligned}$$
(2)

HR image samples \(\mathbf {y}\) from the learned distribution \(p_{\mathbf {y}|\mathbf {x}}(\mathbf {y}| \mathbf {x}, {\varvec{\theta }})\) are generated by applying the inverse network \(\mathbf {y}= f_{{\varvec{\theta }}}^{-1}(\mathbf {z}; \mathbf {x})\) to random latent variables \(\mathbf {z}\sim p_{\mathbf {z}}\).
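As a sanity check of (1) and (2), the following minimal sketch uses a hypothetical scalar conditional flow \(f(y; x) = (y - a x)/s\). The parameters `A` and `S` and all function names are illustrative assumptions, not part of SRFlow. For this toy flow the model density is the Gaussian \(\mathcal {N}(a x, s^2)\), so the NLL obtained from (2) can be compared against the closed form, and samples are drawn by pushing \(z \sim \mathcal {N}(0, 1)\) through the inverse.

```python
import math
import random

# Hypothetical toy parameters theta = (A, S); chosen only for illustration.
A, S = 2.0, 0.5

def flow(y, x):
    """z = f(y; x): invertible in y for every conditioning value x."""
    return (y - A * x) / S

def flow_inverse(z, x):
    """y = f^{-1}(z; x)."""
    return A * x + S * z

def log_std_normal(z):
    """Log-density of the standard Gaussian latent prior p_z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def nll(y, x):
    """Eq. (2): -log p_z(f(y; x)) - log|det df/dy|, where df/dy = 1/S here."""
    z = flow(y, x)
    log_det = -math.log(S)
    return -log_std_normal(z) - log_det

def sample(x, rng):
    """Draw y ~ p(y | x) by mapping z ~ N(0, 1) through the inverse flow."""
    return flow_inverse(rng.gauss(0.0, 1.0), x)
```

Calling `nll(1.1, 0.3)` returns the same value as the negative log-density of \(\mathcal {N}(0.6, 0.25)\) evaluated at 1.1, which is the check that the change-of-variables term has the right sign.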

In order to achieve a tractable expression of the second term in (2), the neural network \(f_{{\varvec{\theta }}}\) is decomposed into a sequence of N invertible layers \(\mathbf {h}^{n+1} = f_{{\varvec{\theta }}}^n(\mathbf {h}^{n}; g_{{\varvec{\theta }}}(\mathbf {x}))\), where \(\mathbf {h}^0 = \mathbf {y}\) and \(\mathbf {h}^N = \mathbf {z}\). The LR image is first encoded by a shared deep CNN \(g_{{\varvec{\theta }}}(\mathbf {x})\) that extracts a rich representation suitable for conditioning in all flow-layers, as detailed in Sect. 3.3. By applying the chain rule along with the multiplicative property of the determinant [10], the NLL objective in (2) can be expressed as

$$\begin{aligned} \mathcal {L}({\varvec{\theta }}; \mathbf {x}, \mathbf {y}) = -\log p_{\mathbf {z}}(\mathbf {z}) - \sum _{n=0}^{N-1} \log \left| \det \frac{\partial f_{{\varvec{\theta }}}^n}{\partial \mathbf {h}^{n}}(\mathbf {h}^{n}; g_{{\varvec{\theta }}}(\mathbf {x})) \right| \,. \end{aligned}$$
(3)

We thus only need to compute the log-determinant of the Jacobian \(\frac{\partial f_{{\varvec{\theta }}}^n}{\partial \mathbf {h}^{n}}\) for each individual flow-layer \(f_{{\varvec{\theta }}}^n\). To ensure efficient training and inference, the flow layers \(f_{{\varvec{\theta }}}^n\) must therefore allow efficient inversion and a tractable Jacobian determinant. This is further discussed next, where we detail the employed conditional flow layers \(f_{{\varvec{\theta }}}^n\) in our SR architecture. Our overall network architecture for flow-based super-resolution is depicted in Fig. 2.
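The layer-wise form (3) can be illustrated with a short NumPy sketch. It assumes hypothetical element-wise affine layers whose log-scale and bias are linear in the conditioning vector `u` (standing in for \(g_{{\varvec{\theta }}}(\mathbf {x})\)); `make_layer` and `nll_layered` are illustrative names. The point is only that each layer returns its own log-determinant and the loss accumulates their sum.

```python
import numpy as np

def make_layer(scale_w, bias_w):
    """Hypothetical conditional layer h -> exp(s(u)) * h + b(u) with linear s and b."""
    def forward(h, u):
        log_s = scale_w * u
        h_next = np.exp(log_s) * h + bias_w * u
        # The Jacobian is diagonal, so its log|det| is the sum of the log-scales.
        return h_next, np.sum(log_s)
    return forward

def nll_layered(y, u, layers):
    """Eq. (3): -log p_z(z) - sum_n log|det d f^n / d h^n| for a standard Gaussian p_z."""
    h, total_log_det = y, 0.0
    for layer in layers:
        h, log_det = layer(h, u)
        total_log_det += log_det
    z = h
    log_pz = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))
    return -log_pz - total_log_det, z
```

Because every layer reports its log-determinant locally, no Jacobian of the full network ever has to be formed.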

3.2 Conditional Flow Layers

The design of flow-layers \(f_{{\varvec{\theta }}}^n\) requires care in order to ensure a well-conditioned inverse and a tractable Jacobian determinant. This challenge was first addressed in  [9, 10] and has recently spurred significant interest  [5, 13, 20]. We start from the unconditional Glow architecture  [20], which is itself based on the RealNVP  [10]. The flow layers employed in these architectures can be made conditional in a straight-forward manner  [3, 48]. We briefly review them here along with our introduced Affine Injector layer.

Conditional Affine Coupling: The affine coupling layer  [9, 10] provides a simple and powerful strategy for constructing flow-layers that are easily invertible. It is trivially extended to the conditional setting as follows,

$$\begin{aligned} \mathbf {h}^{n+1}_A = \mathbf {h}^n_A \;,\qquad \mathbf {h}^{n+1}_B = \exp \big (f_{{\varvec{\theta }},\text {s}}^n(\mathbf {h}^n_A; \mathbf {u})\big ) \cdot \mathbf {h}^n_B + f_{{\varvec{\theta }},\text {b}}^n(\mathbf {h}^n_A; \mathbf {u}) \,. \end{aligned}$$
(4)

Here, \(\mathbf {h}^n = (\mathbf {h}^n_A, \mathbf {h}^n_B)\) is a partition of the activation map in the channel dimension. Moreover, \(\mathbf {u}\) is the conditioning variable, set to the encoded LR image \(\mathbf {u}= g_{{\varvec{\theta }}}(\mathbf {x})\) in our work. Note that \(f_{{\varvec{\theta }},\text {s}}^n\) and \(f_{{\varvec{\theta }},\text {b}}^n\) represent arbitrary neural networks that generate the scaling and bias of \(\mathbf {h}^n_B\). The Jacobian of (4) is triangular, enabling the efficient computation of its log-determinant as \(\sum _{ijk} f_{{\varvec{\theta }},\text {s}}^n(\mathbf {h}^n_A; \mathbf {u})_{ijk}\).
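A minimal NumPy sketch of the conditional affine coupling (4) follows. The networks \(f_{{\varvec{\theta }},\text {s}}^n\) and \(f_{{\varvec{\theta }},\text {b}}^n\) are replaced by a single per-pixel linear map on the concatenation of \(\mathbf {h}^n_A\) and \(\mathbf {u}\), with a `tanh` bounding the log-scale. Both choices, the `params` dictionary and the function names are assumptions made for illustration and are not the layers used in SRFlow.

```python
import numpy as np

def coupling_nets(h_a, u, params):
    """Stand-in for f_s and f_b: per-pixel linear maps on [h_A, u] (assumption)."""
    feats = np.concatenate([h_a, u], axis=0)                  # (C_a + C_u, H, W)
    log_s = np.tanh(np.einsum("oc,chw->ohw", params["w_s"], feats))
    bias = np.einsum("oc,chw->ohw", params["w_b"], feats)
    return log_s, bias

def coupling_forward(h, u, params):
    """Eq. (4): h_A passes through unchanged, h_B is scaled and shifted."""
    c = h.shape[0] // 2
    h_a, h_b = h[:c], h[c:]
    log_s, bias = coupling_nets(h_a, u, params)
    out = np.concatenate([h_a, np.exp(log_s) * h_b + bias], axis=0)
    return out, log_s.sum()                                   # log|det| = sum_ijk f_s

def coupling_inverse(h_next, u, params):
    """Exact inverse: h_A is available unchanged, so f_s and f_b can be recomputed."""
    c = h_next.shape[0] // 2
    h_a, h_b_next = h_next[:c], h_next[c:]
    log_s, bias = coupling_nets(h_a, u, params)
    h_b = (h_b_next - bias) * np.exp(-log_s)
    return np.concatenate([h_a, h_b], axis=0)
```

The inverse never needs to invert the stand-in networks themselves, which is why they can be arbitrary.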

Fig. 2. SRFlow’s conditional normalizing flow architecture. Our model consists of an invertible flow network \(f_{{\varvec{\theta }}}\), conditioned on an encoding (green) of the low-resolution image. The flow network operates at multiple scale levels (gray). The input is processed through a series of flow-steps (blue), each consisting of four different layers. Through exact log-likelihood training, our network learns to transform a Gaussian density \(p_{\mathbf {z}}(\mathbf {z})\) to the conditional HR-image distribution \(p_{\mathbf {y}|\mathbf {x}}(\mathbf {y}|\mathbf {x},{\varvec{\theta }})\). During training, an LR-HR \((\mathbf {x},\mathbf {y})\) image pair is input in order to compute the negative log-likelihood loss. During inference, the network operates in the reverse direction by inputting the LR image along with random variables \(\mathbf {z}= (\mathbf {z}_l)_{l=1}^L \sim p_{\mathbf {z}}\), which generates sample SR images from the learned distribution \(p_{\mathbf {y}|\mathbf {x}}\). (Color figure online)

Invertible \(1 \times 1\) Convolution: General convolutional layers are often intractable to invert or evaluate the determinant of. However, [20] demonstrated that a \(1 \times 1\) convolution \(\mathbf {h}^{n+1}_{ij} = W \mathbf {h}^{n}_{ij}\) can be efficiently integrated since it acts on each spatial coordinate (ij) independently, which leads to a block-diagonal structure. We use the non-factorized formulation in [20].
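A short NumPy sketch of the non-factorized invertible \(1 \times 1\) convolution follows. As in Glow [20], the log-determinant of the full layer is \(H \cdot W \cdot \log |\det W|\), because the same matrix acts on each of the \(H \cdot W\) pixels. The function names are illustrative.

```python
import numpy as np

def conv1x1_forward(h, weight):
    """h has shape (C, H, W); applies h_ij -> W h_ij at every spatial position."""
    _, height, width = h.shape
    out = np.einsum("oc,chw->ohw", weight, h)
    _, log_abs_det = np.linalg.slogdet(weight)
    # Block-diagonal Jacobian: one copy of W per pixel.
    return out, height * width * log_abs_det

def conv1x1_inverse(h_next, weight):
    """Inverse pass: apply W^{-1} at every spatial position."""
    return np.einsum("oc,chw->ohw", np.linalg.inv(weight), h_next)
```

Only a \(C \times C\) determinant and inverse are needed, regardless of the spatial size of the activation map.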

Actnorm: This provides a channel-wise normalization through a learned scaling and bias. We keep this layer in its standard un-conditional form  [20].

Squeeze: It is important to process the activations at different scales in order to capture correlations and structures over larger distances. The squeeze layer [20] provides an invertible means of halving the resolution of the activation map \(\mathbf {h}^n\) by reshaping each spatial \(2\times 2\) neighborhood into the channel dimension.
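The squeeze operation can be sketched in NumPy as a pure reshape and transpose. The particular ordering of the four pixels within the new channels is an assumption, since implementations differ; any fixed ordering is a permutation of the entries and therefore has zero log-determinant.

```python
import numpy as np

def squeeze(h):
    """(C, H, W) -> (4C, H/2, W/2): move each 2x2 neighborhood into the channels."""
    c, height, width = h.shape
    assert height % 2 == 0 and width % 2 == 0
    h = h.reshape(c, height // 2, 2, width // 2, 2)    # (c, i, di, j, dj)
    h = h.transpose(0, 2, 4, 1, 3)                     # (c, di, dj, i, j)
    return h.reshape(4 * c, height // 2, width // 2)

def unsqueeze(h):
    """Exact inverse of squeeze: (4C, H, W) -> (C, 2H, 2W)."""
    c4, height, width = h.shape
    c = c4 // 4
    h = h.reshape(c, 2, 2, height, width)              # (c, di, dj, i, j)
    h = h.transpose(0, 3, 1, 4, 2)                     # (c, i, di, j, dj)
    return h.reshape(c, 2 * height, 2 * width)
```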

Affine Injector: To achieve more direct information transfer from the low-resolution image encoding \(\mathbf {u}= g_{{\varvec{\theta }}}(\mathbf {x})\) to the flow branch, we additionally introduce the affine injector layer. In contrast to the conditional affine coupling layer, our affine injector layer directly affects all channels and spatial locations in the activation map \(\mathbf {h}^n\). This is achieved by predicting an element-wise scaling and bias using only the conditional encoding \(\mathbf {u}\),

$$\begin{aligned} \mathbf {h}^{n+1} = \exp \!\big (f_{{\varvec{\theta }},\text {s}}^n(\mathbf {u})\big ) \cdot \mathbf {h}^n + f_{{\varvec{\theta }},\text {b}}^n(\mathbf {u}) \,. \end{aligned}$$
(5)

Here, \(f_{{\varvec{\theta }},\text {s}}\) and \(f_{{\varvec{\theta }},\text {b}}\) can be any network. The inverse of (5) is trivially obtained as \(\mathbf {h}^n = \exp (-f_{{\varvec{\theta }},\text {s}}^n(\mathbf {u})) \cdot (\mathbf {h}^{n+1} - f_{{\varvec{\theta }},\text {b}}^n(\mathbf {u}))\) and the log-determinant is given by \(\sum _{ijk} f_{{\varvec{\theta }},\text {s}}^n(\mathbf {u})_{ijk}\). Here, the sum ranges over all spatial ij and channel indices k.
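The affine injector (5), its inverse and its log-determinant can be sketched as follows. The scale and bias networks are replaced by per-pixel linear maps of \(\mathbf {u}\) with a `tanh` on the log-scale; these stand-ins, the `params` dictionary and the function names are assumptions for illustration only.

```python
import numpy as np

def injector_nets(u, params):
    """Stand-in for f_s(u) and f_b(u): per-pixel linear maps of u (assumption)."""
    log_s = np.tanh(np.einsum("oc,chw->ohw", params["w_s"], u))
    bias = np.einsum("oc,chw->ohw", params["w_b"], u)
    return log_s, bias

def injector_forward(h, u, params):
    """Eq. (5): element-wise scale and shift of all channels and positions of h."""
    log_s, bias = injector_nets(u, params)
    return np.exp(log_s) * h + bias, log_s.sum()

def injector_inverse(h_next, u, params):
    """Inverse of (5): subtract the bias, then undo the scaling."""
    log_s, bias = injector_nets(u, params)
    return np.exp(-log_s) * (h_next - bias)
```

Unlike the coupling layer, the scale and bias depend only on `u`, so every entry of `h` is transformed and none is passed through unchanged.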

3.3 Architecture

Our SRFlow architecture, depicted in Fig. 2, consists of the invertible flow network \(f_{{\varvec{\theta }}}\) and the LR encoder \(g_{{\varvec{\theta }}}\). The flow network is organized into L levels, each operating at a resolution of \(\frac{H}{2^l} \times \frac{W}{2^l}\), where \(l \in \{1, \ldots , L\}\) is the level number and \(H \times W\) is the HR resolution. Each level itself contains K flow-steps.

Flow-Step: Each flow-step in our approach consists of four different layers, as visualized in Fig. 2. The Actnorm is applied first, followed by the \(1\times 1\) convolution. We then apply the two conditional layers, first the Affine Injector followed by the Conditional Affine Coupling.

Level Transitions: Each level first performs a squeeze operation that effectively halves the spatial resolution. We observed that this layer can lead to checkerboard artifacts in the reconstructed image, since it is only based on pixel re-ordering. To learn a better transition between the levels, we therefore remove the conditional layers in the first few flow steps after the squeeze (see Fig. 2). This allows the network to learn a linear invertible interpolation between neighboring pixels. Similar to [20], we split off \(50\%\) of the channels before the next squeeze layer. Our latent variables \((z_l)_{l=1}^L\) thus model variations in the image at different resolutions, as visualized in Fig. 2.
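The multi-level layout can be checked with a few lines of shape bookkeeping. The sketch below assumes that every level squeezes once, that every level except the last splits off half of its channels as \(z_l\), and that the last level emits all remaining channels (a Glow-style assumption not stated in the text). `level_shapes` is a hypothetical helper.

```python
def level_shapes(height, width, num_levels, channels=3):
    """Return the latent shape (channels, h, w) of z_l for l = 1..L.

    Assumes squeeze (4x channels, half resolution) at the start of every level,
    a 50% channel split after every level but the last, and that the last level
    keeps all remaining channels as its latent.
    """
    shapes = []
    c, h, w = channels, height, width
    for level in range(1, num_levels + 1):
        c, h, w = 4 * c, h // 2, w // 2
        z_c = c // 2 if level < num_levels else c
        shapes.append((z_c, h, w))
        c -= z_c
    return shapes
```

For a \(160\times 160\) RGB image and \(L=3\) the latents have spatial sizes 80, 40 and 20, matching \(\frac{H}{2^l}\), and their sizes add up to \(3 \cdot 160 \cdot 160\), as required for a bijection.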

Low-Resolution Encoding Network \(g_{{\varvec{\theta }}}\): SRFlow allows for the use of any differentiable architecture for the LR encoding network \(g_{{\varvec{\theta }}}\), since it does not need to be invertible. Our approach can therefore benefit from the advances in standard feed-forward SR architectures. In particular, we adopt the popular CNN architecture based on Residual-in-Residual Dense Blocks (RRDB)  [46], which builds upon [22, 23]. It employs multiple residual and dense skip connections, without any batch normalization layers. We first discard the final upsampling layers in the RRDB architecture since we are only interested in the underlying representation and not the SR prediction. In order to capture a richer representation of the LR image at multiple levels, we additionally concatenate the activations after each RRDB block to form the final output of \(g_{{\varvec{\theta }}}\).

Details: We employ \(K=16\) flow-steps at each level, with two additional unconditional flow-steps after each squeeze layer (discussed above). We use \(L=3\) and \(L=4\) levels for SR factors \(4\times \) and \(8\times \) respectively. For general image SR, we use the standard 23-block RRDB architecture  [46] for the LR encoder \(g_{{\varvec{\theta }}}\). For faces, we reduce to 8 blocks for efficiency. The networks \(f_{{\varvec{\theta }},\text {s}}^n\) and \(f_{{\varvec{\theta }},\text {b}}^n\) in the conditional affine coupling (4) and the affine injector (5) are constructed using two shared convolutional layers with ReLU, followed by a final convolution.

3.4 Training Details

We train our entire SRFlow network using the negative log-likelihood loss (3). We sample batches of 16 LR-HR image pairs \((\mathbf {x}, \mathbf {y})\). During training, we use an HR patch size of \(160\times 160\). As optimizer we use Adam with a starting learning rate of \(5\cdot 10^{-4}\), which is halved at \(50\%, 75\%, 90\%\) and \(95\%\) of the total training iterations. To increase training efficiency, we first pre-train the LR encoder \(g_{{\varvec{\theta }}}\) using an \(L_1\) loss for 200k iterations. We then train our full SRFlow architecture using only the loss (3) for 200k iterations. Our network takes 5 days to train on a single NVIDIA V100 GPU. Further details are provided in the supplementary.
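The learning-rate schedule described above can be written as a small function. This is a sketch: the function name is hypothetical, and treating a milestone as reached when the step count equals it exactly is an assumption.

```python
def learning_rate(step, total_steps, base_lr=5e-4, milestones_percent=(50, 75, 90, 95)):
    """Halve base_lr once for every milestone (in percent of training) already reached."""
    # Integer arithmetic avoids floating-point issues at the milestone boundaries.
    passed = sum(100 * step >= m * total_steps for m in milestones_percent)
    return base_lr * 0.5 ** passed
```

With 200k iterations, the rate is \(5\cdot 10^{-4}\) until iteration 100k and has been halved four times, to \(3.125\cdot 10^{-5}\), from iteration 190k onwards.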

Datasets: For face super-resolution, we use the CelebA [25] dataset. Similar to [18, 20], we pre-process the dataset by cropping aligned patches, which are resized to the HR resolution of \(160\times 160\). We employ the full train split (160k images). For general SR, we use the same training data as ESRGAN [46], consisting of the 800-image train split of DIV2K [1] along with 2650 images from Flickr2K. The LR images are constructed using the standard MATLAB bicubic kernel.

Fig. 3. Random \(8\times \) SR samples generated by SRFlow using a temperature \(\tau =0.8\). The LR image is shown in the top left.

Fig. 4. Latent space transfer from the region marked by the box to the target image. (\(8\times \))

4 Applications and Image Manipulations

In this section, we explore the use of our SRFlow network for a variety of applications and image manipulation tasks. Our techniques exploit two key advantages of our SRFlow network, which are not present in GAN-based super-resolution approaches  [46]. First, our network models a distribution \(p_{\mathbf {y}|\mathbf {x}}(\mathbf {y}| \mathbf {x}, {\varvec{\theta }})\) in HR-image space, instead of only predicting a single image. It therefore possesses great flexibility by capturing a variety of possible HR predictions. This allows different predictions to be explored using additional guiding information or random sampling. Second, the flow network \(f_{{\varvec{\theta }}}(\mathbf {y}; \mathbf {x})\) is a fully invertible encoder-decoder. Hence, any HR image \(\tilde{\mathbf {y}}\) can be encoded into the latent space as \(\tilde{\mathbf {z}} = f_{{\varvec{\theta }}}(\tilde{\mathbf {y}}; \mathbf {x})\) and exactly reconstructed as \(\tilde{\mathbf {y}} = f_{{\varvec{\theta }}}^{-1}(\tilde{\mathbf {z}}; \mathbf {x})\). This bijective correspondence allows us to flexibly operate in both the latent and image space.

4.1 Stochastic Super-Resolution

The distribution \(p_{\mathbf {y}|\mathbf {x}}(\mathbf {y}| \mathbf {x}, {\varvec{\theta }})\) learned by our SRFlow can be explored by sampling different SR predictions as \(\mathbf {y}^{(i)} = f_{{\varvec{\theta }}}^{-1}(\mathbf {z}^{(i)}; \mathbf {x}),\, \mathbf {z}^{(i)}\! \sim p_{\mathbf {z}}\) for a given LR image \(\mathbf {x}\). As commonly observed for flow-based models, the best results are achieved when sampling with a slightly lower variance  [20]. We therefore use a Gaussian \(\mathbf {z}^{(i)}\! \sim \mathcal {N}(0, \tau )\) with variance \(\tau \) (also called temperature). Results are visualized in Fig. 3 for \(\tau =0.8\). Our approach generates a large variety of SR images, including differences in e.g. hair and facial attributes, while preserving consistency with the LR image. Since our latent variables \(\mathbf {z}_{ijkl}\) are spatially localized, specific parts can be re-sampled, enabling more controlled interactive editing and exploration of the SR image.
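Temperature sampling and region-wise re-sampling of the latent can be sketched as below. Following the convention of the text, \(\tau \) is treated as a variance, so the standard deviation is \(\sqrt{\tau }\). The function names and the boolean-mask interface are illustrative assumptions.

```python
import numpy as np

def sample_latent(shape, tau, rng):
    """Draw z ~ N(0, tau * I); tau is a variance, so the std is sqrt(tau)."""
    return np.sqrt(tau) * rng.standard_normal(shape)

def resample_region(z, mask, tau, rng):
    """Re-draw only the latent entries selected by a boolean mask of z's shape."""
    fresh = sample_latent(z.shape, tau, rng)
    return np.where(mask, fresh, z)
```

Since the latent variables are spatially localized, a mask covering a part of the image changes the decoded SR image mainly in that region while the remaining latents stay fixed.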

4.2 LR-Consistent Style Transfer

Our SRFlow allows transferring the style of an existing HR image \(\tilde{\mathbf {y}}\) when super-resolving an LR image \(\mathbf {x}\). This is performed by first encoding the source HR image as \(\tilde{\mathbf {z}} = f_{{\varvec{\theta }}}(\tilde{\mathbf {y}}; d_\downarrow (\tilde{\mathbf {y}}))\), where \(d_\downarrow \) is the down-scaling operator. The encoding \(\tilde{\mathbf {z}}\) can then be used as the latent variable for the super-resolution of \(\mathbf {x}\) as \(\mathbf {y}= f_{{\varvec{\theta }}}^{-1}(\tilde{\mathbf {z}}; \mathbf {x})\). This operation can also be performed on local regions of the image. Examples in Fig. 4 show the transfer in the style of facial characteristics, hair and eye color. Our SRFlow network automatically aims to ensure consistency with the LR image without any additional constraints.
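The encode-then-decode recipe can be illustrated with a toy stand-in for the flow. Here \(f(\mathbf {y}; \mathbf {x})\) is simply the rescaled residual between \(\mathbf {y}\) and the nearest-neighbour upsampling of \(\mathbf {x}\), and \(d_\downarrow \) is average pooling. All of this (including the scale `S`) is a hypothetical construction, not the SRFlow network. In this toy case LR-consistency holds exactly by construction, whereas SRFlow has to learn it.

```python
import numpy as np

S = 0.1  # hypothetical fixed scale of the toy flow

def downscale(y, k=2):
    """Toy d_down: k x k average pooling of an (H, W) image."""
    h, w = y.shape
    return y.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def upscale(x, k=2):
    """Nearest-neighbour upsampling, used as the toy conditioning signal."""
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def flow(y, x):
    """Toy z = f(y; x): residual to the upsampled LR image, rescaled by 1/S."""
    return (y - upscale(x)) / S

def flow_inverse(z, x):
    """Toy y = f^{-1}(z; x)."""
    return upscale(x) + S * z

def style_transfer(y_source, x_target):
    """Encode the source HR image with its own LR, decode against the target LR."""
    z_source = flow(y_source, downscale(y_source))
    return flow_inverse(z_source, x_target)
```

The result carries the high-frequency residual of `y_source` on top of the low-frequency content of `x_target`, and down-scaling it returns `x_target`.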

4.3 Latent Space Normalization

We develop more advanced image manipulation techniques by taking advantage of the invertibility of the SRFlow network \(f_{{\varvec{\theta }}}\) and the learned super-resolution posterior. The core idea of our approach is to map any HR image containing desired content to the latent space, where the latent statistics can be normalized in order to make it consistent with the low-frequency information in the given LR image. Let \(\mathbf {x}\) be a low-resolution image and \(\tilde{\mathbf {y}}\) be any high-resolution image, not necessarily consistent with the LR image \(\mathbf {x}\). For example, \(\tilde{\mathbf {y}}\) can be an edited version of a super-resolved image or a guiding image for the super-resolution image. Our goal is to achieve an HR image \(\mathbf {y}\), containing image content from \(\tilde{\mathbf {y}}\), but that is consistent with the LR image \(\mathbf {x}\).

The latent encoding for the given image pair is computed as \(\tilde{\mathbf {z}} = f_{{\varvec{\theta }}}(\tilde{\mathbf {y}}; \mathbf {x})\). Note that our network is trained to predict consistent and natural SR images for latent variables sampled from a standard Gaussian distribution \(p_{\mathbf {z}}= \mathcal {N}(0, I)\). Since \(\tilde{\mathbf {y}}\) is not necessarily consistent with the LR image \(\mathbf {x}\), the latent variables \(\tilde{\mathbf {z}}_{ijkl}\) do not have the same statistics as if independently sampled from \(\mathbf {z}_{ijkl} \sim \mathcal {N}(0, \tau )\). Here, \(\tau \) denotes an additional temperature scaling of the desired latent distribution. In order to achieve desired statistics, we normalize the first two moments of collections of latent variables. In particular, if \(\{z_i\}_1^N \sim \mathcal {N}(0, \tau )\) are independent, then it is well known [33] that their empirical mean \(\hat{\mu }\) and variance \(\hat{\sigma }^2\) are distributed according to,

$$\begin{aligned} \hat{\mu } = \frac{1}{N} \sum _{i=1}^N z_i \sim \mathcal {N}\left( 0, \frac{\tau }{N}\right) , \;\, \hat{\sigma }^2 = \frac{1}{N\!-\!1} \sum _{i=1}^N (z_i - \hat{\mu })^2 \sim \varGamma \left( \frac{N\!-\!1}{2}, \frac{2 \tau }{N\!-\!1}\right) . \end{aligned}$$
(6)

Here, \(\varGamma (k,\theta )\) is a gamma distribution with shape and scale parameters k and \(\theta \) respectively. For a given collection \(\tilde{\mathcal {Z}} \subset \{\mathbf {z}_{ijkl}\}\) of latent variables, we normalize their statistics by first sampling a new mean \(\hat{\mu }\) and variance \(\hat{\sigma }^2\) according to (6), where \(N = |\tilde{\mathcal {Z}}|\) is the size of the collection. The latent variables in the collection are then normalized as,

$$\begin{aligned} \hat{z} = \frac{\hat{\sigma }}{\tilde{\sigma }} (\tilde{z} - \tilde{\mu }) + \hat{\mu } \,,\quad \forall \tilde{z} \in \tilde{\mathcal {Z}} \,. \end{aligned}$$
(7)

Here, \(\tilde{\mu }\) and \(\tilde{\sigma }^2\) denote the empirical mean and variance of the collection \(\tilde{\mathcal {Z}}\).
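The normalization of a single collection by (6) and (7) can be sketched in NumPy as follows. The function names are illustrative, and the collection is assumed to contain at least two latent variables so that the unbiased variance is defined.

```python
import numpy as np

def sample_target_stats(n, tau, rng):
    """Sample (mu_hat, sigma_hat^2) according to (6) for n i.i.d. N(0, tau) variables."""
    mu_hat = rng.normal(0.0, np.sqrt(tau / n))
    var_hat = rng.gamma(shape=(n - 1) / 2.0, scale=2.0 * tau / (n - 1))
    return mu_hat, var_hat

def normalize_collection(z_tilde, tau, rng):
    """Eq. (7): shift and rescale a collection so it matches the sampled statistics."""
    n = z_tilde.size
    mu_tilde = z_tilde.mean()
    sigma_tilde = z_tilde.std(ddof=1)        # unbiased estimate, as in (6)
    mu_hat, var_hat = sample_target_stats(n, tau, rng)
    return np.sqrt(var_hat) / sigma_tilde * (z_tilde - mu_tilde) + mu_hat
```

Sampling the target mean and variance, rather than forcing them to exactly 0 and \(\tau \), keeps the normalized collection statistically indistinguishable in its first two moments from a genuine draw of \(N\) Gaussian variables.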

Fig. 5. Image content transfer for an existing HR image (top) and an SR prediction (bottom). Content from the source is applied directly to the target. By applying latent space normalization in our SRFlow, the content is integrated and harmonized.

Fig. 6. Comparison of super-resolving the LR of the original and normalizing the latent space for image restoration.

The normalization in (7) can be performed using different collections \(\tilde{\mathcal {Z}}\). We consider three different strategies in this work. Global normalization is performed over the entire latent space, using \(\tilde{\mathcal {Z}} = \{\mathbf {z}_{ijkl}\}_{ijkl}\). For local normalization, each spatial position ij in each level l is normalized independently as \(\tilde{\mathcal {Z}}_{ijl} = \{\mathbf {z}_{ijkl}\}_{k}\). This better addresses cases where the statistics are spatially varying. Spatial normalization is performed independently for each feature channel k and level l, using \(\tilde{\mathcal {Z}}_{kl} = \{\mathbf {z}_{ijkl}\}_{ij}\). It addresses global effects in the image that activate certain channels, such as color shift or noise. In all three cases, the normalized latent variable \(\hat{\mathbf {z}}\) is obtained by applying (7) for all collections, which is an easily parallelized computation. The final HR image is then reconstructed as \(\hat{\mathbf {y}} = f_{{\varvec{\theta }}}^{-1}(\hat{\mathbf {z}},\mathbf {x})\). Note that our normalization procedure is stochastic, since a new mean \(\hat{\mu }\) and variance \(\hat{\sigma }^2\) are sampled independently for every collection of latent variables \(\tilde{\mathcal {Z}}\). This allows us to sample from the natural diversity of predictions \(\hat{\mathbf {y}}\), that integrate content from \(\tilde{\mathbf {y}}\). Next, we explore our latent space normalization technique for different applications.
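The three choices of collection differ only in the axes over which the statistics are computed. The sketch below assumes that the latent of each level is stored as an array of shape (channels k, rows i, columns j) and that the latent is given as a list of such arrays, one per level. This layout and the function names are assumptions for illustration.

```python
import numpy as np

def normalize(z, axes, tau, rng):
    """Apply (7) independently to every collection obtained by reducing over `axes`."""
    n = int(np.prod([z.shape[a] for a in axes]))
    mu_tilde = z.mean(axis=axes, keepdims=True)
    sigma_tilde = z.std(axis=axes, ddof=1, keepdims=True)
    # One (mu_hat, sigma_hat^2) pair per collection, sampled according to (6).
    mu_hat = rng.normal(0.0, np.sqrt(tau / n), size=mu_tilde.shape)
    var_hat = rng.gamma((n - 1) / 2.0, 2.0 * tau / (n - 1), size=mu_tilde.shape)
    return np.sqrt(var_hat) / sigma_tilde * (z - mu_tilde) + mu_hat

def local_normalization(z_level, tau, rng):
    """Collections {z_ijkl}_k: one per spatial position, taken over the channels."""
    return normalize(z_level, (0,), tau, rng)

def spatial_normalization(z_level, tau, rng):
    """Collections {z_ijkl}_ij: one per channel, taken over the spatial positions."""
    return normalize(z_level, (1, 2), tau, rng)

def global_normalization(z_levels, tau, rng):
    """A single collection over all levels, channels and positions."""
    flat = np.concatenate([z.ravel() for z in z_levels])
    flat = normalize(flat, (0,), tau, rng)
    splits = np.cumsum([z.size for z in z_levels])[:-1]
    return [p.reshape(z.shape) for p, z in zip(np.split(flat, splits), z_levels)]
```

Local and spatial normalization are applied to each level separately, for example `[spatial_normalization(z, tau, rng) for z in z_levels]`; restricting the update to a boolean region mask would give the region-limited variant.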

4.4 Image Content Transfer

Here, we aim to manipulate an HR image by transferring content from other images. Let \(\mathbf {x}\) be an LR image and \(\mathbf {y}\) a corresponding HR image. If we are manipulating a super-resolved image, then \(\mathbf {y}= f_{{\varvec{\theta }}}^{-1}(\mathbf {z}, \mathbf {x})\) is an SR sample of \(\mathbf {x}\). However, we can also manipulate an existing HR image \(\mathbf {y}\) by setting \(\mathbf {x}= d_\downarrow (\mathbf {y})\) to the down-scaled version of \(\mathbf {y}\). We then manipulate \(\mathbf {y}\) directly in the image space by simply inserting content from other images, as visualized in Fig. 5. To harmonize the resulting manipulated image \(\tilde{\mathbf {y}}\) by ensuring consistency with the LR image \(\mathbf {x}\), we compute the latent encoding \(\tilde{\mathbf {z}} = f_{{\varvec{\theta }}}(\tilde{\mathbf {y}}; \mathbf {x})\) and perform local normalization of the latent variables as described in Sect. 4.3. We only normalize the affected regions of the image in order to preserve the non-manipulated content. Results are shown in Fig. 5. If desired, the emphasis on LR-consistency can be reduced by training SRFlow with randomly misaligned HR-LR pairs, which allows increased manipulation flexibility (see supplement).

4.5 Image Restoration

We demonstrate the strength of our learned image posterior by applying it to image restoration tasks. Note that here we employ the same SRFlow network, which is trained only for super-resolution and not for the explored tasks. In particular, we investigate degradations that mainly affect the high frequencies in the image, such as noise and compression artifacts. Let \(\tilde{\mathbf {y}}\) be a degraded image. Noise and other high-frequency degradations are largely removed when the image is down-sampled as \(\mathbf {x}= d_\downarrow (\tilde{\mathbf {y}})\). Thus, a cleaner image can be obtained by applying any super-resolution method to \(\mathbf {x}\). However, this generates poor results, since important image information is lost in the down-sampling process (Fig. 6, center).

Our approach can go beyond this result by directly utilizing the original image \(\tilde{\mathbf {y}}\). The degraded image and its down-sampled variant are input to our SRFlow network to generate the latent variable \(\tilde{\mathbf {z}} = f_{{\varvec{\theta }}}(\tilde{\mathbf {y}}; \mathbf {x})\). We then perform first spatial and then local normalization of \(\tilde{\mathbf {z}}\), as described in Sect. 4.3. The restored image is then predicted as \(\hat{\mathbf {y}} = f_{{\varvec{\theta }}}^{-1}(\hat{\mathbf {z}},\mathbf {x})\). By denoting the normalization operation as \(\hat{\mathbf {z}} = \phi (\tilde{\mathbf {z}})\), the full restoration mapping can be expressed as \(\hat{\mathbf {y}} = f_{{\varvec{\theta }}}^{-1}(\phi (f_{{\varvec{\theta }}}(\tilde{\mathbf {y}}; d_\downarrow (\tilde{\mathbf {y}}))),d_\downarrow (\tilde{\mathbf {y}}))\). As shown visually and quantitatively in Fig. 6, this allows us to recover a substantial amount of detail from the original image. Intuitively, our approach works by mapping the degraded image \(\tilde{\mathbf {y}}\) to the closest image within the learned distribution \(p_{\mathbf {y}|\mathbf {x}}(\mathbf {y}| \mathbf {x}, {\varvec{\theta }})\). Since SRFlow is not trained with such degradations, \(p_{\mathbf {y}|\mathbf {x}}(\mathbf {y}| \mathbf {x}, {\varvec{\theta }})\) mainly models clean images. Our normalization therefore automatically restores the image, since it is transformed to a more likely image according to our SR distribution \(p_{\mathbf {y}|\mathbf {x}}(\mathbf {y}| \mathbf {x}, {\varvec{\theta }})\).
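The composition \(\hat{\mathbf {y}} = f_{{\varvec{\theta }}}^{-1}(\phi (f_{{\varvec{\theta }}}(\tilde{\mathbf {y}}; d_\downarrow (\tilde{\mathbf {y}}))),d_\downarrow (\tilde{\mathbf {y}}))\) can be illustrated with a toy example. The sketch below is not SRFlow: `f` and `f_inv` are a hypothetical, hand-written invertible affine map conditioned on the LR image, `d_down` is assumed to be a box average, and `phi` uses fixed target statistics instead of the sampled ones from Sect. 4.3. It only shows how the encode, normalize and decode steps fit together.

```python
import numpy as np

def d_down(y, s=2):
    """Hypothetical down-sampling operator: s x s box average."""
    h, w, c = y.shape
    return y.reshape(h // s, s, w // s, s, c).mean(axis=(1, 3))

def up(x, s=2):
    """Nearest-neighbour up-sampling used by the toy flow below."""
    return x.repeat(s, axis=0).repeat(s, axis=1)

def f(y, x, scale=0.1):
    """Toy conditional flow (NOT SRFlow): z = (y - up(x)) / scale."""
    return (y - up(x)) / scale

def f_inv(z, x, scale=0.1):
    """Exact inverse of the toy flow: y = scale * z + up(x)."""
    return scale * z + up(x)

def phi(z, eps=1e-8):
    """Spatial, then local, standardization with fixed targets (assumption)."""
    z = (z - z.mean(axis=(0, 1), keepdims=True)) / (z.std(axis=(0, 1), keepdims=True) + eps)
    z = (z - z.mean(axis=2, keepdims=True)) / (z.std(axis=2, keepdims=True) + eps)
    return z

def restore(y_tilde, normalizer=phi):
    """Encode with the LR condition, normalize the latents, decode."""
    x = d_down(y_tilde)
    z_tilde = f(y_tilde, x)
    return f_inv(normalizer(z_tilde), x)

rng = np.random.default_rng(0)
y_tilde = rng.random((8, 8, 3))                      # stand-in for a degraded image
y_hat = restore(y_tilde)
y_same = restore(y_tilde, normalizer=lambda z: z)    # identity phi returns the input
```

With the identity in place of `phi`, the mapping reduces to \(f_{{\varvec{\theta }}}^{-1}(f_{{\varvec{\theta }}}(\tilde{\mathbf {y}}; \mathbf {x}), \mathbf {x}) = \tilde{\mathbf {y}}\), which makes explicit that the normalization is the only step that changes the image.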

Table 1. Results for \(8\times \) SR of faces of CelebA. We compare using both the standard bicubic kernel and the progressive linear kernel from  [18]. We also report the diversity in the SR output in terms of the pixel standard deviation \(\sigma \).
Fig. 7. Comparison of our SRFlow with state-of-the-art for \(8\times \) face SR on CelebA.

5 Experiments

We perform comprehensive experiments for super-resolution of faces and of generic images, comparing with the current state of the art, and we provide an ablative analysis. Applications, such as image manipulation tasks, are presented in Sect. 4, with additional results, analysis and visuals in the supplement.

Evaluation Metrics: To evaluate the perceptual distance to the ground truth, we report the default LPIPS  [54]. It is a learned distance metric, based on the feature space of a finetuned AlexNet. We report the standard fidelity-oriented metrics, Peak Signal to Noise Ratio (PSNR) and structural similarity index (SSIM)  [47], although they are known not to correlate well with the human perception of image quality  [16, 22, 26, 28, 42, 44]. Furthermore, we report the no-reference metrics NIQE  [32], BRISQUE  [31] and PIQUE  [34]. In addition to visual quality, consistency with the LR image is an important factor. We therefore evaluate this aspect by reporting the LR-PSNR, computed as the PSNR between the downsampled SR image and the original LR image.
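As a minimal sketch of the LR-PSNR computation, the snippet below uses the standard PSNR definition, \(10 \log _{10}(\mathrm {MAX}^2 / \mathrm {MSE})\), for images with values in [0, 1]. The down-sampling kernel is not specified in this passage, so the box average in `box_downsample` is an assumption made purely for illustration, and the function names are hypothetical.

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def box_downsample(img, s):
    """Assumed down-sampling: s x s box average of an (H, W, C) image."""
    h, w, c = img.shape
    return img.reshape(h // s, s, w // s, s, c).mean(axis=(1, 3))

def lr_psnr(sr, lr, s):
    """PSNR between the down-sampled SR image and the original LR image."""
    return psnr(box_downsample(sr, s), lr, max_val=1.0)

rng = np.random.default_rng(0)
lr = rng.random((16, 16, 3))
sr = lr.repeat(8, axis=0).repeat(8, axis=1)                  # trivially consistent SR
sr_noisy = np.clip(sr + 0.05 * rng.standard_normal(sr.shape), 0.0, 1.0)
print(lr_psnr(sr, lr, 8), lr_psnr(sr_noisy, lr, 8))
```

A higher LR-PSNR indicates that the SR image, once down-sampled, stays closer to the given LR input.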

Table 2. General image SR results on the 100 validation images of the DIV2K dataset.
Fig. 8. Comparison to state-of-the-art for general SR on the DIV2K validation set.

5.1 Face Super-Resolution

We evaluate SRFlow for face SR (\(8\times \)) using 5000 images from the test split of the CelebA dataset. We compare with bicubic, RRDB  [46], ESRGAN  [46], and ProgFSR  [18]. While the latter two are GAN-based, RRDB is trained using only \(L_1\) loss. ProgFSR is a very recent SR method specifically designed for faces, shown to outperform several prior face SR approaches in  [18]. It is trained on the full train split of CelebA, but using a bilinear kernel. For fair comparison, we therefore separately train and evaluate SRFlow on the same kernel.

Results are provided in Table 1 and Fig. 7. Since our aim is perceptual quality, we consider LPIPS the primary metric, as it has been shown to correlate much better with human opinions  [27, 54]. SRFlow achieves an LPIPS distance more than twice as good as that of RRDB, at the cost of lower PSNR and SSIM scores. As seen in the visual comparisons in Fig. 7, RRDB generates extremely blurry results, lacking natural high-frequency details. Compared to the GAN-based methods, SRFlow achieves significantly better results in all reference metrics. Interestingly, even the PSNR is superior to ESRGAN and ProgFSR, showing that our approach preserves fidelity to the HR ground truth while achieving better perceptual quality. This is partially explained by the hallucination artifacts that often plague GAN-based approaches, as seen in Fig. 7. Our approach generates sharp and natural images, while avoiding such artifacts. Interestingly, our SRFlow achieves an LR-consistency that is even better than the fidelity-trained RRDB, while the GAN-based methods are comparatively inconsistent with the input LR image.

5.2 General Super-Resolution

Next, we evaluate our SRFlow for general SR on the DIV2K validation set. We compare SRFlow to bicubic, EDSR  [23], RRDB  [46], ESRGAN  [46], and RankSRGAN  [55]. Except for EDSR, which was trained on DIV2K, all methods including SRFlow are trained on the train splits of DIV2K and Flickr2K (see Sect. 3.3). For the \(4\times \) setting, we employ the provided pre-trained models. Since pre-trained \(8\times \) models were not available, we trained RRDB and ESRGAN for \(8\times \) using the authors’ code.

EDSR and RRDB are trained using only reconstruction losses, thereby achieving inferior results in terms of the perceptual LPIPS metric (Table 2). Compared to the GAN-based methods [46, 55], our SRFlow achieves significantly better PSNR, LPIPS and LR-PSNR, and favorable results in terms of PIQUE and BRISQUE. Visualizations in Fig. 8 confirm the perceptually inferior results of EDSR and RRDB, which generate few high-frequency details. In contrast, SRFlow generates rich details, achieving favorable perceptual quality compared to ESRGAN. In the first row, ESRGAN generates severe, discolored artifacts and ringing patterns at several locations in the image. We find SRFlow to generate more stable and consistent results in these circumstances.

Fig. 9. Analysis of number of flow steps and dimensionality in the conditional layers.

Table 3. Analysis of the impact of the transitional linear flow steps and the affine image injector.

5.3 Ablative Study

To ablate the depth and width, we train our network with different numbers of flow steps K and of hidden-layer channels in the two conditional layers (4) and (5), respectively. Figure 9 shows results on the CelebA dataset. Decreasing the number of flow steps K leads to more artifacts in complex structures, such as eyes. Similarly, a larger number of channels leads to better consistency in the reconstruction. In Table 3, we analyze architectural choices. The Affine Image Injector increases the fidelity, while preserving the perceptual quality. We also observe the transitional linear flow steps (Sect. 3.3) to be beneficial.

6 Conclusion

We propose a flow-based method for super-resolution, called SRFlow. Contrary to conventional methods, our approach learns the distribution of photo-realistic SR images given the input LR image. This explicitly accounts for the ill-posed nature of the SR problem and allows for the generation of diverse SR samples. Moreover, we develop techniques for image manipulation, exploiting the strong image posterior learned by SRFlow. In comprehensive experiments, our approach achieves improved results compared to state-of-the-art GAN-based approaches.