1 Introduction

Medical image reconstruction can often be formulated as the following mathematical problem:

$$\begin{aligned} \varvec{f}=\varvec{A}\varvec{u}\oplus \varvec{\eta }, \end{aligned}$$
(1.1)

where \(\varvec{A}\) is a physical system modeling the image acquisition process. The operator \(\varvec{A}\) can be linear or nonlinear, depending on the specific imaging modality. The variable \(\varvec{u}\) is the unknown image to be reconstructed, and \(\varvec{f}\) is the measured data, which might be contaminated by noise \(\varvec{\eta }\) with known or partially known noise statistics, e.g., Gaussian, Laplacian, Poisson, Rayleigh, etc. The operator \(\oplus \) denotes addition when Gaussian noise is assumed, and a certain nonlinear operation when Poisson or Rician noise is assumed. In different image reconstruction tasks, \(\varvec{A}\) takes different forms:

  • Denoising: \(\varvec{A}\) is an identity operator.

  • Deblurring: \(\varvec{A}\) is a convolution operator. When the convolution kernel is unknown, the problem is called blind deblurring [1].

  • Inpainting: \(\varvec{A}\) is a restriction operator which can be represented by a diagonal matrix with value 0 or 1 [2].

  • Magnetic resonance imaging (MRI): \(\varvec{A}\) is a subsampled Fourier transform which is a composition of the Fourier transform and a binary sampling operator [3].

  • X-ray based computed tomography (CT): \(\varvec{A}\) is a subsampled Radon transform, i.e., a partial collection of line integrals [4].

  • Quantitative susceptibility mapping (QSM) [5,6,7,8]: \(\varvec{A}\) is the dipole kernel

    $$\begin{aligned} \varvec{A}(X)=\frac{2z^2-x^2-y^2}{4\pi (x^2+y^2+z^2)^{5/2}},\ X=(x,y,z)\in \mathbb {R}^{3}. \end{aligned}$$

The inverse problem (1.1) is in general challenging to solve due to the large-scale and ill-posed nature of the problem in practice.

1.1 Image Reconstruction Models

The above inverse problem (1.1) covers a wide range of image restoration tasks which are not limited to medical image reconstruction. To solve the inverse problem (1.1), it is common practice to consider the following optimization problem:

$$\begin{aligned} \min _{\varvec{u}}{\mathscr {L}}(\varvec{u})=F(\varvec{A}\varvec{u},\varvec{f})+\lambda \varPhi (\varvec{W},\varvec{u}). \end{aligned}$$
(1.2)

The solution \(\varvec{u}^\star \in \arg \min _{\varvec{u}}{\mathscr {L}}(\varvec{u})\) is an approximate solution to the inverse problem (1.1). In (1.2), the term \(F(\varvec{A}\varvec{u},\varvec{f})\) is the data fidelity term that measures the consistency of the approximate solution with the measured data \(\varvec{f}\). Its specific form normally depends on the noise statistics. For example:

  • Gaussian noise: \(F(\varvec{A}\varvec{u},\varvec{f})=\frac{1}{2}\Vert \varvec{A}\varvec{u}-\varvec{f}\Vert _{2}^{2}\),

  • Poisson noise: \(F(\varvec{A}\varvec{u},\varvec{f})=\langle \varvec{1}, \varvec{A}\varvec{u}\rangle -\langle \varvec{f}, \log (\varvec{A}\varvec{u})\rangle \), with \(\langle \varvec{a},\varvec{b}\rangle =\sum _{i} \varvec{a}_{i}\varvec{b}_{i}\),

  • Impulse noise: \(F(\varvec{A}\varvec{u},\varvec{f})=\Vert \varvec{Au}-\varvec{f}\Vert _{1}\),

  • Multiplicative noise [9]: \(F(\varvec{A}\varvec{u},\varvec{f})=\lambda _{1}\left\langle \frac{\varvec{Af}}{\varvec{u}}, \varvec{1}\right\rangle +\lambda _{2}\left\| \frac{\varvec{Af}}{\varvec{u}}-\varvec{1}\right\| ^{2}\).
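As a concrete illustration, the first two fidelity terms above can be evaluated in a few lines of NumPy. This is a minimal sketch assuming \(\varvec{A}\) is given as a dense matrix; all names are illustrative placeholders.

```python
import numpy as np

def gaussian_fidelity(A, u, f):
    # F(Au, f) = (1/2) * ||Au - f||_2^2
    r = A @ u - f
    return 0.5 * np.dot(r, r)

def poisson_fidelity(A, u, f, eps=1e-12):
    # F(Au, f) = <1, Au> - <f, log(Au)>; eps guards against log(0)
    Au = A @ u
    return np.sum(Au) - np.dot(f, np.log(Au + eps))
```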

The second term \(\varPhi (\varvec{W},\varvec{u})\) in (1.2) is the regularization term encoding prior knowledge of the image to be reconstructed. The regularization term is often the most crucial part of the modeling and has been the main focus of the literature. The parameter \(\lambda \) in (1.2) balances the data fidelity term and the regularization term. Mathematical modeling has been playing a vital role in solving such inverse problems. Interested readers can refer to [10,11,12,13] for more extensive reviews of mathematical models for image restoration.

Deep learning models can also be cast into the form of (1.2). However, there are differences as well. In handcrafted or handcrafted \(+\) data-driven modeling, the transformation \(\varvec{W}\) is often a certain linear transformation that is able to extract sparse features from the images. In handcrafted models, \(\varvec{W}\) is often given by design (e.g., a differential operator or wavelet transform); in handcrafted \(+\) data-driven models, \(\varvec{W}\) (or a portion of it) is often learned from the given data. Sparsity is the key to the success of these models. Deep learning models follow a similar modeling philosophy by considering nonlinear sparsifying transformations rather than linear ones. In general, we define a parameterized nonlinear mapping \(\mathscr {V}(\cdot , \varvec{\varTheta }):{\mathscr {F}}\rightarrow \mathscr {U}, \varvec{f}\mapsto \varvec{u}\) that maps the input data \(\varvec{f}\) to a high-quality output image \(\varvec{u}\). The mapping \(\mathscr {V}\) is parameterized by \(\varvec{\varTheta }\), which is trained on a given data set \({\mathscr {F}}\times \mathscr {U}\) by solving the following optimization problem:

$$\begin{aligned} \min _{\varvec{\varTheta }}\frac{1}{\#({\mathscr {F}}\times \mathscr {U})} \sum _{(\varvec{f},\varvec{u})\in {\mathscr {F}}\times \mathscr {U}} \mathscr {C}(\mathscr {V}(\varvec{f},\varvec{\varTheta }),\varvec{u}), \end{aligned}$$

where \(\mathscr {C}(\cdot ,\cdot )\) is a metric of the difference between the approximated image \( \mathscr {V}(\varvec{f},\varvec{\varTheta })\) and the ground truth image \(\varvec{u}\), and \(\#({\mathscr {F}}\times \mathscr {U}) \) is the cardinality of the data set \({\mathscr {F}}\times \mathscr {U}\). To prevent overfitting, we can introduce a regularization term to the above optimization problem as in (1.2). We then have the problem

$$\begin{aligned} \min _{\varvec{\varTheta }}\mathscr {L}(\varvec{f},\varvec{u};\varvec{\varTheta }) =\frac{1}{\#({\mathscr {F}}\times \mathscr {U})}\sum _{(\varvec{f},\varvec{u})\in {\mathscr {F}}\times \mathscr {U}}\mathscr {C}(\mathscr {V}(\varvec{f},\varvec{\varTheta }),\varvec{u}) +\mathscr {R}(\varvec{\varTheta }), \end{aligned}$$
(1.3)

where \(\mathscr {R}(\cdot )\) is the regularization term, which can be chosen as, for example, the \(\ell _{2}\) or \(\ell _{1}\) norm. Good examples of the nonlinear mapping \(\mathscr {V}(\cdot )\) include the stacked denoising autoencoder (SDAE) of Vincent et al. [14], the U-Net [15], the ResNet [16, 17], etc. We postpone a detailed discussion on these networks and how to interpret them in mathematical terms to later sections.
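To make (1.3) concrete, the following minimal PyTorch sketch evaluates the regularized training objective, taking \(\mathscr {C}\) to be the mean squared error and \(\mathscr {R}\) to be a squared \(\ell _2\) penalty; the network net, the data pairs, and the weight reg_weight are illustrative assumptions, not part of the original text.

```python
import torch

def training_loss(net, pairs, reg_weight=1e-4):
    # Empirical risk: average of C(V(f, Theta), u) over the data set,
    # with C chosen as the mean squared error.
    data_term = sum(torch.mean((net(f) - u) ** 2) for f, u in pairs) / len(pairs)
    # R(Theta): squared l2 norm of all trainable parameters.
    reg_term = reg_weight * sum(p.pow(2).sum() for p in net.parameters())
    return data_term + reg_term
```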

The development of modeling in image reconstruction over the past three decades can be summarized in three stages.

  • Handcrafted modeling (1990-). Models are designed based on mathematical characterizations of the desired recovered image: for example, the ideal function space a “good” image should belong to, ideal local or global geometric properties the image should enjoy, or sparse approximation by certain well-designed basis functions. Successful handcrafted models include the total variation (TV) model [18], Perona–Malik diffusion [19, 20], shock filters [21, 22], nonlocal methods [23,24,25,26,27], wavelets [28, 29], wavelet frames [30, 31], BM3D [27], WNNM [32], etc. These models mostly have solid theoretical foundations and high interpretability. They work reasonably well in practice, and some of them are still the state-of-the-art methods for certain tasks.

  • Handcrafted\(+\)data-driven modeling (1999-). Starting from around 1999, models that combine data-driven learning with handcrafted modeling started to emerge. These models rely on a general mathematical or statistical framework of handcrafted design, while the specific form of the model is determined by the given image data or data set. Compared to purely handcrafted models, these models can better exploit the available data and outperform their non-data-driven counterparts. Meanwhile, the handcrafted framework of the models grants them certain interpretability and theoretical foundations. Successful examples include the method of optimal directions [33], the K-SVD [34], learning based PDE design [35], data-driven tight frame [36, 37], Ada-frame [38], low-rank models [39,40,41,42,43], piecewise-smooth image models [44, 45], and statistical models [46].

  • Deep learning models (2012-). 2012 is the year that signified the rise of deep learning in computer vision, with the introduction of a convolutional neural network (CNN) called AlexNet [47] for image classification. Then, various types of CNNs, such as ResNet [16, 17] and generative adversarial networks (GANs) [48], were introduced and applied to image reconstruction. We shall refer to these models as deep learning based models (or deep models for short). Most deep models have millions to billions of parameters. These parameters are trained (optimized) on large data sets via parallel computing (e.g., on graphics processing units (GPUs)). Deep models have greatly advanced the state of the art of many image reconstruction tasks and have changed the research landscape of computer vision in general. The success of deep models is mainly due to the availability of large image data sets with high-quality labels and access to massive computing resources. The reliance of deep models on large labeled data sets limits, at least for the moment, the application of deep learning in medical image reconstruction and healthcare in general. The major focus of this review is to recall and discuss deep models in medical image reconstruction, and the limitations, challenges, and opportunities in this new and exciting research direction.

Note that what makes medical image reconstruction different from image restoration in computer vision is the quality metric on the reconstructed image. Although researchers use standard metrics such as the peak signal-to-noise ratio (PSNR), mean squared error, and structural similarity (SSIM), meaningful quality metrics of a reconstructed medical image should be clinically relevant and task dependent. Furthermore, most medical images are 3D arrays, which poses computational challenges as well.

1.2 Algorithm Design for Image Reconstruction Models

The difficulties of solving image reconstruction models have motivated the optimization community to design highly efficient numerical algorithms for large-scale, nonsmooth, and even nonconvex optimization problems. Representative algorithms include the alternating direction method of multipliers (ADMM) [49,50,51], the primal–dual algorithm [52,53,54], the split Bregman algorithm [55, 56], the linearized Bregman algorithm [57, 58], the iterative shrinkage-thresholding algorithm (ISTA) [59], and the fast iterative shrinkage-thresholding algorithm (FISTA) [60], among many others. Here, we briefly review some of the algorithms that will be needed in later sections.

1.2.1 ISTA and FISTA

Consider the following optimization problem which is a special case of (1.2):

$$\begin{aligned} \min _{\varvec{\alpha }} \frac{1}{2}\Vert \varvec{f}-\varvec{W}^\mathrm{T}\varvec{\alpha }\Vert _2^{2} +\lambda \Vert \varvec{\alpha }\Vert _{1}, \end{aligned}$$
(1.4)

where \(\varvec{W}^\mathrm{T} \) is a decoding operator that maps the code \(\varvec{\alpha }\) back to the image domain. Then, the ISTA solving (1.4) simply reads as

$$\begin{aligned} \varvec{\alpha }^{k+1}={\mathscr {T}}_{\lambda \tau _{k}}\left( \varvec{\alpha }^{k} -\tau _{k}\varvec{W}(\varvec{W}^\mathrm{T}\varvec{\alpha }^{k}-\varvec{f})\right) , \end{aligned}$$
(1.5)

where \( \tau _{k}>0\) is an appropriate step size and the soft-thresholding operator \({\mathscr {T}}_{\lambda }(\cdot )\) is defined component-wise as \({\mathscr {T}}_{\lambda }(x)=\mathrm{sign}(x)\max (|x|-\lambda ,0),\text{ with } \; x\in \mathbb {R}.\) ISTA was explicitly proposed in [59]. Its idea, however, can be traced back to the classical proximal forward–backward algorithm [61, 62]. Later, an accelerated version of ISTA, the fast iterative shrinkage-thresholding algorithm (FISTA), was introduced [60, 63] based on an idea of Nesterov [64]. FISTA takes the following form:

$$\begin{aligned} \varvec{\alpha }^{k+1}&={\mathscr {T}}_{\lambda /L_\mathrm{{lip}}}\left( \varvec{y}^{k} -\frac{1}{L_\mathrm{{lip}}}\varvec{W}(\varvec{W}^\mathrm{T}\varvec{y}^{k}-\varvec{f})\right) ,\\ t_{k+1}&=\frac{1+\sqrt{1+4t_{k}^{2}}}{2},\\ \varvec{y}^{k+1}&=\varvec{\alpha }^{k+1}+\frac{t_{k}-1}{t_{k+1}}(\varvec{\alpha }^{k+1} -\varvec{\alpha }^{k}), \end{aligned}$$
(1.6)

where \(L_\mathrm{{lip}}\) is the Lipschitz constant of the gradient of the quadratic term in (1.4).
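A minimal NumPy sketch of ISTA (1.5) and FISTA (1.6) for problem (1.4) is given below, assuming \(\varvec{W}\) is available as a dense matrix; the step sizes follow the text.

```python
import numpy as np

def soft_threshold(x, lam):
    # T_lambda(x) = sign(x) * max(|x| - lambda, 0), applied component-wise
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista(W, f, lam, n_iter=100):
    L = np.linalg.norm(W @ W.T, 2)     # Lipschitz constant of the gradient
    tau = 1.0 / L                      # step size tau_k <= 1 / L_lip
    alpha = np.zeros(W.shape[0])
    for _ in range(n_iter):
        grad = W @ (W.T @ alpha - f)   # gradient of the quadratic term
        alpha = soft_threshold(alpha - tau * grad, lam * tau)
    return alpha

def fista(W, f, lam, n_iter=100):
    L = np.linalg.norm(W @ W.T, 2)
    alpha = np.zeros(W.shape[0]); y = alpha.copy(); t = 1.0
    for _ in range(n_iter):
        alpha_new = soft_threshold(y - (W @ (W.T @ y - f)) / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0   # momentum update
        y = alpha_new + ((t - 1.0) / t_new) * (alpha_new - alpha)
        alpha, t = alpha_new, t_new
    return alpha
```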

1.2.2 ADMM/Split Bregman Algorithm

Consider the following special case of the optimization problem (1.2)

$$\begin{aligned} \min _{\varvec{u}} {\mathscr {L}}(\varvec{u})=\frac{1}{2}\Vert \varvec{A}\varvec{u} -\varvec{f}\Vert _2^{2}+\lambda \Vert \varvec{W}\varvec{u}\Vert _1, \end{aligned}$$

which can be written equivalently as

$$\begin{aligned} \min _{\varvec{u},\varvec{d}} {\mathscr {L}}(\varvec{u},\varvec{d})=\frac{1}{2}\Vert \varvec{A}\varvec{u}-\varvec{f}\Vert _2^{2} +\lambda \Vert \varvec{d}\Vert _1\quad \mathrm{s.t}. \quad \varvec{W}\varvec{u}=\varvec{d}. \end{aligned}$$

The corresponding augmented Lagrangian function [65, Chapter 17] is defined by

$$\begin{aligned} {\mathscr {L}}(\varvec{u},\varvec{d};\varvec{b})=\frac{1}{2}\Vert \varvec{A}\varvec{u}-\varvec{f}\Vert _2^{2} +\lambda \Vert \varvec{d}\Vert _1+\langle \varvec{W}\varvec{u}-\varvec{d},\varvec{b}\rangle +\frac{\mu }{2}\Vert \varvec{W}\varvec{u}-\varvec{d}\Vert _2^{2} \end{aligned}$$

with the Lagrangian multiplier \(\varvec{b}\). Then, with the scaled multiplier \(\varvec{\nu }=\varvec{b}/\mu \), the ADMM or split Bregman algorithm takes the form [49, 56]

$$\begin{aligned} \varvec{u}^{k+1}&=\left( \varvec{A}^\mathrm{T}\varvec{A}+\mu \varvec{W}^\mathrm{T}\varvec{W}\right) ^{-1} \left[ \varvec{A}^\mathrm{T}\varvec{f}+\mu \varvec{W}^\mathrm{T}(\varvec{d}^{k}-\varvec{\nu }^{k})\right] ,\\ \varvec{d}^{k+1}&={\mathscr {T}}_{\lambda /\mu }\left( \varvec{W}\varvec{u}^{k+1}+\varvec{\nu }^{k}\right) ,\\ \varvec{\nu }^{k+1}&=\varvec{\nu }^{k}+(\varvec{W}\varvec{u}^{k+1}-\varvec{d}^{k+1}), \end{aligned}$$
(1.7)

where \(\lambda \) and \(\mu \) are tuning parameters.
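The iteration (1.7) can be sketched in a few lines of NumPy, assuming \(\varvec{A}\) and \(\varvec{W}\) are given as dense matrices; in practice one would exploit their structure rather than solve the u-step directly.

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def admm(A, W, f, lam, mu, n_iter=50):
    d = np.zeros(W.shape[0])
    nu = np.zeros(W.shape[0])      # scaled Lagrangian multiplier
    u = np.zeros(A.shape[1])
    M = A.T @ A + mu * (W.T @ W)   # system matrix of the u-step
    for _ in range(n_iter):
        u = np.linalg.solve(M, A.T @ f + mu * W.T @ (d - nu))
        d = soft_threshold(W @ u + nu, lam / mu)
        nu = nu + (W @ u - d)
    return u
```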

1.2.3 The Primal–Dual Algorithm

Consider the following optimization problem

$$\begin{aligned} \min _{\varvec{u}} F(\varvec{u})+ \varPhi (\varvec{W}\varvec{u}), \end{aligned}$$
(1.8)

where \( F(\varvec{u})\) is the data fidelity term and \(\varPhi (\varvec{W}\varvec{u}) \) is the regularization term that appeared in (1.2). Assume \(F:\mathbb {R}^{n}\rightarrow (-\infty ,+\infty ]\) and \(\varPhi :\mathbb {R}^{m}\rightarrow (-\infty ,+\infty ]\) are closed proper convex functions. The problem (1.8) can be written equivalently as

$$\begin{aligned} \min _{\varvec{u}}\max _{\varvec{w}} F(\varvec{u})+ \langle \varvec{W}\varvec{u},\varvec{w}\rangle -\varPhi ^{*}(\varvec{w}). \end{aligned}$$
(1.9)

Then, the primal–dual hybrid gradient (PDHG) algorithm [52,53,54] can be written as

$$\begin{aligned} \varvec{w}^{k+1}&=(I+\alpha _{k}\partial \varPhi ^{*})^{-1} \left( \varvec{w}^{k}+\alpha _{k}\varvec{W}\varvec{u}^{k}\right) ,\\ \varvec{u}^{k+1}&=(I+\beta _{k}\partial F)^{-1}\left( \varvec{u}^{k} -\beta _{k}\varvec{W}^\mathrm{T}\varvec{w}^{k+1}\right) , \end{aligned}$$
(1.10)

where \(\alpha _{k}\) and \(\beta _{k}\) are tuning parameters (step sizes). Note that in [54], the authors introduced an additional correction update step

$$\begin{aligned} \bar{\varvec{u}}^{k+1}=\varvec{u}^{k+1}+\theta (\varvec{u}^{k+1}-\varvec{u}^{k}) \end{aligned}$$
(1.11)

to the original PDHG algorithm (1.10) and replaced \(\varvec{u}^{k}\) in the \(\varvec{w}^{k+1}\)-step by \( \bar{\varvec{u}}^{k}\).
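A minimal NumPy sketch of (1.10) with the correction step (1.11) follows; since the resolvents depend on the chosen F and \(\varPhi \), they are passed in as proximal-operator callbacks (prox_F and prox_Phis, both assumptions of this sketch).

```python
import numpy as np

def pdhg(W, prox_F, prox_Phis, u0, alpha, beta, theta=1.0, n_iter=100):
    # prox_Phis(z, alpha) evaluates (I + alpha * dPhi^*)^{-1}(z);
    # prox_F(z, beta) evaluates (I + beta * dF)^{-1}(z).
    u = u0.copy()
    u_bar = u0.copy()
    w = np.zeros(W.shape[0])
    for _ in range(n_iter):
        w = prox_Phis(w + alpha * (W @ u_bar), alpha)   # dual update
        u_new = prox_F(u - beta * (W.T @ w), beta)      # primal update
        u_bar = u_new + theta * (u_new - u)             # correction (1.11)
        u = u_new
    return u
```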

1.2.4 SGD

It is very common in machine learning that an optimization problem takes the following form

$$\begin{aligned} \min _{\varvec{\varTheta }}F_{N}(\varvec{\varTheta })=\frac{1}{N}\sum _{i=1}^{N}f_{i}(\varvec{\varTheta }). \end{aligned}$$
(1.12)

The main computational challenge, especially in deep learning, is that N can be huge, e.g., on the order of millions to billions. Therefore, the evaluation of the function value \(F_N\) and its gradient can be rather slow. In such cases, the stochastic gradient descent (SGD) algorithm [66,67,68,69] and its variants [70,71,72] are among the most popular algorithms in deep learning.

The very basic form of (mini-batch) SGD is

$$\begin{aligned} \varvec{\varTheta }^{k+1}=\varvec{\varTheta }^{k}-\alpha _{k}\frac{1}{|\mathscr {S}_{k}|} \sum _{i_{k}\in \mathscr {S}_{k}}\nabla f_{i_{k}}(\varvec{\varTheta }^{k}), \end{aligned}$$

where \(\alpha _{k}\) is the step size (or learning rate) and \(\mathscr {S}_{k}\) is a random subset of \(\{1,2,\cdots ,N\}\). The mini-batch gradient \(\frac{1}{|\mathscr {S}_{k}|}\sum _{i_{k}\in \mathscr {S}_{k}}\nabla f_{i_{k}}(\varvec{\varTheta }^{k})\) is an unbiased estimate of the full gradient and is computationally cheap to evaluate. Other than SGD, numerous randomized algorithms are used in deep learning, such as Adam [73], AdaGrad [74], and RMSProp [75]. A comprehensive review of optimization algorithms for large-scale machine learning problems can be found in [76].
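The basic mini-batch SGD iteration can be sketched as follows; grad_f(i, theta), which returns \(\nabla f_{i}(\varvec{\varTheta })\), is an assumed callback.

```python
import numpy as np

def sgd(grad_f, theta0, N, lr=0.1, batch_size=32, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(n_iter):
        batch = rng.choice(N, size=batch_size, replace=False)   # random S_k
        g = sum(grad_f(i, theta) for i in batch) / batch_size   # unbiased estimate
        theta = theta - lr * g
    return theta
```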

1.3 When Handcraft Modeling Meets Deep Learning

Both handcrafted models and deep models have their advantages and drawbacks, depending on the application. Most handcrafted models are designed with a solid mathematical foundation and can be very well interpreted. However, handcrafted models are not flexible enough to fully leverage large data sets. Deep models, on the other hand, are generally much more flexible and can better extract useful information from large data sets. However, they are generally more challenging to interpret, and, for the moment, they also lack the theoretical foundations of handcrafted models. Therefore, there has been an increasing effort in the community to combine handcrafted modeling and deep modeling so that we can enjoy the benefits of both approaches.

One of the most popular ways of making such a combination is the so-called unrolling dynamics approach. It started with the work of Gregor and LeCun [77], where the authors showed that one could unroll the ISTA in (1.5) to create a feed-forward network. Then, one can train the unrolled ISTA in an end-to-end fashion to determine the parameters that are best suited for the training data. They called the unrolled dynamics LISTA and demonstrated its advantage over ISTA. This work showed that one can unroll a discrete dynamical system to form a network for end-to-end training. More recently, more and more examples have shown that the unrolling dynamics approach strikes a good balance between model interpretability and efficacy. This includes unrolling discrete forms of nonlinear diffusions for image restoration [35, 78] and unrolling optimization algorithms for medical imaging and inverse problems [79,80,81,82,83,84,85]. The unrolling dynamics approach can often result in deep models with better interpretability inherited from the original dynamics.

Furthermore, these deep models normally have far fewer trainable parameters than black-box deep neural networks, which makes them more suitable for learning on relatively small data sets. On the other hand, we may interpret certain classes of deep convolutional networks, such as ResNet, as discrete dynamics, and hence relate deep learning with optimal control [86, 87]. Such a viewpoint not only provides an elegant interpretation of deep neural networks [88], but also enables us to design more effective deep networks for various tasks in machine learning [84, 89,90,91,92,93,94], computer vision [95], inverse problems [96, 97], and natural language processing [98]. More recently, intriguing relations between deep convolutional networks and multigrid methods were addressed in [99], leading to new interpretations of deep models.

The remainder of this paper is organized as follows. In Sect. 2, we review some recently proposed deep neural networks that are popular in medical imaging. Section 3 discusses the understanding of deep neural networks from the perspectives of representation learning and differential equations. Section 4 reviews some recently proposed deep models for medical imaging, where Sect. 4.1 presents some examples of post-processing deep models, Sect. 4.2 collects some models that are designed by combining handcrafted modeling with deep modeling, and Sect. 4.3 reviews task-driven deep models. To conclude, Sect. 5 summarizes the main challenges and opportunities in deep learning based medical imaging.

2 Review of Deep Neural Networks

Deep neural networks (DNNs) have proven to be powerful tools for representing complex data. The main differences between DNNs and traditional machine learning models are the composite nonlinearity of DNNs and end-to-end training, which make DNNs very effective at extracting features that are most suitable for a given task. In recent years, DNNs have been used in various medical imaging tasks, including image reconstruction, segmentation, region-of-interest detection, super-resolution, and classification. In this section, we briefly recall some of the DNNs that are widely adopted in medical imaging.

2.1 ResNet

In computer vision, the residual network (ResNet) [16, 17] is one of the most popular DNNs. The architecture of ResNet is shown in Fig. 1, and it can be formulated mathematically as

$$\begin{aligned} \varvec{u}_{k+1}=\varvec{u}_{k}+{\mathscr {F}}_k(\varvec{u}_{k}), \end{aligned}$$
(2.1)

where \(\varvec{u}_{k}\) (resp. \(\varvec{u}_{k+1}\)) denotes the input (resp. output) feature map of the k-th layer of the ResNet, and \({\mathscr {F}}_k(\varvec{u}_{k})\) is called a nonlinear residual block with trainable parameters. The skip connection of ResNet is crucial in facilitating stable training when the network is very deep. Other DNNs with similar skip connections include the learned diffusion model TRD [78], DenseNet [100], and U-Net [15], among many others.
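A minimal PyTorch sketch of the residual block (2.1) is shown below; the two-convolution residual branch is a common choice, used here only for illustration.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(   # the nonlinear block F_k
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, u):
        # u_{k+1} = u_k + F_k(u_k): the addition is the skip connection
        return u + self.residual(u)
```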

Fig. 1 ResNet

2.2 Autoencoder

Autoencoder (AE) [101, 102] is a type of neural network that is used to learn data representations in an unsupervised manner. It aims to learn an encoder from a set of data together with a decoder so that no essential information is lost during the encoding and decoding process. Figure 2 presents a typical example of the AE architecture. For a given image \(\varvec{X}\), the parameterized mapping \(f_{\varvec{\theta }}\) (e.g., a fully connected or a convolutional neural network) is an encoder that extracts feature maps from \(\varvec{X}\). The encoded multi-channel feature maps are denoted by \(\varvec{Y}=f_{\varvec{\theta }}(\varvec{X})\). The encoded feature maps \(\varvec{Y}\) are then decoded by another parameterized mapping \(g_{\varvec{\theta }^{\prime }}\) to obtain the reconstructed data \(\varvec{Z}\). The parameters \(\varvec{\theta }\) and \(\varvec{\theta }'\) are optimized on a data set so that a properly chosen loss function that measures the average discrepancy between \(\varvec{X}\) and \(\varvec{Z}\) is minimized. AE resembles linear representations such as the Fourier and wavelet transforms if we regard encoding as the decomposition, decoding as the reconstruction, and the feature maps as the coefficients of the representation. However, the representation provided by AE is nonlinear and is learned from a data set.
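A minimal PyTorch sketch of this encoder/decoder structure follows; the fully connected layers and their sizes are illustrative assumptions.

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, dim_in=784, dim_code=64):
        super().__init__()
        # f_theta: encoder mapping X to feature maps Y
        self.encoder = nn.Sequential(nn.Linear(dim_in, dim_code), nn.ReLU())
        # g_theta': decoder mapping Y to the reconstruction Z
        self.decoder = nn.Linear(dim_code, dim_in)

    def forward(self, x):
        y = self.encoder(x)      # Y = f_theta(X)
        return self.decoder(y)   # Z = g_theta'(Y)
```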

To learn a more effective and robust representation, Vincent et al. [14] proposed the stacked denoising autoencoder (SDAE). In SDAE, the encoder and decoder are DNNs, and they are trained to recover \(\varvec{Z}\approx \varvec{X}\) from a noise-corrupted version of the input \(\varvec{X}\). Based on the encoder/decoder framework, Badrinarayanan et al. [103] designed a DNN, called SegNet, for image segmentation. In [104], the encoder/decoder framework was adopted for image denoising and super-resolution. More recently, Chen et al. [105] designed a residual encoder–decoder CNN to suppress noise and preserve features in low-dose CT images that are reconstructed using the filtered back projection (FBP) algorithm.

Fig. 2 Autoencoder

2.3 U-Net

In [15], a U-shaped deep neural network, called U-Net, was proposed for biomedical image segmentation; it is by far one of the most successful deep image segmentation models. The architecture of U-Net is shown in Fig. 3. It resembles the encoder/decoder architecture of AE if we view the left half of the U-Net as an encoder and the right half as a decoder. The main difference between the U-Net and AE is the use of skip connections in U-Net. Similar to the U-Net, Milletari et al. [106] designed a DNN, called V-Net, for 3D volumetric medical image segmentation. Motivated by the U-Net and convolutional framelets [107], Ye et al. [108] designed multi-resolution deep convolutional framelets. More recently, the U-Net has been extended to other image analysis tasks [109].

Fig. 3 U-Net

3 Interpretations of Deep Neural Networks

The development of traditional machine learning methods, such as support vector machines, decision trees, and random forests, has benefited tremendously from theoretical studies in machine learning. However, existing machine learning theory, such as PAC learning, VC-dimension, and Rademacher complexity, may not be the most suitable for analyzing DNNs. Although DNNs are often composed of simple functions, such as convolutions, pooling, and element-wise activation functions, the entire networks are often difficult to analyze. Therefore, theoretical deep learning has become a popular area in machine learning that has attracted a lot of attention from theoretical computer scientists, statisticians, and mathematicians. In this section, we shall review some recent works on interpreting DNNs from two different perspectives, namely representation learning and differential equations. We will see that function approximation is a powerful tool for characterizing the efficacy of a given representation. It provides a rigorous analysis of the capacity of DNNs and how well they can approximate functions living in various function spaces. The perspective through differential equations, on the other hand, is more intuitive than function approximation and can explicitly guide the design of DNN architectures and training algorithms. There are also several other perspectives on the theoretical interpretation of deep learning. One may refer to the course “Theories of Deep Learning” (STATS 385) hosted by David Donoho at Stanford University and the references therein (https://stats385.github.io/).

3.1 Representation Learning Perspective

Images, such as medical images or natural images, are usually assumed to have sparse (or low dimensional) structures. These sparse structures can be effectively extracted by transformations. Successful examples include the (windowed) Fourier transform, wavelet transform, curvelet transform, etc., which provide efficient representations of images. They are pre-designed linear transformations that are independent of the given image data. DNNs can also be viewed as sparse representations that are able to extract sparse features from images. The difference is that DNNs are learned from a set of images and are (highly) nonlinear.

The quality of a given representation can be measured by its ability to approximate functions living in a certain function space. For example, let \(\varPhi :=\{\phi _i(\varvec{x}):\varvec{x}\in \mathbb {R}^n, i\in \mathbb {N}_+\}\) be a set of atoms, and let \(f(\varvec{x})\) be an element in a function space \(\mathscr {F}\) equipped with norm \(\Vert \cdot \Vert \). One of the most basic and important approximation properties is stated as follows: for any given \(\varepsilon >0\), there exists \(\tilde{f}_{\varvec{\alpha },N}:=\sum _{i=1}^N\alpha _{i} \phi _{i}(\varvec{x})\) with \(N\in \mathbb {N}_+\) and \(\varvec{\alpha }=\{\alpha _1,\cdots ,\alpha _N\}\in \mathbb {R}^{N}\) such that

$$\begin{aligned} \Vert f-\tilde{f}_{\varvec{\alpha },N}\Vert < \varepsilon . \end{aligned}$$

A good representation requires fewer atoms (i.e., a smaller N) to achieve an \(\varepsilon \)-approximation. The approximation properties of various types of \(\varPhi \) have been well studied in the literature, such as polynomials, splines, Fourier bases, and wavelets [28, 29, 110].

Neural networks are even more efficient tools that can approximate a function arbitrarily well under suitable conditions [111,112,113]. Both the depth and the width of a neural network are among the most important factors that affect its approximation power. In the following, we review some of the existing characterizations of the approximation properties of shallow and deep neural networks.

Consider a shallow neural network with only one hidden layer

$$\begin{aligned} \tilde{f}_{N}(\varvec{x};\varvec{\varTheta })=\sum _{i=1}^{N} a_i\sigma (\varvec{w}_{i}^\mathrm{T}\varvec{x}+b_i), \end{aligned}$$

where \(\varvec{x}\in \mathbb {R}^{n}\) is the input image data, \(\varvec{\varTheta }=\{a_i,\varvec{w}_{i},b_{i} \}, i=1,\cdots ,N,\) are trainable parameters, and \(\sigma (z)\) is a nonlinear activation function applied element-wise. Examples of \( \sigma (z)\) are \(\text{ ReLU }(z)=\max (0,z)\), \(\text{ tanh }(z) =\frac{\hbox {e}^{z}-\hbox {e}^{-z}}{\hbox {e}^{z}+\hbox {e}^{-z}}\), \(\text{ sigmoid }(z)=\frac{1}{1+\hbox {e}^{-z}}\), and more generally any sigmoidal function [114] with the property

$$\begin{aligned} \sigma (z)={\left\{ \begin{array}{ll} 1, &{} \text {if}\; z\rightarrow +\infty ,\\ 0, &{} \text {if}\; z\rightarrow -\infty . \end{array}\right. } \end{aligned}$$
(3.1)

A DNN is a neural network with multiple hidden layers. It can be viewed as a successive composition of multiple shallow networks. A typical DNN (for regression problems) with depth L and width \(\varvec{N}=(N_1,N_2,\cdots ,N_L)\) denoted as

$$\begin{aligned} \tilde{f}_{L,\varvec{N}}(\varvec{x};\varvec{\varTheta }): \mathbb {R}^n\mapsto \mathbb {R}, \end{aligned}$$

can be recursively defined as \(\varvec{\varTheta }^\ell =(\varvec{\varTheta }^{\ell -1}, \theta ^{\ell })\), \(\tilde{f}_{\varvec{\varTheta }^\ell }=(\theta ^\ell \circ \sigma \circ \tilde{f}_{\varvec{\varTheta }^{\ell -1}})\), \(\theta ^\ell : \mathbb {R}^{N_\ell }\rightarrow \mathbb {R}^{N_{\ell +1}}\) with \(\theta ^{\ell }(\varvec{x})=\varvec{W}^\ell \varvec{x}+\varvec{b}^\ell \), and \(\tilde{f}_{L,\varvec{N}}:=\tilde{f}_{\varvec{\varTheta }^L}\).
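As an illustration of this recursive construction, the following minimal NumPy sketch evaluates a depth-L network from given weight matrices and biases, with \(\sigma \) taken to be ReLU (an assumption of the sketch).

```python
import numpy as np

def dnn(x, weights, biases, sigma=lambda z: np.maximum(z, 0.0)):
    # weights[l], biases[l] define the affine map theta^l(x) = W^l x + b^l;
    # sigma is applied between consecutive affine maps.
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigma(W @ h + b)
    return weights[-1] @ h + biases[-1]   # final affine layer
```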

Earlier results on the approximation property, i.e., universal approximation, suggest that a wide class of functions can indeed be approximated by neural networks with only one hidden layer, though the number of neurons N may increase exponentially as we decrease \(\varepsilon \) [114,115,116]. There are benefits to increasing the depth L of the neural network when approximating a target function. For example, approximation with DNNs can lead to an exponential or polynomial reduction in the number of neurons while maintaining the same level of approximation accuracy [117,118,119,120]. Delalleau and Bengio [121] and Telgarsky [122, 123] presented concrete examples of functions that can be represented more efficiently with DNNs than with shallow networks. In particular, Telgarsky [123] showed that there exist DNNs with \(\mathscr {O}(L^3)\) layers and constant width that cannot be approximated by networks with \(\mathscr {O}(L)\) depth and width less than \(2^{L}\).

Lu et al. [124] investigated the efficiency of depth of ReLU activated DNNs from a different angle by proving that there exist classes of wide neural networks that cannot be realized by any narrow network whose depth is no more than a polynomial bound. Compared to the known result that there are classes of deep networks that cannot be realized by any shallow network whose size is no more than an exponential bound [120], the results of [124] indicate that depth might be more effective than width. Although depth appears to be more important than width, Hanin [125, 126] proved that a minimum width of ReLU activated DNNs is required to ensure the approximation of continuous functions. These results indicate that a good DNN cannot be too narrow; otherwise, it cannot approximate continuous functions even with infinite depth.

More recently, Yarotsky [127] analyzed the dependence of the optimal approximation rate on depth for ReLU activated DNNs. When approximating a multivariate polynomial, Rolnick and Tegmark [128] proved that the total number of neurons in DNNs should grow linearly with respect to the number of variables of the polynomial. Shen et al. [129] provided an intriguing analysis of ReLU activated DNNs via nonlinear approximation of composite dictionaries. They demonstrated the advantage of depth over width quantitatively by comparing the N-term approximation order of DNNs vs. one-hidden-layer neural networks. Beyond generic DNNs, theoretical analysis of the popular ResNet has also been provided [130,131,132].

In [133], the authors investigated the connection between linear finite element functions and ReLU activated DNNs. First, they proposed an efficient ReLU activated DNN to represent any linear finite element function and theoretically established that at least 2 hidden layers are needed for such a representation in \(\varOmega \subseteq \mathbb {R}^d\) when \(d\geqslant 2\). Then, using this relationship, they established an error estimate of \(\mathscr {O}(N^{-\frac{1}{d}})\) for a special kind of ReLU activated DNNs with \(\mathscr {O}(N)\) nonzero parameters by invoking h-adaptive linear finite element approximation theory [134].

Different from the approximation viewpoint, He and Xu [99] developed a unified model, known as MgNet, that simultaneously recovers and extends some CNNs for image classification and multigrid methods for solving discretized PDEs, by combining multigrid and deep learning methodologies.

3.2 Differential Equation and Control Perspective

Given a DNN \(\tilde{f}(\varvec{x};\varvec{\varTheta }): \mathbb {R}^n\rightarrow \mathbb {R}^m\), due to its composite structure as described in the previous subsection, we may view \(\tilde{f}(\varvec{x};\varvec{\varTheta })\) as an iterative mapping between \(\mathbb {R}^n\) and \(\mathbb {R}^m\). It is then natural to view a generic DNN as a certain dynamical system [135]. However, a dynamical system that corresponds to a generic DNN is difficult to analyze since it does not have much special structure to exploit. Fortunately, it has been empirically shown that most effective DNNs have special structures in their architecture. In fact, designing special structures of DNNs, i.e., architecture design, to make them easy to train and generalize better is one of the major research directions in deep learning. Furthermore, the objective of the emerging research topic of neural architecture search (NAS), a subfield of automated machine learning (AutoML), is to search for effective DNN architectures for different data sets and tasks.

One of the most well-known DNNs with special structures is ResNet. Its bypasses (or shortcuts) enable us to efficiently train ultra-deep networks and achieve high accuracies in multiple tasks. The success of ResNet inspired the design of numerous new neural architectures. However, most of these designs were based on empirical studies. Although we can deploy NAS to search for new architectures, the current computational burden of NAS is still prohibitively high for researchers without access to heavy computing resources, and NAS cannot guarantee to find sufficiently new and interpretable neural structures. Therefore, there is a dire need for a way to properly interpret ResNet and its siblings and to seek guiding principles for the architecture design of DNNs.

Recently, Weinan [86] made an inspiring observation that ResNet can be viewed as a forward Euler scheme for solving an ordinary differential equation (ODE), which links the training of DNNs with optimal control. Sonoda and Murata [136] and Li and Shi [88] also regarded ResNet as a dynamical system given by the characteristic lines of a transport equation on the data distribution. Similar observations were also made by Chang et al. [87, 89]. A rigorous justification of the link between ResNet and ODEs was given by Thorpe and van Gennip [137], and that of the link between deep learning and optimal control was given by Weinan et al. [138]. The dynamics and control perspective has enabled us to design more efficient algorithms for solving related deep learning problems [94, 95, 139, 140].

In [90], the authors suggested a general bridge between numerical ODEs and deep neural architectures by observing that multiple state-of-the-art deep network architectures, such as PolyNet [141], FractalNet [142], and RevNet [143], can be viewed as different discretizations of ODEs. Furthermore, Lu et al. [90] proved that ResNet with certain stochastic training strategies weakly approximates stochastic differential equations, which grants a stochastic control perspective on the randomized training of DNNs. More importantly, such new perspectives enable us to systematically design deep neural architectures through numerical (stochastic) differential equations, a rather mature field in applied mathematics. In this subsection, we shall review some of the findings of [90] and some other related works.

3.2.1 Numerical Difference Equations and Architecture Design

We first show how ResNet is related to the forward Euler scheme in numerical ODEs. Consider a building block of ResNet (2.1) as shown in Fig. 1; it can be rewritten as

$$\begin{aligned} \varvec{u}_{k+1}=\varvec{u}_{k}+\varDelta t_k {\mathscr {F}}(\varvec{u}_k, t_k), \end{aligned}$$

or equivalently as

$$\begin{aligned} \frac{\varvec{u}_{k+1}-\varvec{u}_{k}}{\varDelta t_k}={\mathscr {F}}(\varvec{u}_k,t_k), \end{aligned}$$

where \( \varDelta t_k\) is the step size and \( {\mathscr {F}}(\varvec{u}_k,t_k) =\frac{1}{\varDelta t_k}{\mathscr {F}}_{k}(\varvec{u}_k)\). The above formula is the forward Euler scheme for solving the following ordinary differential equation (ODE):

$$\begin{aligned} \frac{\hbox {d}\varvec{u}}{\hbox {d}t}={\mathscr {F}}(\varvec{u},t). \end{aligned}$$
(3.2)

Therefore, ResNet can be viewed as the forward Euler discretization of the ODE (3.2) with step size \(\varDelta t_k=1\) for every k. This was first observed by Weinan [86]. More recently, Zhang et al. [144] showed that there are benefits to considering ResNet with \(0<\varDelta t_k<1\).
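To make the correspondence concrete, here is a minimal NumPy sketch of the forward Euler scheme for (3.2); with step size dt = 1, each update coincides with a ResNet block (2.1).

```python
import numpy as np

def forward_euler(F, u0, t0=0.0, dt=1.0, n_steps=10):
    # Integrate du/dt = F(u, t); each step reads u <- u + dt * F(u, t),
    # which for dt = 1 is exactly the ResNet update u_{k+1} = u_k + F_k(u_k).
    u, t = u0.copy(), t0
    trajectory = [u0.copy()]
    for _ in range(n_steps):
        u = u + dt * F(u, t)
        t += dt
        trajectory.append(u.copy())
    return trajectory
```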

Fig. 4 Schematics of different neural network architectures

In [90], the authors further observed that many other DNNs with bypasses, e.g., PolyNet [141] (Fig. 4b), FractalNet [142] (Fig. 4c), and RevNet [143] (Fig. 4d), can be interpreted as certain temporal discretizations of ODEs. For example, the PolyInception module (Fig. 4b) of PolyNet can be written mathematically as

$$\begin{aligned} (I+{\mathscr {F}}+{\mathscr {F}}^2)(x)=x+{\mathscr {F}}(x)+{\mathscr {F}}({\mathscr {F}}(x)), \end{aligned}$$

where I is the identity map, \({\mathscr {F}}\) is a nonlinear operator, and x is the input feature map. Note that the above polynomial of the mapping \({\mathscr {F}}\) approximates \((I-{\mathscr {F}})^{-1}\) by a truncated Neumann series:

$$\begin{aligned} (I-\varDelta t {\mathscr {F}})^{-1}\approx (I+\varDelta t{\mathscr {F}}+\varDelta t^2{\mathscr {F}}^2). \end{aligned}$$

Therefore, PolyNet can be viewed as an approximation of the backward Euler scheme for solving the ODE (3.2). FractalNet (Fig. 4c) can be viewed as an approximation of the ODE (3.2) by a Runge–Kutta scheme. See [90] for more examples and further details.

These examples suggest a potential link between numerical ODEs and deep neural architectures. A remaining question is whether deep neural architecture design can benefit from this perspective. The authors of [90] designed a new ResNet-like module, called the linear multi-step structure (LM-structure), using linear multi-step schemes from numerical ODEs [145]. The LM-structure (more precisely, a linear two-step structure) can be written mathematically as

$$\begin{aligned} \varvec{u}_{k+1}=(1-\gamma _{k})\varvec{u}_{k}+\gamma _{k}\varvec{u}_{k-1}+{\mathscr {F}}(\varvec{u}_{k},t_k), \end{aligned}$$
(3.3)

where \(\gamma _{k}\in \mathbb {R}\) is a trainable parameter in each layer. Note that when \(\gamma _k=0\) for all k, the LM-structure reduces to ResNet. Figure 4e shows the LM-structure. Empirical results of [90] showed that the LM-structure boosts the classification accuracies of ResNet-like DNNs on CIFAR and ImageNet. It can also reduce the depth (and hence the number of parameters) of ResNet-like DNNs by 50%–90% without hampering accuracy. Other than the LM-structure, one can use the midpoint or leapfrog scheme to design new DNNs [89], or use Runge–Kutta methods [146].
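A minimal PyTorch sketch of the LM-structure (3.3) follows; \(\gamma _k\) is a trainable scalar per block, and the residual branch \({\mathscr {F}}\) is assumed here to be a small convolutional block.

```python
import torch
import torch.nn as nn

class LMBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))   # gamma_k = 0 recovers ResNet
        self.F = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, u_k, u_km1):
        # u_{k+1} = (1 - gamma_k) u_k + gamma_k u_{k-1} + F(u_k)
        return (1 - self.gamma) * u_k + self.gamma * u_km1 + self.F(u_k)
```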

The performance gain of the LM-structure can be explained using the concept of modified equations [147]. By Taylor expansion, the modified equation associated with the LM-structure (3.3) is

$$\begin{aligned} (1+\gamma _{k})\dot{\varvec{u}}_{k}+\frac{1-\gamma _{k}}{2}\varDelta t \ddot{\varvec{u}}_{k} ={\mathscr {F}}(\varvec{u}_{k},t). \end{aligned}$$
(3.4)

Compared to ResNet, the LM-structure has the freedom to balance between \(\ddot{\varvec{u}}_{k}\) and \(\dot{\varvec{u}}_{k}\). Placing a larger weight on \(\ddot{\varvec{u}}_{k}\) can speed up the information propagation of the dynamics, as shown by various earlier works such as [148,149,150]. This is why the LM-structure can achieve comparable accuracies at a much smaller depth than ResNet and its siblings.

3.2.2 Stochastic Training and Optimal Control

Stochastic training techniques, such as dropout and noise injection, are widely adopted in the training of DNNs. They help with the generalization of the trained networks. In [90], the authors showed that certain stochastic training strategies for ResNet, such as shake–shake regularization [151] and stochastic depth [152], can be viewed as the stochastic control problem

$$\begin{aligned}&\min _{\varvec{\varTheta }} {E}_{\varvec{X}(0)\sim {\text{ data }}}\left[ {E}[L(\varvec{X}(T))] +\int _{0}^{T} R(\varvec{\varTheta })\,\mathrm{d}t\right] \\&\quad \mathrm{s.t.} \quad \mathrm{d}\varvec{X}={\mathscr {F}}(\varvec{X},\theta )\mathrm{d}t+\mathscr {G}(\varvec{X},\theta )\mathrm{d}\varvec{B}_{t}, \end{aligned}$$
(3.5)

where the stochastic differential equation in (3.5) is the weak limit of ResNet with the shake–shake mechanism or stochastic depth. This suggests a connection between stochastic training and stochastic control, and a connection between DNNs with randomness and stochastic differential equations. Later, Sun et al. [153] observed that the stochastic training of ResNet and its variants is closely related to the optimal control of backward Kolmogorov equations, and that the popular dropout regularization essentially introduces viscosity into the equations.

4 Deep Models in Medical Image Reconstruction

Classical medical image reconstruction methods, such as FBP and the algebraic reconstruction technique (ART) for CT imaging, are highly efficient and widely used in practice [154]. However, these methods are also sensitive to noise and to the incompleteness of the measured data. To obtain high-quality images, numerous regularization based models and algorithms have been developed [155,156,157] over the past three decades. In recent years, there has been a continuous effort in the medical imaging community to further advance medical image reconstruction by combining traditional image reconstruction methods with deep learning. When combining traditional handcrafted modeling with deep modeling, two general approaches are often adopted: post-processing and raw-to-image. The validity of these two approaches is generally supported by, though still rather incomplete, the analysis of the approximation properties of DNNs as described in Sect. 3.1, and by the dynamics perspective on DNNs with certain special structures as described in Sect. 3.2.

For post-processing, one estimates the mapping between the initially reconstructed low-quality image and its high-quality counterpart. This is possible since DNNs can approximate generic functions or mappings, as discussed in Sect. 3.1. This approach is effective mostly when the initial reconstruction and its high-quality counterpart are not drastically different. However, due to limited measurements and the presence of noise, the initially reconstructed image may contain heavy and complex artifacts that are difficult to remove even by deep models. Furthermore, information missing from the initial reconstruction cannot be reliably recovered by any post-processing. Thus, the post-processing approach has limited performance and is more suitable for initial reconstructions that are already of relatively high quality.

For raw-to-image, one directly estimates the mapping between the raw data (e.g., the projection data of CT and the k-space data of MRI) and the reconstructed image. The challenge, however, is that the data distribution in the domain of raw data is often vastly different from that in the image domain. Learning a direct mapping using a DNN without special structures (e.g., a fully connected network or a vanilla CNN), though not impossible, may require vast amounts of training data, can be computationally expensive, and relies heavily on good initializations of the model parameters (e.g., the AUTOMAP [158]). It is well known in the literature of handcrafted modeling that the mapping ought to have certain dynamic structures, which can be represented by a carefully designed (partial) differential equation or an optimization algorithm solving a certain objective function(al). Thus, it is more plausible to combine handcrafted dynamics with deep learning. The way of making such a combination was depicted in Sect. 3.2 in a relatively general setting, where we did not discuss how \(\mathscr {F}\) should be designed for a given image restoration problem. Nonetheless, it is rather convincing that there are connections between dynamical systems and DNNs, and that there are benefits to recognizing such connections.

Our rich history of handcrafted modeling in image restoration provides us with an abundant set of tools from which we can freely select the mapping \(\mathscr {F}\), via the general technique known as unrolling dynamics [77, 79]. To be more precise, this approach first unrolls an optimization algorithm that was introduced to solve a handcrafted model into a feed-forward network. Then, we incorporate our domain knowledge of the problem at hand to determine which parameters are best learned from the data in an end-to-end fashion. The advantage of designing deep models via unrolling optimization algorithms is threefold: (1) the deep model obtained through unrolling dynamics is more interpretable than a regular deep model such as the U-Net; (2) the number of parameters is normally smaller than in regular deep models, which makes these models more suitable for small-sample learning; (3) it provides a general way of combining domain knowledge with deep learning so that we can easily decide which components of the model need to be learned and which can be handcrafted without losing much of the model's expressive power.

As mentioned in the introduction, one of the major differences between medical image reconstruction and image restoration in computer vision is the quality metric of the reconstructed images. It has long been discussed in the medical imaging community that such a quality metric is, in many scenarios, best made task-based rather than generic, such as PSNR and SSIM. The importance of providing such a task-based metric for medical imaging was recently discussed in [159]. The question is, however, how can we realize such a task-based image quality metric? Recently, the authors of [160] proposed to realize a task-based quality metric by “hooking” an image reconstruction network obtained from unrolling dynamics with an image analysis DNN, so that the images reconstructed by the first network are implicitly evaluated by the second, which effectively makes the quality metric task-based. A similar idea appeared in computer vision for image denoising [161, 162]. On the other hand, these works also suggested a new “raw-to-task” modeling philosophy with encouraging empirical results. Therefore, the entire pipeline of image reconstruction, analysis, and decision making can be effectively integrated.

In the rest of this section, we provide more details on the aforementioned models.

4.1 Post-processing

Post-processing is a procedure to enhance the quality of an initially reconstructed image. In this subsection, we use CT as an example. Due to the incompleteness of measured data in sparse view and limited angle CT, the FBP reconstructed image is often degraded by streaking artifacts (Fig. 5b). Noise caused by low tube currents is another source of degradation of CT images (Fig. 5c). In [105, 163], a residual encoder–decoder CNN (Fig. 2) was used to approximate the mapping between the degraded image and the clean image. This model is efficient in removing noise from FBP reconstructed CT images. To protect subtle structures in CT images while suppressing noise, Yang et al. [164] adopted a generative adversarial network (GAN) with the loss function defined by a combination of the Wasserstein distance and the perceptual difference between the input degraded image and the corresponding clean image.

Fig. 5 FBP reconstructed images

To reduce the radiation dose and acquisition time, one can decrease the number of projections of X-ray CT, which is known as sparse view or limited angle CT. Such incompleteness of measurements leads to streaking artifacts with global and yet relatively simple structures in the FBP reconstructed CT images. In this case, a DNN with a multi-scale architecture can be used to capture the global patterns of the streaking artifacts. With this observation, Jin et al. [165] and Han et al. [166] utilized the U-Net (Fig. 3) to reduce artifacts in FBP reconstructed sparse view CT images. The repaired high-quality CT image is obtained by subtracting the artifacts learned by the U-Net from the degraded input image. In some sense, the U-Net plays the role of residual learning [16].

4.2 Raw-to-Image

We now describe how optimization algorithms can be unrolled and set up as deep feed-forward networks for end-to-end training. We remark that, under some specific conditions, learning-empowered optimization algorithms obtained via unrolling dynamics can have better provable convergence than the original optimization algorithms [82, 167, 168]. This was in fact the original motivation of Gregor and LeCun [77] for using machine learning to improve optimization algorithms. In this subsection, however, we shall focus on the “dual” aspect of unrolling dynamics, i.e., how optimization algorithms inspire new and more effective deep network architectures for medical image reconstruction and inverse problems in general.

4.2.1 ADMM-Net

The ADMM-Net proposed by Yang et al. [79] was the first work to suggest the potential benefit of designing deep neural networks for inverse problems by unrolling optimization algorithms.

In the iteration scheme (1.7) of the ADMM algorithm (Sect. 1.2.2), the tuning parameters, such as \(\mu \) and \(\lambda \), and the handcrafted operator \(\varvec{W}\) are difficult to determine adaptively for a given data set. In [79], the authors proposed to unroll the ADMM algorithm into a new deep model, named ADMM-Net. By doing so, the tuning parameters and the predefined linear operator \(\varvec{W}\) all become learnable from the training data. The proximal operator of the sparsity promoting function \(\varPhi =\Vert \cdot \Vert _{1}\) is parameterized by a piecewise linear function with learnable parameters as well. As a result, the thresholding operator \({\mathscr {T}}_{\lambda }(\cdot )\) in the ADMM algorithm is also learned from the training data. In a basic version of ADMM-Net [79], \(\varvec{d}^{k+1}\) is updated by

$$\begin{aligned} \varvec{d}^{k+1}={\mathscr {T}}_{\varvec{\varTheta }_{1}} \left( {\mathscr {W}}_{\varvec{\varTheta }_{2}} (\varvec{u}^{k+1})+\varvec{b}^{k}\right) , \end{aligned}$$
(4.1)

where \({\mathscr {T}}_{\varvec{\varTheta }_{1}}(\cdot )\) is a parameterized piecewise linear function with parameters \(\varvec{\varTheta }_{1}\), and \({\mathscr {W}}_{\varvec{\varTheta }_{2}}\) is a parameterized convolution layer with parameters \(\varvec{\varTheta }_{2}\). The ADMM-Net was later further improved by Yang et al. [169]; the new model, called the Generic-ADMM-Net, adopted a different variable splitting strategy in the derivation of the ADMM algorithm and achieved state-of-the-art MR image reconstruction results with a significant margin over the BM3D-based algorithm.
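A minimal PyTorch sketch of the learned d-update (4.1) follows; for brevity, soft-thresholding with a learnable threshold stands in for the learned piecewise linear function of [79], and the convolution layer plays the role of \({\mathscr {W}}_{\varvec{\varTheta }_{2}}\).

```python
import torch
import torch.nn as nn

class LearnedDUpdate(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)  # W_Theta2
        self.threshold = nn.Parameter(torch.tensor(0.1))         # part of Theta1

    def forward(self, u, b):
        z = self.conv(u) + b
        # learnable soft-thresholding in place of the piecewise linear T_Theta1
        return torch.sign(z) * torch.relu(torch.abs(z) - self.threshold)
```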

4.2.2 Primal–Dual Networks (PD-Net)

In [80], the authors unrolled the iteration scheme (1.10) and (1.11) of the PDHG algorithm to design a new deep model for CT image reconstruction, called the primal–dual network (PD-Net). The main idea is to approximate each resolvent/proximal operator [170] in the subproblems of PDHG by a neural network. Thus, it circumvents the difficulty of choosing optimal forms of \(\varPhi \) and F. One layer of PD-Net takes the form

$$\begin{aligned} \varvec{w}^{k+1}&={\mathscr {N}}_{\varvec{w}}\left( [\varvec{w}^{k},\varvec{W}\varvec{u}^{k}]; \varvec{\varTheta }_{\varvec{w}}^{k}\right) ,\\ \varvec{u}^{k+1}&={\mathscr {N}}_{\varvec{u}}\left( [\varvec{u}^{k},\varvec{W}^\mathrm{T}\varvec{w}^{k+1},\varvec{A}^\mathrm{T}\varvec{f}]; \varvec{\varTheta }_{\varvec{u}}^{k}\right) , \end{aligned}$$
(4.2)

where \(\varvec{f}\) is the measured data, \(\varvec{A}\) is the imaging operator, and \({\mathscr {N}}_{\varvec{w}}(\cdot ;\varvec{\varTheta }_{\varvec{w}}^{k} )\) and \({\mathscr {N}}_{\varvec{u}}(\cdot ;\varvec{\varTheta }_{\varvec{u}}^{k} )\) are neural networks parameterized by \(\varvec{\varTheta }_{\varvec{w}}^{k} \) and \(\varvec{\varTheta }_{\varvec{u}}^{k}\), respectively. The notation \([\varvec{v}_1,\cdots ,\varvec{v}_m]\) denotes the concatenation of the components \(\varvec{v}_1, \cdots , \varvec{v}_m\) into a higher dimensional tensor. The linear operator \(\varvec{W}\) can be either fixed or learned from the data. PD-Net achieves a significant performance boost compared with FBP and some handcrafted reconstruction models [80, 171].
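A minimal PyTorch sketch of one PD-Net layer (4.2) follows; the operators \(\varvec{W}\) and \(\varvec{A}^\mathrm{T}\) are passed in as callables, the two small CNNs play the roles of \({\mathscr {N}}_{\varvec{w}}\) and \({\mathscr {N}}_{\varvec{u}}\), and all channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PDLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.net_w = nn.Conv2d(2, 1, 3, padding=1)  # N_w: acts on [w, W u]
        self.net_u = nn.Conv2d(3, 1, 3, padding=1)  # N_u: acts on [u, W^T w, A^T f]

    def forward(self, u, w, f, W, Wt, At):
        # concatenation [.,.] is along the channel dimension
        w = self.net_w(torch.cat([w, W(u)], dim=1))
        u = self.net_u(torch.cat([u, Wt(w), At(f)], dim=1))
        return u, w
```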

4.2.3 Joint Spatial-Radon (JSR)-Net

To suppress the artifacts induced by incomplete data and noise, Dong et al. [172] proposed a JSR domain reconstruction model for sparse view CT imaging as follows:

$$\begin{aligned} \min _{\varvec{u},\varvec{f}} {\mathscr {F}}(\varvec{u},\varvec{f},\varvec{Y})+\mathscr {R}(\varvec{u},\varvec{f}), \end{aligned}$$
(4.3)

where the data fidelity term \({\mathscr {F}}(\varvec{u},\varvec{f},\varvec{Y})\) is defined by

$$\begin{aligned} {\mathscr {F}}(\varvec{u},\varvec{f},\varvec{Y})=\frac{1}{2}\Vert R_{\varvec{\varGamma }^{c}}(\varvec{f}-\varvec{Y})\Vert ^{2} +\frac{\alpha }{2}\Vert R_{\varvec{\varGamma }}(\varvec{A}\varvec{u}-\varvec{f})\Vert ^{2} +\frac{\gamma }{2}\Vert R_{\varvec{\varGamma }^{c}}(\varvec{A}\varvec{u}-\varvec{Y})\Vert ^{2}, \end{aligned}$$

and the regularization term defined by

$$\begin{aligned} \mathscr {R}(\varvec{u},\varvec{f})=\Vert \varvec{\lambda }_{1}\cdot \varvec{W}_{1}\varvec{u}\Vert _{1,2} +\Vert \varvec{\lambda }_{2}\cdot \varvec{W}_{2}\varvec{f}\Vert _{1,2}. \end{aligned}$$

The notation \( R_{\varvec{\varGamma }}\) denotes a restriction operator with respect to the missing data region indexed by \(\varvec{\varGamma }\): \(R_{\varvec{\varGamma }}\) takes the value 1 if the element's index is contained in \(\varvec{\varGamma }\) and 0 elsewhere. Here, \(\varvec{\varGamma }^{c}\), the complement of \(\varvec{\varGamma }\), indicates the region of available measured data. \(\varvec{A}\) is a discrete form of the Radon transform, and \(\varvec{Y}\) is the measured projection data. Note that, in the JSR model, \(\varvec{u}\) and \(\varvec{f}\) are the underlying CT image and the restored high-quality projection data, respectively. \(\varvec{W}_{i},i=1,2\), are tight frame transforms and \(\varvec{\lambda }_{i},i=1,2\), are the regularization parameters.
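As an illustration, the JSR data fidelity term can be evaluated with binary masks for the restriction operators; this minimal NumPy sketch uses a dense matrix A as a stand-in for the discrete Radon transform.

```python
import numpy as np

def jsr_fidelity(A, u, f, Y, mask_missing, alpha, gamma):
    # mask_missing encodes R_Gamma; its complement encodes R_{Gamma^c}
    mask_measured = 1.0 - mask_missing
    t1 = 0.5 * np.sum((mask_measured * (f - Y)) ** 2)
    t2 = 0.5 * alpha * np.sum((mask_missing * (A @ u - f)) ** 2)
    t3 = 0.5 * gamma * np.sum((mask_measured * (A @ u - Y)) ** 2)
    return t1 + t2 + t3
```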

The handcrafted JSR model (4.3) enforces data consistency in the Radon domain and the image domain simultaneously, which leads to improved quality of the reconstructed image. A similar data fidelity design was adopted in [173] to model positron emission tomography. Later, Zhan and Dong [174] proposed to improve the JSR model by learning the tight frame transforms \(\varvec{W}_i\) from the data. More recently, a re-weighting strategy was introduced into the JSR model to reduce metal artifacts in multi-chromatic energy CT [175].

Existing work has shown the potential of the JSR framework, and it is natural to consider using unrolling to derive a deep model from algorithms solving the JSR model. In [85], the authors designed the JSR-Net for sparse view and limited angle CT image reconstruction. The JSR-Net is derived by unrolling an alternating optimization algorithm whose subproblems are solved by ADMM. Similar to the PD-Net, the JSR-Net also adopts neural networks to approximate the proximal operators. The advantage of JSR-Net is that it can efficiently utilize multi-domain image features to improve the quality of the reconstructed image.

4.3 Raw-to-Task

The traditional workflow of medical image analysis has two separate stages: (1) reconstruction of a high-quality image from the raw data (see Fig. 6a), and (2) diagnosis based on the high-quality reconstructed image (see Fig. 6b). The drawbacks of the two-stage approach and the potential benefit of uniting the two stages were discussed earlier. Here, we describe how we can join the two stages into one unified step (see Fig. 6c).

Fig. 6 CNN based workflows for medical image reconstruction and analysis

As discussed in Sect. 4.2, we can design feed-forward deep networks for image reconstruction. Once we have an image, there are plenty of choices of deep neural networks for various image analysis tasks. The simplest and most natural way of joining image reconstruction and image analysis is to connect the two networks together and conduct end-to-end training (from scratch or by fine-tuning). Such an idea was first introduced by Wu et al. [160] in medical imaging and by Liu et al. [161, 162] in computer vision for image denoising. By doing so, the second network, for image analysis, can be regarded as a task-based image quality metric that is learned from the data. As shown in Wu et al. [160], where the image analysis task was lung nodule recognition, the learned image quality metric automatically placed more emphasis within the lung areas and less emphasis elsewhere. Such a quality metric is specific to the task of lung nodule recognition since the image quality outside of the lung region is irrelevant to the task.

5 Challenge and Opportunities

Although deep learning based models continue to dominate medical imaging, there are still plenty of remaining challenges in deep modeling that limit the application and implementation of these new methods in clinical practice. These challenges also present themselves as new opportunities for researchers working in related fields.

  • The everlasting hunger for labeled data. There are only limited labeled data available for developing new deep models in medical imaging. Annotation of medical images is time-consuming and requires expert knowledge from physicians. Can we design effective learning models that make good use of both the (very limited) labeled data and the (relatively more abundant) unlabeled data?

  • The limited number of observations. Due to morbidity and privacy concerns, it is generally difficult to gather very large medical data sets for a specific task. Furthermore, the number of rare cases is (by definition) small, yet such cases can be much more valuable than common ones. Can we design learning models and data augmentation techniques that effectively extract knowledge from these limited samples and acknowledge the unequal importance among the samples?

  • Radiologists do not make clinical decisions based only on images. More information from the patients, as well as the knowledge doctors acquire from their years of training in medical school, is also crucial in decision making. Thus, incorporating data gathered from multiple diverse sources into deep modeling is important for improving system performance.

  • Reasoning is just as important as, if not more important than, inference. Currently, most deep models hide the reasoning procedure. There is a chance that a model makes accurate predictions based on wrong reasoning, which makes the model unreliable. Can we incorporate deep modeling with reasoning (such as causal inference) or with medical knowledge graphs? This may further reduce the amount of annotated data needed to train deep models without hurting performance.