
1 Introduction

In the era of automation, accurate and efficient automated processing of documents is of the utmost importance for streamlining modern business workflows [1,2,3]. At the same time, it has vast applications in the preservation of historical scriptures [4,5,6] that contain valuable information about ancient cultural heritages and scientific contributions. Deep learning (DL) has recently emerged as a powerful tool for handling a wide variety of document processing tasks, showing remarkable results in areas such as document classification [1, 7], optical character recognition (OCR) [8], and named entity recognition (NER) [2, 9]. However, it remains challenging to apply DL-based models to real-world documents due to a variety of distortions and degradations that frequently occur in these documents. Document image enhancement (DIE) is a core research area in document analysis that focuses on recovering clean and improved images of documents from their degraded counterparts. Depending on the severity of the degradation, a document may display wrinkles, stains, smears, or bleed-through effects [10,11,12]. Additionally, distortions may result from scanning documents with a smartphone, which may introduce shadows [13], blurriness [14], or uneven illumination. Such degradations, which are particularly prevalent in historical documents, can significantly deteriorate the performance of deep learning models on downstream document processing tasks [15]. Therefore, it is essential that prior to applying these models, there be a pre-processing step that performs denoising and recovers a clean version of the degraded document image.

Over the past few decades, DIE has been the subject of several research efforts, including both classical [16, 17] and deep learning-based studies [6, 13, 18, 19]. Lately, generative models such as deep variational autoencoders (VAEs) [20] and generative adversarial networks (GANs) [21] have gained popularity in this domain, owing to their remarkable success in natural image generation [21, 22] and restoration tasks [23,24,25]. Generative models have attracted considerable attention due to their ability to accurately capture the underlying distribution of the training data, which allows them not only to generate highly realistic and diverse samples [22], but also to generate missing data when necessary [26]. As a result, a number of GAN and VAE based approaches have been recently proposed for DIE tasks, such as binarization [6, 18, 27], deblurring [6, 19], and watermark removal [6].

Diffusion models [28] are a new class of generative models inspired by the process of diffusion in non-equilibrium thermodynamics. In the context of image generation, the underlying mechanism of diffusion models involves a fixed forward process of gradually adding Gaussian noise to the image, and a learnable reverse process to denoise and recover the clean image, utilizing a Markov chain structure. Diffusion models have been shown to have several advantages over GANs and VAEs such as their high training stability [28,29,30], diverse and realistic image synthesis [31, 32], and better generalization to out-of-distribution data [33]. Additionally, conditional diffusion models have been employed to perform image synthesis with an additional input, such as class labels, text, or source image and have been successfully adapted for various natural image restoration tasks, including super-resolution [34], deblurring [35], and JPEG restoration [36]. Despite their growing popularity, however, there is no existing literature that has explored their potential in the context of document image enhancement.

In this study, we investigate the potential of diffusion models for the task of document image binarization, and introduce a novel approach for restoring clean binarized images from degraded document images using cold diffusion. We conduct a comprehensive evaluation of our proposed approach on multiple publicly available benchmark datasets for document binarization, demonstrating the effectiveness of our methodology in producing high-quality binarized images from degraded document images. The main contributions of this paper are three-fold:

– To the best of the authors’ knowledge, this is the first work that presents a flexible end-to-end document image binarization framework based on diffusion models.

– We evaluate the performance of our approach on 9 different benchmark datasets for document binarization, which include DIBCO ’9 [37], H-DIBCO ’10 [11], DIBCO ’11 [38], H-DIBCO ’12 [39], DIBCO ’13 [12], H-DIBCO ’14 [40], H-DIBCO ’16 [41], DIBCO ’17 [42], and H-DIBCO ’18 [43].

– Through a comprehensive quantitative and qualitative evaluation, we demonstrate that our approach outperforms several classical approaches as well as the existing state-of-the-art on 7 of the datasets, while achieving competitive performance on the remaining 2 datasets.

2 Related Work

2.1 Document Image Enhancement

Document image enhancement (DIE) has been extensively studied in the literature over the past few decades [5, 16, 44,45,46]. Classical approaches to DIE were primarily based on global thresholding [16], local thresholding [44, 47], or hybrids of the two [48], and work by determining threshold values that segment the pixels of a document image into foreground and background. In a different direction, energy-based segmentation approaches such as Markov random fields (MRFs) [49] and conditional random fields (CRFs) [50], as well as classical machine learning-based approaches such as support vector machines (SVMs) [17, 51], have also been widely explored in the past.

In recent years, there has been a burgeoning interest in the application of deep learning-based techniques for the enhancement of document images [4, 52,53,54]. The earliest work in this area focused mainly on convolutional neural networks (CNNs) [4, 5, 52, 55]. One notable example is the work of Pastor-Pellicer et al. [56], who proposed a CNN-based classifier in conjunction with a sliding-window approach for segmenting images into foreground and background regions. Building upon this, Tensmeyer et al. [52] presented a more advanced methodology that entailed feeding raw grayscale images, along with relative darkness features, into a multi-scale CNN, and training the network using a pseudo F-measure loss. Another approach was proposed by Calvo-Zaragoza et al. [55], in which a CNN-based auto-encoder (AE) was trained to map degraded images to clean ones in an end-to-end fashion. A similar approach was presented by Kang et al. [5], who employed a pre-trained U-Net based auto-encoder model for binarization with minimal training data requirements. Since then, a number of AE-based approaches have been proposed for DIE tasks [53, 54]. In a slightly different direction, Castellanos et al. [57] have also investigated domain adaptation in conjunction with deep neural networks for the task of document binarization.

Generative Adversarial Networks (GANs) have also been extensively explored in this field to generate clean images by conditioning on degraded versions [6, 19, 27, 46]. These methods typically consist of a generator that produces a clean binarized version of the image, along with a discriminator that assesses the results of the binarization. Zhao et al. [27] proposed a cascaded GAN-based approach for the task of document image binarization and demonstrated excellent performance on a variety of benchmark datasets. Jemni et al. [58] recently presented a multi-task GAN-based approach which incorporates a text recognition network in combination with the discriminator to further improve text readability along with binarization. Similarly, Yu et al. [46] proposed a multi-stage GAN-based approach to document binarization that first applies a discrete wavelet transform to the images to perform enhancement, and then trains a separate GAN for each channel of the document image. Besides GANs and CNN-based auto-encoders, the recent success of transformers in natural language processing (NLP) [9] and vision [59] has also sparked interest in their application to the enhancement of document images. In a recent study, Souibgui et al. [45] proposed a transformer-based auto-encoder model that demonstrated state-of-the-art performance on several document binarization datasets.

3 ColDBin: The Proposed Approach

This section presents the details of our proposed approach and explains its relationship to standard diffusion [28]. The overall workflow of our approach is illustrated in Fig. 1. Primarily inspired by cold diffusion [60], our approach involves training a deep diffusion network for document binarization in two steps: a forward diffusion step and a reverse restoration step. As shown, in the forward diffusion step, a clean ground-truth document image is degraded to a specified severity level based on a given type of input degradation. In the reverse restoration step, a neural network is tasked with undoing the forward diffusion process in order to generate a clean ground-truth image from an intermediary degraded image. These forward and reverse steps are repeated in a cycle, and the neural network is trained for the binarization task by applying image reconstruction loss to its output. In the following sections, we provide a more detailed explanation of the forward and reverse steps of our approach.

Fig. 1.

Demonstration of the forward diffusion and reverse restoration processes of our approach. The forward diffusion process incrementally degrades a clean ground-truth image into its degraded counterpart, whereas the reverse restoration process, defined by a neural network, generates a clean binary image from a degraded input image

3.1 Forward Diffusion

In the context of document binarization, let \(P=\{(x,x_{0})\sim (\mathcal {X}, \mathcal {X}_{0})\}_{n=1}^N\) define a training set consisting of pairs of degraded document images x and their corresponding binarized ground-truth images \(x_{0}\). Let \(\mathbb {D}(x_0, t)\) be a diffusion operator that adds degradation to a clean ground-truth image \(x_0\) proportional to the severity \(t \in \{0,1,\dots ,T\}\), where T is the maximum severity permitted. The degraded image at any given severity t can then be derived as follows:

$$\begin{aligned} x_t=\mathbb {D}(x_0, t) \end{aligned}$$
(1)

Consequently, the following constraint must be satisfied:

$$\begin{aligned} \mathbb {D}(x_0, 0) = x_0 \end{aligned}$$
(2)

In standard diffusion [28], this forward diffusion operator \(\mathbb {D}(x_0, t)\) is generally defined as a fixed Markov process that gradually adds Gaussian noise \(\epsilon \) to the image using a variance schedule specified by \(\beta _1\dots \beta _T\). In particular, it is defined as the posterior \(q(x_1,\dots , x_T | x_0)\) that converts the data distribution \(q(x_0)\) to the latent distribution \(q(x_T)\) as follows:

$$\begin{aligned} q(x_1,\dots ,x_T|x_0)&:=\prod _{t=1}^{T}q(x_t|x_{t-1}) \\ q(x_t|x_{t-1})&:= \mathcal {N}(x_t; \sqrt{1-\beta _t}\,x_{t-1}, \beta _t\textbf{I}) \end{aligned}$$

where \(\beta _t\) is a hyper-parameter that defines the severity of degradation at each severity level t. An important property of the above forward process is that it allows sampling \(x_t\) at any arbitrary severity t in closed form: using the notation \(\alpha _t:=1 - \beta _t\) and \(\hat{\alpha }_t:=\varPi _{s=1}^t\alpha _s\), we have

$$\begin{aligned} q(x_t|x_0) := \mathcal {N}(x_t; \sqrt{\hat{\alpha }_t}x_{0}, (1-\hat{\alpha }_t)\textbf{I}) \end{aligned}$$
(3)

This results in the following diffusion operator \(\mathbb {D}(x_0, t)\):

$$\begin{aligned} x_t = \mathbb {D}(x_0, t) = \sqrt{\hat{\alpha }_t}x_{0} + \sqrt{1-\hat{\alpha }_t}\epsilon , \quad \epsilon \sim \mathcal {N}(\textbf{0},\textbf{I}) \end{aligned}$$
(4)

Our approach maintains the same forward process as standard diffusion, except that Gaussian noise \(\epsilon \) is not used to define the diffusion operator \(\mathbb {D}(x_0, t)\) (hot diffusion). Rather, we define it as a cold diffusion operation that interpolates between the binarized ground-truth image \(x_0\) and its degraded counterpart image x based on the noise schedule \(\beta _1\dots \beta _T\). More formally, given a fully degraded input image x and its respective binarized ground-truth image \(x_{0}\), an intermediate degraded image \(x_t\) at severity t is then defined as follows:

$$\begin{aligned} x_t = \mathbb {D}(x_{0}, x, t) = \sqrt{\hat{\alpha }_t}x_{0} + \sqrt{1-\hat{\alpha }_t}x, \quad x_{0} \sim \mathcal {X}_{0}, x \sim \mathcal {X} \end{aligned}$$
(5)

Note that this procedure is essentially the same as adding Gaussian noise \(\epsilon \) in standard diffusion, except that here we add a progressively more heavily weighted degraded image to the clean ground-truth image to generate an intermediary noisy image. In addition, our diffusion operator for binarization is slightly modified to \(\mathbb {D}(x_{0}, x, t)\) and requires both the ground-truth image \(x_0\) and the target degraded image x for the forward process.
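To make the forward operator concrete, the following minimal PyTorch-style sketch implements Eq. 5 as a weighted interpolation between the ground-truth and degraded images. The function and argument names (forward_diffuse, alpha_hat) are illustrative, and a precomputed schedule of cumulative products \(\hat{\alpha }_1,\dots ,\hat{\alpha }_T\) is assumed.

```python
import torch

def forward_diffuse(x0: torch.Tensor, x: torch.Tensor, t: torch.Tensor,
                    alpha_hat: torch.Tensor) -> torch.Tensor:
    """Cold diffusion forward operator D(x0, x, t) from Eq. 5 (sketch).

    x0        : clean binarized ground-truth images, shape (B, C, H, W)
    x         : fully degraded input images,         shape (B, C, H, W)
    t         : integer severity levels in {1, ..., T}, shape (B,)
    alpha_hat : cumulative products of (1 - beta_s), shape (T,)
    """
    # Gather the cumulative product for each sample and broadcast it
    # over the channel and spatial dimensions.
    a = alpha_hat[t - 1].view(-1, 1, 1, 1)
    # Interpolate: the degraded image x receives more weight as t grows.
    return a.sqrt() * x0 + (1.0 - a).sqrt() * x
```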

3.2 Reverse Restoration

Let \(\mathbb {R}(x_t, t)\) define the reverse restoration operator that restores any degraded image \(x_t\) at severity t to its clean binarized form \(x_0\):

$$\begin{aligned} \mathbb {R}(x_t, t) \approx x_0 \end{aligned}$$
(6)

In standard diffusion [28], this restoration operator \(\mathbb {R}(x_t, t)\) is generally defined as a reverse Markov process \(p(x_0,\dots ,x_{T-1}|x_T)\), parameterized by \(\theta \), that transforms data from the latent distribution \(p_\theta (x_T)\) to the data distribution \(p_\theta (x_0)\); the process typically starts from \(p(x_T) = \mathcal {N}(x_T ; \textbf{0},{\textbf {I}})\):

$$\begin{aligned} p(x_0,\dots ,x_{T-1}|x_T)&:=\prod _{t=1}^{T}p_\theta (x_{t-1}|x_t) \\ p_\theta (x_{t-1}|x_t)&:=\mathcal {N}(x_{t-1}; \mu _\theta (x_t, t), \sigma _\theta (x_t, t)^2\textbf{I}) \end{aligned}$$

Our approach uses the same reverse restoration process as the standard diffusion [28], with the exception that it begins with a degraded input image \(x_T \sim \mathcal {X}\) instead of Gaussian noise \(x_T \sim \mathcal {N}(x_T ; \textbf{0},{\textbf {I}})\). In practice, \(\mathbb {R}(x_t, t)\) is generally implemented as a neural network \(\mathbb {R}_\theta (x_t, t)\) parameterized by \(\theta \) which is trained to perform the reverse restoration task. In our approach, the restoration network \(\mathbb {R}_\theta (x_t, t)\) is trained by minimizing the following loss:

$$\begin{aligned} \min _\theta \mathbb {E}_{x\sim \mathcal {X},x_{0}\sim \mathcal {X}_{0}}||\mathbb {R}_\theta (\mathbb {D}(x_{0}, x, t), t) - x_{0}|| \end{aligned}$$
(7)

where \(||\cdot ||\) denotes a norm, which we take to be the standard \(\ell _2\) norm in this work. The overall training process of the restoration network is given in Algorithm 1. As shown, the restoration network \(\mathbb {R}_\theta (x_t, t)\) is initialized with a maximum severity level of T. In each training iteration, a mini-batch of degraded images x and their corresponding binarized ground-truth images \(x_0\) is randomly sampled from the training set P, and the degradation severity t is randomly sampled from the integer set \(\{1,\dots ,T\}\). The severity value t is then used in combination with the ground-truth \(x_0\) and degraded image x pairs to compute the intermediate interpolated images \(x_t\) using Eq. 5 (line 6). The restoration network \(\mathbb {R}_\theta (x_t, t)\) is then used to recover a binarized image from the interpolated image \(x_t\). Finally, the network is optimized at each step by taking a gradient step on Eq. 7 (line 6).

Algorithm 1. Training procedure of the restoration network \(\mathbb {R}_\theta \)
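As an illustration of the training procedure, a simplified optimization step consistent with Algorithm 1 and Eq. 7 might look as follows. It reuses the forward_diffuse helper sketched in Sect. 3.1 and uses the mean-squared error as a stand-in for the \(\ell _2\) objective; the names and structure are a simplified sketch rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def train_step(restoration_net, optimizer, x, x0, T, alpha_hat):
    """One training iteration of the restoration network (sketch).

    x, x0 : mini-batch of degraded images and binarized ground truths
    T     : maximum severity level
    """
    # Sample a random severity level t ~ U{1, ..., T} for each image.
    t = torch.randint(1, T + 1, (x.size(0),), device=x.device)
    # Forward diffusion (Eq. 5): interpolate ground truth and degraded image.
    x_t = forward_diffuse(x0, x, t, alpha_hat)
    # Reverse restoration: predict the clean binarized image.
    x0_pred = restoration_net(x_t, t)
    # Reconstruction loss (Eq. 7); MSE serves as a squared-l2 surrogate.
    loss = F.mse_loss(x0_pred, x0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```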

3.3 Restoration Network

The complete architecture of the restoration network \(\mathbb {R}_\theta (x_t, t)\) used in our approach is illustrated in Fig. 1. As shown, we used a U-Net [61] inspired architecture as the restoration network, which takes as input the degraded image \(x_t\) and the diffusion severity \(t \in \{1, 2, \dots , T\}\) and generates a binarized image as the output. The input severity level t is transformed into a severity embedding \(t_e\) based on the sinusoidal positional encoding proposed in [62]. The embedded severity and the image are then passed through multiple downsampling blocks, a middle processing block, and multiple upsampling blocks to generate the output image. Each downsampling and upsampling block is characterized by two ConvNeXt [63] blocks, a residual block with a linear attention layer, and a downsampling or upsampling layer, respectively. The middle block consists of a ConvNeXt block followed by an attention module and another ConvNeXt block, and is inserted between the downsampling and upsampling phases.
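As an illustration, the severity embedding \(t_e\) can be computed with the sinusoidal encoding of [62] as sketched below; the embedding dimension of 64 matches the value reported in Sect. 4.1, while the remaining details are illustrative assumptions.

```python
import math
import torch

def severity_embedding(t: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Sinusoidal embedding of the severity level t (sketch).

    t   : integer severity levels, shape (B,)
    dim : embedding dimension (must be even)
    Returns a tensor of shape (B, dim).
    """
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to 1/10000.
    freqs = torch.exp(
        -math.log(10000.0) * torch.arange(half, device=t.device) / (half - 1)
    )
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)
```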

3.4 Inference Strategies

Algorithm 2. Cold diffusion sampling for reverse restoration

We investigated two different inference strategies for restoring images from their degraded counterparts: direct restoration and cold diffusion sampling. Direct restoration simply applies the restoration operator \(\mathbb {R}_\theta (x_t, t)\) to a degraded input image x with degradation severity t set to T. On the other hand, cold diffusion sampling as proposed in [60] iteratively performs the reverse restoration process over T steps as described in Algorithm 2. Although a number of sampling strategies have been proposed previously for diffusion models [28, 64], Bansal et al. [60] demonstrated in their work that this sampling strategy performs better than standard sampling [28] for cold diffusion processes, and therefore it has been investigated in this study.
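For reference, a minimal sketch of the cold diffusion sampling loop is given below, following the improved sampling rule of Bansal et al. [60] and adapted to our operator \(\mathbb {D}(x_0, x, t)\), which also requires the degraded input x. It reuses the forward_diffuse helper from Sect. 3.1; function names are illustrative, and direct restoration corresponds to a single call of the network with t = T.

```python
import torch

@torch.no_grad()
def cold_sample(restoration_net, x, T, alpha_hat):
    """Cold diffusion sampling (sketch in the spirit of Algorithm 2).

    Starts from the fully degraded image x_T = x and iteratively refines it.
    """
    x_t = x.clone()
    for s in range(T, 0, -1):
        t = torch.full((x.size(0),), s, device=x.device, dtype=torch.long)
        # Predict the clean binarized image from the current iterate.
        x0_hat = restoration_net(x_t, t)
        # Re-degrade the estimate to severities s and s-1 and take the
        # difference (the "improved sampling" update of cold diffusion).
        d_s = forward_diffuse(x0_hat, x, t, alpha_hat)
        if s > 1:
            d_prev = forward_diffuse(x0_hat, x, t - 1, alpha_hat)
        else:
            d_prev = x0_hat  # D(x0, x, 0) = x0 by Eq. 2
        x_t = x_t - d_s + d_prev
    return x_t
```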

4 Experiments and Results

In this section, we first describe the experimental setup, including datasets, evaluation metrics, and the training process. Subsequently, we present a comprehensive quantitative and qualitative analysis of our results.

4.1 Experimental Setup

Datasets. Nine different DIBCO document image binarization datasets were used to assess the performance of our proposed approach. These datasets include DIBCO ’9 [37], DIBCO ’11 [38], DIBCO ’13 [12], and DIBCO ’17 [42], as well as H-DIBCO ’10 [11], H-DIBCO ’12 [39], H-DIBCO ’14 [40], H-DIBCO ’16 [41], and H-DIBCO ’18 [43]. These datasets contain a variety of degraded printed and handwritten documents exhibiting various degradations such as ink bleed-through, smudges, faded text strokes, stain marks, background texture, and artifacts.

Evaluation Metrics. Several evaluation methods have been commonly used in the literature for evaluating the binarization of document images, including FM (F-Measure), pFM (pseudo-F-Measure), PSNR (Peak Signal-to-Noise Ratio), and DRD (Distance Reciprocal Distortion), which have been adopted in this study. A higher value indicates better binarization performance for the first three metrics, while the opposite is true for DRD. Due to space constraints, detailed definitions of these metrics are omitted here and can be found in [11, 12].
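For reference, the two simplest of these metrics can be sketched as follows; pFM and DRD involve additional weighting schemes and are likewise left to [11, 12]. The convention of encoding text pixels as 1 and background as 0 is an assumption of this sketch.

```python
import numpy as np

def f_measure(pred: np.ndarray, gt: np.ndarray) -> float:
    """F-Measure for binary maps (text = 1, background = 0)."""
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    return 2 * precision * recall / (precision + recall + 1e-8)

def psnr(pred: np.ndarray, gt: np.ndarray) -> float:
    """Peak Signal-to-Noise Ratio for images scaled to [0, 1]."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    return 10.0 * np.log10(1.0 / mse)
```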

Data Preprocessing. To train the restoration model on a specific DIBCO dataset, all the images from the other DIBCO and H-DIBCO datasets, as well as the Palm Leaf dataset [65], were used. The training set was prepared by splitting each degraded image and its corresponding ground-truth image into overlapping patches of size \(384\times 384\times 3\). Table 1 shows the total number of training samples generated for each DIBCO dataset as a result of this strategy. During training, a random crop of size \(256\times 256\times 3\) was extracted from each image and fed to the model. Additionally, a number of data augmentations were randomly applied to the images, such as horizontal flipping, vertical flipping, color jitter, grayscale conversion, and Gaussian blur. A specific augmentation we used in our approach was to randomly colorize the degraded image using the inverted ground-truth image as a mask. This augmentation was necessary to prevent the models from overfitting to black text, since most of the images in the DIBCO datasets consist of black text on various backgrounds. Furthermore, we used ImageNet normalization with per-channel means of \(\mu _{RGB}=\{0.485, 0.456, 0.406\}\) and standard deviations of \(\sigma _{RGB}=\{0.229, 0.224, 0.225\}\) to normalize each image before feeding it to the model.
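As a rough illustration of the colorization augmentation, one plausible implementation (a sketch under our own assumptions about pixel conventions: ground-truth text pixels are 0 and background pixels are 1 before normalization) tints the text strokes of the degraded image with a random color, using the inverted ground truth as a mask.

```python
import torch

def random_text_colorize(degraded: torch.Tensor, gt: torch.Tensor,
                         p: float = 0.5) -> torch.Tensor:
    """One possible text-colorization augmentation (sketch).

    degraded : degraded image in [0, 1], shape (3, H, W)
    gt       : binarized ground truth in [0, 1], shape (1, H, W)
    """
    if torch.rand(1).item() > p:
        return degraded
    # The inverted ground truth acts as a mask over the text strokes.
    mask = 1.0 - gt                  # text pixels -> 1, background -> 0
    color = torch.rand(3, 1, 1)      # random RGB tint
    # Blend the random color into the degraded image on text pixels only.
    return degraded * (1.0 - mask) + degraded * color * mask
```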

Table 1. The sizes of the training and test sets for all DIBCO datasets.

Training Hyperparameters. We initialized our restoration networks with the maximum diffusion severity T set to 200 and the severity embedding dimension set to 64. For the forward diffusion process, we used a cosine beta noise schedule \(\beta _1,\dots ,\beta _T\) as described in [66]. We trained our networks for 400k iterations with a batch size of 128, the Adam optimizer, and a fixed learning rate of \(2\times 10^{-5}\) on 4–8 NVIDIA A100 GPUs.
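A short sketch of the cosine schedule of [66] with \(T=200\) is shown below, together with the cumulative products \(\hat{\alpha }_t\) used by the forward operator in Eq. 5; the offset s = 0.008 and the clipping of \(\beta _t\) follow the original formulation, while the code itself is illustrative.

```python
import math
import torch

def cosine_beta_schedule(T: int = 200, s: float = 0.008) -> torch.Tensor:
    """Cosine noise schedule beta_1..beta_T as proposed in [66] (sketch)."""
    steps = torch.arange(T + 1, dtype=torch.float64)
    f = torch.cos(((steps / T) + s) / (1 + s) * math.pi / 2) ** 2
    alpha_hat = f / f[0]                          # cumulative alphas, a_0 = 1
    betas = 1.0 - alpha_hat[1:] / alpha_hat[:-1]
    return betas.clamp(max=0.999)                 # beta_1..beta_T

# Cumulative products used by the forward operator (Eq. 5).
betas = cosine_beta_schedule(T=200)
alpha_hat = torch.cumprod(1.0 - betas, dim=0).float()
```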

Table 2. Comparison of different evaluation strategies on DIBCO ’9 [37], H-DIBCO ’12 [39], and DIBCO ’17 [42] datasets. The top strategy for each metric is bolded.

Evaluation Hyperparameters. To evaluate our approach, we divided each image into patches of fixed input size, restored them using the inference strategies outlined in Sect. 3.4, and then reassembled them to produce the final binarized image. Depending on the size of the input patch, binarization performance can be greatly affected, since smaller patches provide less context for the model, whereas larger patches provide more context. In this work, we examined two different patch sizes at test time, which were \(256\times 256\) and \(512\times 512\). It should be noted that we trained the models solely on \(256\times 256\) input images, and used images of size \(512\times 512\) only during evaluation.
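A minimal sketch of this patch-wise evaluation with direct restoration (t = T) is shown below. The zero padding, non-overlapping tiling, and final crop are simplifications for illustration rather than the exact procedure, and a threshold (e.g. 0.5) can be applied afterwards to obtain a strictly binary map.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def binarize_full_image(restoration_net, image: torch.Tensor,
                        patch: int = 512, T: int = 200) -> torch.Tensor:
    """Patch-wise direct restoration of a full document image (sketch).

    image : normalized input of shape (3, H, W)
    """
    _, H, W = image.shape
    # Zero-pad so that both dimensions become multiples of the patch size;
    # the padded border is cropped away again at the end.
    pad_h, pad_w = (-H) % patch, (-W) % patch
    x = F.pad(image, (0, pad_w, 0, pad_h))
    out = torch.zeros_like(x)
    t = torch.tensor([T], device=image.device)
    for i in range(0, x.shape[1], patch):
        for j in range(0, x.shape[2], patch):
            tile = x[:, i:i + patch, j:j + patch].unsqueeze(0)
            # Direct restoration: a single forward pass with severity t = T.
            out[:, i:i + patch, j:j + patch] = restoration_net(tile, t)[0]
    return out[:, :H, :W]  # crop back to the original size
```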

4.2 Choosing the Best Evaluation Strategy

In this section, we compare the results of the direct restoration and cold diffusion sampling strategies with varying patch sizes on three different datasets: DIBCO ’9 [37], H-DIBCO ’12 [39], and DIBCO ’17 [42], as shown in Table 2. It is evident from the table that direct restoration performed significantly better than cold diffusion sampling for binarization with both patch sizes of 256 and 512. While diffusion models are well known for providing better reconstruction/image generation performance when sampling over T steps as compared to direct single-step inference, sampling resulted in poorer FM, p-FM, PSNR, and DRD values than direct restoration in our case. Moreover, sampling is a very computationally intensive process, requiring multiple forward and reverse diffusion steps, whereas direct inference requires only a single step and is therefore extremely fast. Also evident from the table is that the model performed better with a patch size of \(512\times 512\) as opposed to \(256\times 256\). This was the case for the majority of DIBCO datasets we examined. However, we observed that the \(256\times 256\) patch size provided better performance for some datasets, such as H-DIBCO ’16 [41] and H-DIBCO ’18 [43]. This raises the question of whether it is possible to develop a more effective evaluation approach that accommodates images of different sizes and text resolutions within those images; we leave this question to future research. Since the \(512\times 512\) patch size with direct restoration offered the best performance for most datasets, we present only the results from this evaluation strategy in the performance comparisons in the subsequent sections.

Fig. 2.

Binarization results for images 1 (top) and 10 (bottom) of the H-DIBCO ’16 [41] dataset using our approach. The difference between the ground truth image and the binarized output of our proposed approach is shown to emphasize that our model produces slightly thicker strokes for this dataset

Table 3. Performance evaluation of different methods for document binarization on all the DIBCO/H-DIBCO evaluation datasets. For each metric, the top 1st, 2nd, and 3rd methods are bolded, italicized, and underlined, respectively. The results presented here were generated using the Direct Restoration / 512 evaluation strategy.
Fig. 3.

Qualitative results of our proposed method for the restoration of a few samples from the DIBCO and H-DIBCO datasets. These images are arranged in columns as follows: Left: original image, Middle: ground truth image, Right: binarized image using our proposed method

4.3 Performance Comparison

In this section, we present a quantitative comparison of our approach against a variety of other approaches, including classical approaches [16, 44, 67, 68, 72], CNN-based VAEs [52, 69, 70], GAN-based approaches [27, 46, 58, 71], and Transformer-based autoencoders [45]. The results of our evaluation are summarized in Table 3, where FM, p-FM, PSNR, and DRD of each method are compared for different DIBCO/H-DIBCO datasets, with the top three approaches for each dataset bolded, italicized and underlined, respectively. As shown, our approach outperforms existing classical and state-of-the-art (SotA) approaches on 7 datasets, including DIBCO ’9 [37], H-DIBCO ’10 [11], DIBCO ’11 [38], H-DIBCO ’12 [39], DIBCO ’13 [12], H-DIBCO ’14 [40], and DIBCO ’17 [42], ranking first on the majority of metrics, while performing competitively on the remaining 2 datasets H-DIBCO ’16 [41] and H-DIBCO ’18 [43]. It is worth mentioning that a number of recent SotA binarization techniques, including those presented by Yu et al. [46] and Jemni et al. [58], utilize several training stages, networks, or target objectives in order to achieve the reported results. Comparatively, our approach employs only a single diffusion network in an end-to-end fashion, and is able to outperform these methods across multiple datasets.

On the DIBCO ’9 [37] dataset, our approach scored the highest on all metrics except DRD, on which it ranked second. Furthermore, it demonstrated significant improvements in FM and PSNR on the H-DIBCO ’10 [11] and DIBCO ’11 [38] datasets in comparison to existing methods. We also observed a particularly noticeable improvement in PSNR with our approach on the H-DIBCO ’12 [39], DIBCO ’13 [12], and H-DIBCO ’14 [40] datasets, with increases of 1.11, 1.71, and 1.91 compared to the previous state-of-the-art method, respectively. Similarly, despite lower DRD values on some datasets, it was significantly improved for these three datasets, with values of 1.28, 1.20, and 0.66, respectively. Similar performance improvements were observed on the DIBCO ’17 [42] dataset as well, where our approach ranked first on FM, PSNR, and DRD, and ranked second on p-FM. On H-DIBCO ’18 [43], our approach placed third; however, it is evident from the results that our model demonstrated comparable performance to the top approaches.

Fig. 4.

Document binarization results for the input image 12 of DIBCO ’17 [42] by different methods

Despite the high performance achieved on other datasets, our approach failed to achieve satisfactory results on the H-DIBCO ’16 [41] dataset. Interestingly, upon inspecting the binarization outputs, we found that our approach was, in fact, quite capable of producing high quality binarization results for this dataset. The approach, however, had the tendency to generate slightly thicker text strokes compared to the ground truth images, which may explain why it did not produce the best quantitative results on this dataset. Figure 2 illustrates this effect by presenting two samples from the H-DIBCO ’16 [41] dataset along with their corresponding ground truth images, binarized images derived from our method, and their difference. As can be seen from the difference image, our proposed approach produces binarized outputs very similar to the ground truth but with slightly thicker strokes in comparison. Overall, we observed that our approach demonstrated relatively consistent performance across the majority of DIBCO datasets and provided the highest FM and PSNR.

Fig. 5.

Document binarization results for the input image HW5 of DIBCO ’13 [12] by different methods

4.4 Qualitative Evaluation

This section presents a qualitative analysis of the binarization performance of our approach. In Fig. 3, we compare the binarization results of our approach with the ground truth for a few randomly selected samples from the different DIBCO and H-DIBCO datasets. As evident from the figure, our approach was highly effective at removing various types of noise, such as stains, smears, faded text, and background texture from a number of degraded document images. Moreover, it was able to produce high-quality binarized images that were visually comparable to the corresponding ground truth images, reflecting the exceptional quantitative performance discussed in the previous section.

Aside from comparisons with the ground truth, we also compare the results of our approach to both classical and existing state-of-the-art (SotA) approaches. Figure 4 illustrates the binarization performance of various approaches, including ours, on sample 12 of the DIBCO ’17 [42] dataset. The results demonstrate that our approach was successful in restoring a highly degraded document sample that many other approaches, including the multi-task GAN approach by Jemni et al. [58], failed to sufficiently restore. Interestingly, our results for this sample were visually similar to those obtained by Souibgui et al. [45], who used an encoder-decoder Transformer architecture for binarization. In Fig. 5, we compare the binarization performance of various approaches on another sample, namely HW5 from the DIBCO ’13 [12] dataset. As can be seen, our approach was successful in restoring the image entirely, with the resulting image looking nearly identical to the ground truth. Additionally, we observed that our results for this sample were similar to, but slightly better than, those of Suh et al. [71] and Yu et al. [46], who employed two-stage and three-stage GAN-based approaches for binarization, respectively.

4.5 Runtime Evaluation

In this section, we briefly analyze the runtime of our approach and compare it with other approaches. Since binarization speed depends on the size of the input images, we evaluate the runtime in terms of secs/megapixel (MP) as used in prior works [27, 72]. Both direct reconstruction and cold sampling were evaluated using a single NVIDIA GTX 1080Ti GPU with batch sizes of 4 and 32 for 512 \(\times \) 512 and 256 \(\times \) 256 image resolutions, respectively. The evaluation runtimes for other approaches were obtained directly from two papers [27, 72], which may have used different resources for evaluation; therefore, we can only make a rough comparison. As shown in Table 4, with direct reconstruction, our approach had a runtime of \(\sim \)1 sec/MP for both input image resolutions, which is comparable to the approach developed by Zhao et al. [27], and is much lower than other computer vision methods [67, 68] and deep learning approaches [52, 69]. In contrast, the runtime for cold sampling scaled proportionally with the number of diffusion steps T. With \(T=200\) in our experiments, for a \(256\times 256\) input resolution, sampling took \(\sim \)135\(\times \) more time than direct reconstruction, and for a \(512\times 512\) input resolution, it took \(\sim \)193\(\times \) more time. Thus, direct reconstruction was not only effective quantitatively and qualitatively, but also time-efficient in comparison with sampled reconstruction. It is worth noting that the problem of unreasonably high sampling times in diffusion models is well known, and different sampling strategies [64, 73] have been proposed recently to overcome this problem.

Table 4. Average runtimes for different binarization methods.

5 Conclusion

This paper presents an end-to-end approach for the binarization of document images using cold diffusion, which involves gradually transforming clean images into their degraded counterparts and then training a diffusion model to reverse that process. The proposed approach was evaluated on 9 different DIBCO document benchmark datasets, and our results demonstrate that it outperforms traditional and state-of-the-art methods on a majority of datasets while performing competitively on the others. Despite its promising potential for document binarization, we believe it is also pertinent to discuss its limitations. As is the case with deep networks generally, the reliability of our models was quite dependent on the availability of data. While the training datasets (DIBCO and Palm Leaf combined) offer considerable diversity in terms of sample distribution, the intra-class variance of the samples was rather low, which necessitated training the models for a large number of iterations with various data augmentations in order to achieve the reported results. Therefore, to further enhance the performance of deep network-based approaches in the future, it may be worthwhile to invest resources in the creation of a large, independent, and diverse training dataset (whether synthetic or not) for binarization. We also observed a significant correlation between patch size and binarization performance with our approach. To address this issue in the future, it may be worthwhile to investigate conditioning the output of our model on the surrounding context of each image patch.