
1 Introduction

Binarization algorithms take a color or grayscale image as input and produce a bi-tonal image as output. Binarization is a key preprocessing step in document analysis solutions such as automatic character recognition and layout analysis, as the quality of the monochromatic document has a strong effect on the performance of such document processing tasks. Besides that, document binarization can increase document readability and work as a file compression strategy [58], as the size of binary images is often orders of magnitude smaller than that of the original gray or color images, requiring much less storage space or bandwidth for network transmission. Thus, assessing the quality of document binarization algorithms is important, a concern witnessed by contests that started more than a decade ago [14, 44].

Any algorithm makes design assumptions about the nature of the data being processed. Document binarization algorithms are no exception. The DIB Team (https://dib.cin.ufpe.br/) presented evidence that “no binarization algorithm is good for all kinds of documents”. Thus, depending on the intrinsic features of each document, one binarization algorithm may perform better than another. The second point made by the DIB team is that “time matters”. As binarization algorithms are part of several document processing pipelines, it is also important to balance the quality-time trade-off: several algorithms may produce binary images of similar quality with huge differences in time complexity.

Recently, a new series of document binarization contests was created, not only comparing the enrolled participants among themselves, but also comparing the quality-time performance of the new algorithms with “classical” ones. Besides that, much larger datasets of “real-world” and synthetic documents were tested in different feature clusters. Scanned documents, either historic (machine-typed or handwritten, 200 or 300 dpi, with different hues and textures of paper, and with different degrees of back-to-front interference) or bureaucratic (200 dpi, offset printed, and with light back-to-front interference), had their binarization quality and time assessed in those contests [32, 33].

This ICDAR 2021 competition assessed the quality-time performance of 12 new algorithms from seven teams from all over the world. The competitors’ algorithms were also compared with the quality-time performance of 49 other binarization algorithms that may be considered “classic”. Four test sets of “real-world” 200 and 300 dpi scanned documents were used in this assessment.

2 Participants

Seven research groups from six different countries in the Americas, Asia, and Europe enrolled in this 2021 competition. In what follows, the teams are presented together with a brief outline of their approaches.

2.1 AutoHome Corp, Haidian District, Beijing, China Team: Huang Xiao, Liu Rong, Xu Chengshen, Li Lin, Ye Mingdeng

A combination of binary cross-entropy and dice loss is chosen as the loss function of a deep-learning algorithm. Data augmentation is performed during training so as to improve the scores. The original color or gray images are divided into patches of the same dimension (e.g. 128×128). For each color patch, a trained Unet model is used to obtain a binarized patch. A binarized large image with the same size as the original image is then obtained by combining those binarized patches. In this method, the model stacking technique is performed via two Unet models with patch dimensions of 128×128 and 256×256. Further, a global view with a patch dimension of 512×512 is also combined to obtain the final results. The model with the global view is trained to capture the global context and the character locations.

This team presented two different methods to this competition:

Method (1): The segmentation model is BCD-Unet based [2].

Method (2): The segmentation model is Unet based.
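A minimal sketch of the patch-splitting and stitching step described above, assuming a generic `model.predict` call in place of the team's trained Unet; the patch size, padding strategy, and 0.5 decision threshold are illustrative assumptions, not the team's actual settings.

```python
import numpy as np

def binarize_by_patches(gray, model, patch=128):
    """gray: 2-D array scaled to [0, 1]; model.predict: placeholder returning a
    probability map of the same shape as its input patch."""
    h, w = gray.shape
    # Pad so the page splits evenly into patch x patch tiles.
    pad_h, pad_w = (-h) % patch, (-w) % patch
    padded = np.pad(gray, ((0, pad_h), (0, pad_w)), mode="edge")
    prob = np.zeros_like(padded, dtype=np.float64)
    for y in range(0, padded.shape[0], patch):
        for x in range(0, padded.shape[1], patch):
            tile = padded[y:y + patch, x:x + patch]
            prob[y:y + patch, x:x + patch] = model.predict(tile)
    # Crop back to the original size and threshold the probabilities.
    return (prob[:h, :w] > 0.5).astype(np.uint8)
```

In the stacked setting described above, the probability maps produced with 128×128, 256×256, and 512×512 patches could be combined (e.g. averaged) before the final thresholding.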

2.2 Universitas Syiah Kuala, Indonesia Team: Khairun Saddami

Method (1): An extension of the NICK binarization method [52]. The image standard deviation is used to determine the k value, which is calculated as \(k = -\sigma / (255-1.5\sigma )\), where \(\sigma \) is the image standard deviation, representing the image contrast (a code sketch of Methods (1) and (2) follows Method (3)).

Method (2): A combination of Niblack and Wolf [49]. The threshold is \(T = (2m + mk ((\sigma /m) - (\sigma /R)-1)) / 2\), where \(\sigma \) is the image standard deviation, m is the mean of the local window, R is the maximum standard deviation, and \(k = 0.35\).

Method (3): A combination of the local adaptive and global thresholding formulas, as described in [50].
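Both thresholding rules quoted above translate directly into a few lines of code. The sketch below is a rough transcription assuming NumPy/SciPy; the local window size and the value of R (the maximum standard deviation) are illustrative assumptions rather than the author's settings.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def method1_k(gray):
    # Method (1): k derived from the global standard deviation (image contrast).
    sigma = gray.std()
    return -sigma / (255.0 - 1.5 * sigma)

def method2_binarize(gray, win=25, R=128.0, k=0.35):
    # Method (2): T = (2m + m k ((sigma/m) - (sigma/R) - 1)) / 2, with m the
    # local-window mean and sigma the image standard deviation.
    g = gray.astype(np.float64)
    sigma = g.std()
    m = uniform_filter(g, size=win)        # local mean over a win x win window
    T = (2.0 * m + m * k * ((sigma / np.maximum(m, 1e-6)) - (sigma / R) - 1.0)) / 2.0
    return np.where(g > T, 255, 0).astype(np.uint8)
```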

2.3 West Pomeranian University of Technology, Poland Team: Hubert Michalak and Krzysztof Okarma

Method (1): The first step of the algorithm is image downsampling, where any of the well-known interpolation methods may be applied; for this purpose, the MATLAB function imresize was used with bilinear interpolation and the simple nearest-neighbour method. The application of a relatively large kernel during downsizing results in the loss of details related to the shapes of individual characters. Therefore, only the low-frequency image data is preserved, representing the overall distribution of the image brightness, which is in fact mainly the downsampled background information. After resizing the downsampled image back to the original size using the same kernel, an image containing only the low-frequency information is obtained, representing the approximated high-resolution background. In the next step of the proposed method, this image is subtracted from the original to enhance the text data, followed by a simple contrast increase and logical negation. The resulting image is subjected to fast global thresholding using the Otsu method (a sketch of this pipeline follows Method (3)).

Method (2): The proposed method is based on equalization of the image illumination, also increasing its contrast and making it easier to conduct the proper binarization. It is based on the analysis of the local entropy, assuming noticeably higher values in the neighbourhood of characters. Hence, only the relatively high-entropy regions should be further analysed as potentially containing characters, whereas low-entropy regions may be considered background. The additional steps of morphological dilation, contrast increase, and final binarization using Bradley’s method are applied during the final stage.

Method (3): The initial idea of region-based binarization for text recognition was presented assuming document images containing predefined text. The proposed improved method divides the image into regions of N×N pixels. For each region, the local threshold can be determined as \(T = a \cdot \mathrm{mean}(X) - b\), where mean(X) is the average brightness of the image region and the parameters a and b are subject to optimization. The algorithm is based on the same idea of calculating the local thresholds as the average brightness corrected by two parameters; however, the number of regions is higher than would result from the resolution of the image, and therefore they partially overlap each other. In this case, several threshold values are calculated for each sub-region, depending on the number of overlapping blocks covering it. The resulting local threshold is determined as the average of the threshold values calculated for the number of regions dependent on the assumed number of layers and the overlapping factor. The rationale for such an approach is a better tolerance of rapid illumination changes while still allowing a correct image binarization.
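A minimal sketch of the background-removal pipeline of Method (1), assuming OpenCV; the downscale factor, the direction of the subtraction, and the contrast stretch are illustrative choices rather than the authors' tuned parameters.

```python
import cv2
import numpy as np

def method1_binarize(gray, factor=16):
    h, w = gray.shape
    # Heavy bilinear downsizing discards character shapes and keeps only the
    # low-frequency brightness distribution, i.e. the approximate background.
    small = cv2.resize(gray, (max(1, w // factor), max(1, h // factor)),
                       interpolation=cv2.INTER_LINEAR)
    background = cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)
    # Background subtraction enhances the text; contrast is then stretched and
    # the result negated so the text becomes dark again.
    diff = cv2.subtract(background, gray)
    diff = cv2.normalize(diff, None, 0, 255, cv2.NORM_MINMAX)
    negated = 255 - diff
    # Fast global thresholding with Otsu's method.
    _, binary = cv2.threshold(negated, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```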

2.4 University of São Paulo (USP), Brazil Team: Diego Pavan Soler

The “ ” binarization method chooses to downscale the input image, rather than using patches, and then rescales the network output to the original input size. The network architecture is based on DE-GAN [60], with the input image changed to HSV and adjustments to the hyperparameters and the training process, including image augmentation.

2.5 University of Fribourg, Switzerland Team: Jean-Luc Bloechle

The “ ” binarization algorithm detects the background of the original image using small overlapping windows. First, each window calculates its median color using a quantized color palette. Then, the estimated background image is generated by interpolating the computed median pixels of the overlapping windows. Next, the background image is subtracted from the original image, and the resulting difference image is transformed into grayscale, keeping only the lowest RGB component. Binarization is then performed with Otsu’s algorithm, followed by the detection and removal of small isolated connected components. The algorithm submitted to this competition is a faster and more accurate version of the one previously submitted to the DocEng 2020 Binarization Competition [33].
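A rough sketch of this pipeline is given below, assuming OpenCV. Block-wise resizing stands in for the original median-based background interpolation, and the block size, the direction of the subtraction, and the minimum component area are assumptions.

```python
import cv2
import numpy as np

def binarize_with_background_removal(img_bgr, block=32, min_area=10):
    h, w = img_bgr.shape[:2]
    # Coarse background estimate: shrink to one sample per block, then interpolate up.
    small = cv2.resize(img_bgr, (max(1, w // block), max(1, h // block)),
                       interpolation=cv2.INTER_AREA)
    background = cv2.resize(small, (w, h), interpolation=cv2.INTER_LINEAR)
    # Subtract the background and keep the lowest component as the grayscale image.
    diff = cv2.subtract(background, img_bgr)
    gray = diff.min(axis=2)
    # Otsu thresholding on the difference image (text ends up as white pixels here).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Detect and remove small isolated connected components.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] < min_area:
            binary[labels == i] = 0
    return binary
```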

2.6 Berlin State Library - Prussian Cultural Heritage Foundation, Germany Team: Vahid Rezanezhd, Clemens Neudecker and Konstantin Baierer

The “ ” algorithm is based on machine learning and is in fact a pixel-wise segmentation model. The dataset used for training combines the training sets of binarization competitions from different years with pseudo-labeled images from the Berlin State Library’s own collection. A specific dataset has been produced for very dark or very bright images. The model is based on a Resnet50-Unet [42].

2.7 Hubei University of Technology, China Team: Xinrui Wang, Wei Xiong, Min Li, Chuansheng Wang and Laifu Guan

The method comprises three main steps. First, a morphological bottom-hat transform is carried out to enhance the document image contrast, with the size of the disk-shaped structuring element determined by the stroke width transform (SWT). Second, a hybrid pyramid U-Net convolutional network [27] is applied to the enhanced document images for accurate pixel classification. Finally, the Otsu algorithm is applied as an image post-processing step to yield the final image.
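A minimal sketch of the first and last steps, assuming OpenCV: a bottom-hat (black-hat) transform with a disk-shaped structuring element followed by Otsu thresholding. The disk radius is fixed here, whereas the authors derive it from the stroke width transform, and the hybrid pyramid U-Net stage in between is omitted.

```python
import cv2

def enhance_and_threshold(gray, radius=7):
    size = 2 * radius + 1
    disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (size, size))
    # The bottom-hat (black-hat) transform highlights dark strokes thinner than the disk.
    enhanced = cv2.morphologyEx(gray, cv2.MORPH_BLACKHAT, disk)
    # Otsu thresholding as the post-processing step.
    _, binary = cv2.threshold(enhanced, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```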

3 Quality Evaluation Methods

To evaluate the binarization algorithms in terms of image quality, the scanned documents were clustered according to their features (print type and paper texture luminosity). This produced five document sets. The quality of the binary images was compared using PSNR, DRDM, F-Measure (FM), pseudo-F-Measure (Fps) [41], and Cohen’s Kappa [6, 12]. The final ranking is defined by sorting the sum of the rankings in ascending order, following the methodology introduced in [46]. The consistency of the global ranking with a carefully made visual inspection was also checked.

Cohen’s Kappa [12]

$$\begin{aligned} k = \frac{P_{O} - P_{C}}{1 - P_{C}}, \end{aligned}$$
(1)

compares the observed accuracy with an expected accuracy, indicating how well a given classifier performs. \(P_{O}\) is the proportion of correctly mapped pixels (the observed accuracy) and \(P_{C}\) is calculated by using:

$$\begin{aligned} P_{C} = \frac{n_{bf} \times n_{gf} + n_{bb} \times n_{gb}}{N^{2}}, \end{aligned}$$
(2)

where \(n_{bf}\) and \(n_{bb}\) are the number of pixels mapped as foreground and background in the binary image, respectively, while \(n_{gf}\) and \(n_{gb}\) are the number of foreground and background pixels in the GT image, and N is the total number of pixels. The Kappa coefficient has an excellent correspondence with the image-quality perception obtained by human visual inspection of the resulting images. As indicated in [45], \(\kappa \) may be a good and easy-to-interpret image-quality evaluation measure for binary classifiers [6]. Thus, the top-twenty algorithms in image quality, ranked following [46], have their \(\kappa \) coefficient and standard deviation (shown in parentheses), together with the mean processing time and its standard deviation (also shown in parentheses), presented in the results tables.
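For reference, Equations (1) and (2) translate directly into code. The sketch below assumes binary arrays in which foreground pixels are 1 and background pixels are 0.

```python
import numpy as np

def cohen_kappa(binary, ground_truth):
    """binary, ground_truth: arrays of 0 (background) and 1 (foreground)."""
    N = binary.size
    p_o = np.mean(binary == ground_truth)        # observed accuracy P_O
    n_bf = int(binary.sum())                     # foreground pixels in the binary image
    n_bb = N - n_bf                              # background pixels in the binary image
    n_gf = int(ground_truth.sum())               # foreground pixels in the GT image
    n_gb = N - n_gf                              # background pixels in the GT image
    p_c = (n_bf * n_gf + n_bb * n_gb) / N**2     # expected (chance) agreement, Eq. (2)
    return (p_o - p_c) / (1.0 - p_c)             # Cohen's kappa, Eq. (1)
```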

4 Processing-Time Evaluation

The 12 new algorithms assessed here were implemented by their authors. The purpose of the processing-time evaluation is to provide an order of magnitude of the time elapsed for binarizing the whole datasets. The training times for the AI-based algorithms were not considered. The 12 competing algorithms were implemented using different programming languages, operating systems, and even specific hardware platforms such as GPUs. They are compared against 49 other binarization algorithms from the literature, most of which were implemented by their authors or are available in image processing environments such as MATLAB or ImageJ.

In the benchmarking, the following hardware was used:

  • CPU: Intel(R) Core(TM) i7-10750H @ 2.6 GHz; RAM: 32 GB

  • GPU: GeForce GTX 1650, 4 GB

The algorithms were executed in the following platforms:

  • Windows 10 (version 1909):

    • MATLAB: Akbari_1 [1], Akbari_2 [1], Akbari_3 [1], ElisaTV [3], Ergina-Global [23], Ergina-Local [24], Ghosh [5], Howe [18], Jia-Shi [20], Gattal [15], Lu-Su [36], Michalak [33], Yasin [32],  [52],  [49],  [51], , , .

  • Linux Pop!_OS 20.04:

    • Python with GPU: DilatedUNet [33]

    • Python Machine Learning with CPU: Calvo-Zaragoza [10],  [60], Doc-DLinkNet [68],  [32], , , , Yuleny [32].

    • C++: Bataineh [4], Bernsen [7], ISauvola [17], Niblack [40], Nick [25], Otsu [43], Sauvola [54], Singh [59], Su-Lu [61], WAN [39], Wolf [65].

    • Java: Bradley [9], dSLR [57], Huang [35], Intermodes [47], IsoData [63], Johannsen [21], KSW [22], Li-Tam [28], Mean [16], Mello-Lins [38], MinError [26], Minimum [47], Moments [62], Percentile [13], Pun [48], RenyiEntropy [53], Shanbhag [56], Triangle [67], Wu-Lu [37], Yean-CC [66], YinYang [33], .

Some experiments were performed with the algorithms that can be executed on both operating systems, and no significant processing-time difference was noticed. That is likely because the exact same hardware and up-to-date compilers were used in all cases. Although the programming languages used were MATLAB, Python, and Java, they often act as an API to lower-level implementations, leading to small differences in time due purely to the programming language used. Moreover, modern compilers are all nearly equally efficient. This can be verified in the results: some algorithms using the GPU were fast while others were slow, and among the MATLAB implementations some were among the fastest and others among the slowest methods. Thus, with some optimization and modern compiler versions, most algorithms should have similar performance across programming languages.

The mean processing time was used in the analysis. It is important to remark that the primary purpose here is to provide the order of magnitude of the processing time elapsed by each of the algorithms. The mean processing-time figures are presented only for the 20 top-quality algorithms for each of the datasets tested.

5 Test Sets and Results

If one does not take into account external physical noises such as stains, fungi, and dirt that may affect document images [29], two aspects play a fundamental role in complicating the binarization of scanned documents: the luminosity and texture of the paper background, which suffers from the natural aging process, and the back-to-front interference. The back-to-front interference [31], later called bleeding or show-through, happens when a document is typed or written on both sides of a sheet of paper and the opacity of the paper allows the back printing or writing to be visualized on the front side. Given the importance of such noise, most of the selected images from the Nabuco and LiveMemory datasets have this type of noise, with different degrees of interference \(\alpha \) [30]. As already mentioned, four test sets of “real-world” 200 and 300 dpi scanned documents were used in this assessment.

The Nabuco and LiveMemory datasets used in the experiments here are part of the DIB - Document Image Binarization data set (https://dib.cin.ufpe.br/), which is part of the IAPR-TC10/TC11 open repository of document images [30].

From the Nabuco bequest of historical documents from the late XIX century, 20 images were selected and subdivided into three clusters according to the average luminosity level of the background texture: dark textures have an average luminosity of 147, mid textures of 193, and light textures of 220. A total of seven dark-texture and seven light-texture handwritten documents and six mid-texture typewritten documents were selected. From the LiveMemory project, five images with various configurations were selected. From the PRImA project, four images belonging to the Europeana Newspapers Project dataset were used. The images were selected to provide some variability between the datasets, but similar images within each dataset. The chosen datasets are representative of a large number of “real-world” documents of interest.

The ground-truth images used here were obtained by binarizing the original images with the ten algorithms that achieved the best quality in the previous competitions [32, 33] on images similar to the ones chosen for this competition. The resulting images underwent a careful visual inspection. The three best binary images were merged by applying the logical AND operator, and the merged image underwent salt-and-pepper filtering. The result was visually re-inspected and, when needed, manually cleaned.
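A minimal sketch of the merging and filtering steps, assuming OpenCV; the foreground convention (black text, value 0, on a white background) and the 3×3 median filter are assumptions, and the visual inspection and manual cleaning remain manual steps.

```python
import cv2

def merge_ground_truth(bin_a, bin_b, bin_c):
    # With black (0) text on a white (255) background, the bitwise AND keeps a pixel
    # black whenever any of the three binarizations marks it as text.
    merged = cv2.bitwise_and(cv2.bitwise_and(bin_a, bin_b), bin_c)
    # Salt-and-pepper filtering of the merged image.
    return cv2.medianBlur(merged, 3)
```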

5.1 Nabuco Dataset

The letters of Joaquim Nabuco (b. 1849/d. 1910), a Brazilian statesman who was the first Brazilian ambassador to the USA and one of the most prominent figures in the abolition of black slavery in the Americas, are of great historical importance, and some of them are available in the DIB dataset. Three sets of images from the Nabuco bequest were used in the assessment here.

Table 1. Quality-time results for Nabuco, Light Texture, Handwritten Documents
Table 2. Quality-time results for Nabuco, Dark Texture, Handwritten Documents
Table 3. Quality-time results for Nabuco, Mid Texture, Typewritten Documents

5.2 The LiveMemory Dataset

The LiveMemory Project [34] was a pioneering initiative to build a digital library of the entire collection of proceedings of the technical events of the Brazilian Telecommunications Society (SBrT), led by the first author of this paper back in 2007. The real challenge was to scan all the printed-only volumes, semi-automatically index all the papers, enhance image quality, and binarize the images in such a way as to allow all the volumes to be stored on a single DVD, which was handed to all members of the SBrT. The documents were scanned at 200 dpi, true color, and stored using the JPEG file format with standard compression (1% loss). The LiveMemory dataset is clearly the one with the smallest variation among images, as they are all “modern” documents, offset printed, with a uniform background and some back-to-front interference.

Table 4. Quality-time results for LiveMemory Test Set

5.3 PRImA

Europeana Newspapers [11]: its main goal is to provide a representative collection of all the types of newspapers which are or might be the subject of ongoing or future digitisation activities. As such, it hosts scanned images, metadata, and ground truth (a representation of the ideal result of a processing step like OCR or layout analysis) at the level of individual newspaper pages. Three of the selected images are from the Royal Library of the Netherlands, and one is from the Austrian National Library, Austria.

Table 5. Quality-time results for PRImA Data Set

6 Result Analysis

This contest is designed to look at the trade-off between binarization quality and computational time. There is no single winner. The 20 top-performing algorithms are reported by dataset in Tables 1, 2, 3, 4 and 5. and appeared in the top-ranked algorithms for all five datasets, and the sister algorithm appeared in the top-ranked for four of the datasets. ’s first and third algorithms both appeared three times in the rankings. appeared three times and appeared twice.

The average kappa values of the top 20 reported for each dataset fell in a narrow range from 0.75 to 0.94. The binarized images produced using the best-quality algorithm for the test images had, as one may expect, very high visual quality. The ten top-quality images for each of the sample images will be made available at the DIB website (https://dib.cin.ufpe.br/) immediately after ICDAR 2021.

The execution times varied more significantly than the performance as measured by the kappa value. The median run time of the top-performing algorithms was 1 s, with 21% of the algorithms taking less than 0.1 s. was the fastest of the ranked new algorithms competing this year. Nine of the algorithms took more than a minute on average to process a page, which for most applications will not be practical given the small quality benefit the algorithm may offer. that appeared in the top rankings for all datasets was also the algorithm with the longest run time of all the ranked algorithms. The median run time of the algorithms published before 2010 was 0.11 s, whereas the median of the algorithms published 2010–2019 increased to 4.95 s, and the median of those published in 2020 and 2021 is 7.39 s. The quality does not vary significantly between those groups.

7 Conclusions

This ICDAR 2021 Competition on Time-Quality Document Image Binarization shows that document image binarization is still a challenging task. The many ways in which the problem can become harder keep driving the development of new algorithms able to handle the outlier cases that others cannot properly binarize. Machine-learning binarization algorithms are increasingly providing better-quality images, but some of the classic algorithms, such as IsoData [63] and Sauvola [55], continued to appear in the top-ranked list and still provide very good, if not the best-quality, bitonal images at a much lower time complexity.

It is important to remark that the training time for the machine-learning-based algorithms was not computed. Another point worth remarking is that some of those ML algorithms require computational resources that may be considered prohibitive, as some of the competing algorithms in the ICDAR 2019 Competition on Time-Quality Document Image Binarization [32] were still unable to run on all test images of the test sets used here.