1 Introduction

1.1 Overview

One critical step of document image analysis is to correctly identify and separate foreground and background objects; however, binarization is not easy, and finding the best thresholds is difficult because of illumination changes and noise.

For document images of good quality, global thresholding [1–3] is capable of extracting the document text efficiently. However, for document images suffering from different types of degradation, adaptive thresholding, which estimates a local threshold for each image pixel [4–8], usually produces much better binarization results.

Some binarization methods combine global and local approaches [9–12, 16]. Others incorporate background estimation and normalization steps [11–13]. The image edges that can usually be detected around the text stroke boundary are also exploited by certain binarization methods [13–15].

Because Sauvola’s binarization is widely used in practice and gives good results on document images, this paper focuses on that particular method.

1.2 Sauvola’s Algorithm and Issues

Sauvola’s method [5] takes a grayscale image as input. Since most document images are color images, a color-to-grayscale conversion is required [17].

From the grayscale image, Sauvola proposed to compute a threshold at each pixel using:

$$ T = m \times \left[ {1 + k \times \left( {\frac{s}{R} - 1} \right)} \right] $$
(1)

where \( k \) is a user-defined parameter, \( m \) and \( s \) are respectively the mean and the local standard deviation computed in a window of size \( \omega \) centered on the current pixel and \( R \) is the dynamic range of standard deviation (\( R = 128 \) with 8-bit gray level images). The size of the window used to compute \( m \) and \( s \) remains user-defined in the original paper.
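For illustration, Eq. (1) can be sketched as follows with NumPy/SciPy; the function name and the use of a box filter for the local statistics are our own choices, not the implementation of [5].

```python
# Illustrative sketch of Eq. (1), not the original implementation of [5].
import numpy as np
from scipy.ndimage import uniform_filter

def sauvola_threshold(gray, window=15, k=0.5, R=128.0):
    """Per-pixel Sauvola threshold T = m * (1 + k * (s / R - 1))."""
    gray = gray.astype(np.float64)
    # Local mean m and local standard deviation s over a window x window box.
    m = uniform_filter(gray, size=window, mode='reflect')
    m2 = uniform_filter(gray * gray, size=window, mode='reflect')
    s = np.sqrt(np.maximum(m2 - m * m, 0.0))
    return m * (1.0 + k * (s / R - 1.0))

# A pixel is classified as text (foreground) when its gray value is
# below the local threshold: binary = gray < sauvola_threshold(gray).
```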

The main advantages of Sauvola’s method are its relatively good performance on noisy and blurred documents [18] and its computational efficiency.

Due to the binarization formula (1), the user must provide two parameters \( (\omega, k) \). Some techniques have been proposed to estimate them. The authors of [19] state that \( \omega = 14 \) and \( k = 0.34 \) is the best compromise between show-through removal and object retrieval quality in classical documents. The authors of [20] based their parameter search on Optical Character Recognition (OCR) result quality and found \( \omega = 60 \) and \( k = 0.4 \), while [5, 18] used \( \omega = 15 \) and \( k = 0.5 \). Adjusting those free parameters usually requires a priori knowledge of the set of documents to get the best results; there is therefore no consensus in the research community regarding those parameter values.

In [21], a learning framework for the optimization of the binarization methods is introduced, which is designed to determine the optimal parameter values for a document image.

Sauvola’s method suffers from several limitations, including the following [22].

  • Missing low-contrast objects.

  • Keeping textured text as is.

  • Poor handling of various object sizes.

  • Spatial interference between objects.

In the remainder of this paper, we present a method to overcome one of the four limitations of Sauvola’s binarization mentioned above, namely the missing of low-contrast objects.

The rest of the paper is structured as follows. In Sect. 2 we first expose the general principle of the proposed method. In Sect. 3 we present some results of the proposed method applied to real documents and compare them to other methods’ results. We conclude on the achievements of this work in Sect. 4.

2 Proposed Method

An improvement of Sauvola’s algorithm is introduced in this section. It results in a three-step process: (1) an initialization map is extracted from the document image to identify high-probability text pixels; (2) Sauvola’s algorithm is applied to the input document image; and, finally, (3) the final binarization is produced by keeping, in Sauvola’s binarization image, the set of pixels that overlap the text pixels of the initialization map.

In the next subsection, an initial binarization is estimated using the image contrast.

2.1 Step 1: Initialization Step

At this step, we use an initialization approach based on image contrast to identify high-probability text pixels. The initialization first constructs a contrast image, evaluated from the local maximum and minimum, and then detects the high-contrast image pixels that usually lie around the text stroke boundary.

  • Contrast Image Construction

In the proposed technique, the image contrast (Fig. 1(b)) is computed from the local image maximum and minimum [15] as follows:

$$ D\left( {x,y} \right) = \frac{f_{max} \left( {x,y} \right) - f_{min} \left( {x,y} \right)}{f_{max} \left( {x,y} \right) + f_{min} \left( {x,y} \right) + \epsilon}, $$
(2)

where \( f_{max} \left( {x,y} \right) \) and \( f_{min} \left( {x,y} \right) \) refer to the maximum and the minimum image intensities within a local neighborhood window. In the implemented system, the local neighborhood window is a \( 3 \times 3 \) square window. The term \( \epsilon \) is a very small positive number, added to avoid a division by zero when the local maximum is equal to 0.

Fig. 1. High contrast pixel detection: (a) input image, (b) image contrast, (c) high contrast image pixels and (d) the image (c) after a morphological opening operation.

The image contrast defined in (2) compensates properly for variations of the image background and brightness. In particular, the numerator captures the local image difference, similar to the traditional image gradient. The denominator acts as a normalization term that avoids artifacts of uneven background and reduces the effect of background and brightness variation [15].
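As an illustration, the contrast image of Eq. (2) could be computed as in the following sketch, assuming a grayscale image stored in a NumPy array; the use of SciPy’s rank filters is our own choice, not necessarily that of [15].

```python
# Illustrative sketch of the contrast image of Eq. (2).
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def contrast_image(gray, window=3, eps=1e-8):
    """D(x, y) = (f_max - f_min) / (f_max + f_min + eps) over a local window."""
    gray = gray.astype(np.float64)
    f_max = maximum_filter(gray, size=window, mode='reflect')  # local maximum
    f_min = minimum_filter(gray, size=window, mode='reflect')  # local minimum
    return (f_max - f_min) / (f_max + f_min + eps)
```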

  • High Contrast Pixels Detection

The purpose of the contrast image construction is to detect the desired high contrast image pixels lying around the text stroke boundary. As described above, the constructed contrast image has a clear bimodal pattern: the image contrast around the text stroke boundary varies within a small range but is much larger than the image contrast within the document background. We therefore detect the desired high contrast image pixels by using Otsu’s global thresholding method (Fig. 1(c)).

To remove the small objects, we apply a morphological opening to the binary image (Fig. 1(d)). As can be seen in Fig. 1(d), some faint characters or low-contrast text pixels were suppressed, but this issue is resolved at the third step.
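A minimal sketch of this detection step is given below; the scikit-image functions and the \( 3 \times 3 \) structuring element are illustrative assumptions.

```python
# Illustrative sketch of the high-contrast pixel detection (Step 1).
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import binary_opening

def high_contrast_pixels(contrast):
    """Binarize the contrast image with Otsu's method, then remove
    small objects with a 3x3 morphological opening."""
    mask = contrast > threshold_otsu(contrast)
    return binary_opening(mask, np.ones((3, 3), dtype=bool))
```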

2.2 Step 2: Sauvola’s Binarization Step

At this step, Sauvola’s thresholding described in Sect. 1.2 is performed on the input document image.

In our experiments, we found that the value of \( R \) in Eq. (1) has very little effect on the binarization quality, while the values of \( k \) and the window size affect it significantly. Due to the threshold formula, low-contrast objects may be treated as textured background or show-through artifacts and thus be removed or only partially retrieved. A low value of the \( k \) parameter helps retrieve low-contrast objects, but since it is set for the whole document, it also alters other parts of the result: correctly contrasted objects become thicker, possibly causing unintended connections between components, and background noise and artifacts, which are usually poorly contrasted, are retrieved as objects.

The size of the window is also an important parameter for getting good results: too low a value may lead to broken characters and/or characters with holes, whereas too large a value may lead to bold characters. Its size must therefore depend on the contents of the document.

An optimal combination of \( k \) and \( \omega \) will produce a good binary image. In our experiments, we choose a low value for the \( k \) parameter to detect all the text pixels (low and correctly contrasted) and a low value for the \( \omega \) parameter to reduce the overlapping between characters.
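For illustration, this step could be run with scikit-image’s built-in Sauvola implementation and the parameter values reported in Sect. 3; this shortcut is ours, not the authors’ Matlab code.

```python
# Illustrative sketch of Step 2 with a low k and a 15 x 15 window.
from skimage.filters import threshold_sauvola

def sauvola_low_k(gray, window=15, k=0.01, r=128):
    """Binarize with a deliberately low k so that faintly contrasted text
    is retained; the extra background noise is removed at Step 3."""
    t = threshold_sauvola(gray, window_size=window, k=k, r=r)
    return gray < t
```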

2.3 Step 3: Sequential Combination

At this step, we sequentially combine the high-contrast image and Sauvola’s binarization image to obtain the final binary result.

The sequential combination consists in keeping, in Sauvola’s binarization image, the set of pixels that overlap the text pixels of the high-contrast image, as described in Algorithm 1.
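Algorithm 1 is not reproduced here; under the assumption that both inputs are boolean masks, the combination can be sketched with connected-component labelling (the use of scipy.ndimage.label is an illustrative choice).

```python
# Illustrative sketch of the sequential combination (Step 3).
import numpy as np
from scipy.ndimage import label

def sequential_combination(sauvola_mask, high_contrast_mask):
    """Keep only the connected components of Sauvola's result that
    overlap at least one pixel of the high-contrast map."""
    labels, _ = label(sauvola_mask)
    kept = np.unique(labels[high_contrast_mask & sauvola_mask])
    kept = kept[kept > 0]  # drop the background label 0
    return np.isin(labels, kept)
```

Chaining contrast_image, high_contrast_pixels, sauvola_low_k and sequential_combination in this order gives the complete three-step pipeline sketched above.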

The proposed method was implemented in Matlab; the results are presented and discussed in the following section.

3 Experiments and Discussion

The described method has been tested on the document images used in the Document Image Binarization Contests (DIBCO) that suffer from different types of document degradation. We also compare our method with other well-known binarization methods including Sauvola’s thresholding method [5].

Multiple tests performed on document images have shown that the following parameters give the best binarization results: \( \omega = 15 \times 15 \), \( R = 128 \) as recommended in [5], and \( k = 0.01 \). A low value of the \( k \) parameter helps retrieve low-contrast objects but, since it is set for the whole document, it also alters other parts of the result: a lot of background noise and artifacts are retrieved as objects. However, the proposed sequential combination (the third step) suppresses this noise efficiently, because the contrast of the document background is suppressed through the normalization described in Sect. 2.1.

The binarization results in the figures below show the superior performance of the proposed thresholding technique.

Figure 2(a) illustrates a document image degraded by faint characters. Our binarization technique first constructs a contrast image from the local image maximum and minimum, and then extracts the high-contrast image (Fig. 2(b)), which is used to suppress the background noise and artifacts. Sauvola’s algorithm is then applied to the input image (Fig. 2(c)) to detect all the text pixels (both low and correctly contrasted). After that, we sequentially combine the two results by keeping, in Sauvola’s binarization image, the set of pixels that overlap the text pixels of the high-contrast image, which produces the final binarization. As can be seen in Fig. 2(f) (the final result), the faint characters are reasonably well detected by our method. On the other hand, Sauvola’s method produces a lot of noise due to the variation within the document background, and Su’s method and Howe’s method cannot detect some weak characters.

Fig. 2. Image HW2 from the DIBCO’11 test dataset: (a) input image, (b) high contrast image, (c) binarization result obtained using Sauvola’s method, (d) Su’s method (ranked 2nd in DIBCO’11), (e) Howe’s method (ranked 3rd in DIBCO’11) and (f) the result of the proposed method.

Figure 4(a) illustrates a document image degraded by bleed-through; as can be seen in Fig. 4(c), the bleed-through is reasonably well removed by our method. The proposed method suppresses more noise than Sauvola’s method because it suppresses the contrast of the document background through the normalization in the initialization step. By comparison, Sauvola’s method improperly classifies dark background pixels as text pixels.

Figures 3(a) and 5(a) illustrate document images degraded by faint characters. Figure 5 shows that the proposed technique is tolerant to variations in document contrast and is able to binarize faint characters and badly illuminated images with little background noise, whereas some other methods may either introduce a certain amount of noise or fail to detect document text with a low image contrast, as shown in Fig. 3(c).

Fig. 3. Image H12 from the H-DIBCO’12 test dataset: (a) input image, (b) binarization result obtained using Sauvola’s method, (c) Lelore’s method (ranked 2nd in H-DIBCO’12) and (d) the proposed method.

Figures 2, 3, 4 and 5 further show four document binarization examples. As shown, our proposed method extracts the text properly from document images that suffer from different types of degradation. On the other hand, Sauvola’s method often produces a certain amount of noise due to the variation within the background.

Fig. 4. Image HW4 from the DIBCO’13 test dataset: (a) input image, (b) binarization result obtained using Sauvola’s method and (c) the proposed method.

Fig. 5. Image H04 from the H-DIBCO’10 test dataset: (a) input image, (b) Sauvola’s method and (c) the proposed method.

4 Conclusion and Future Prospects

This paper presents an efficient historical document image binarization technique that is robust against different types of document degradation such as faint characters and uneven illumination. The proposed technique makes use of Sauvola’s algorithm and of the image contrast evaluated from the local minimum and maximum. As shown in the experiments, such a combined method leads to high accuracy when applied to historical document images with a variety of degradations. The proposed method indeed succeeds in capturing both low and correctly contrasted objects within a single document. However, the performance of our binarization method remains limited when both small and large objects appear in the same document: Sauvola’s method fails to retrieve all objects correctly because its window parameter is defined for the image as a whole. In future work we will focus on handling various object sizes.