Abstract
Resorting to extraction text techniques for Chinese heritage documents becomes an increasing need. Historic documents such as Chinese calligraphy usually were handwritten or scanned in low contrast so that an automatic optical character recognition procedure for document images analysis is difficult to apply. In this paper, we present a historic document image threshold based on a combination of Bradley’s algorithm and K-means. An adaptive K-means cluster as a pre-processing methods for document image has been used for automatically grouping the pixels of a document image into different homogeneous regions. In Bradley’s methods, every image’s pixel is set to black if its brightness is T percent lower than the average brightness of surrounding pixels in the window of the specified size, otherwise it is set to white. Finally, text bounding boxes are generated by concatenating neighboring word clusters with mathematical morphology method. Experimental results show that this algorithm is robust in dealing with non-uniform illuminated, low contrast historic document images in terms of both accuracy and efficiency.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
The digital rubbing is a novel approach to promote and pass on Chinese traditional arts, as well as a new idea to protect stone relics. Historical printed documents, such as old books and rubbings, are being digitized and made available through software interfaces such as web-based libraries, for instance, there is a large collection of Chinese rubbing database keeps in UC Berkeley east Asian library [14, 15]. These documents are challenging for OCR (Optical Character Recognition, OCR) because it use non-standard fonts and suffer from printing noise, artifacts due to aging, varying kerning (space between letters), varying leading (space between lines), frequent line break hyphenation, and other image problems due to the conversion from print-to-microfiche-to-digital [1]. Segmenting heritage documents images into text from background is a crucial pre-processing step for automated reading of historical documents. A Chinese historical rubbing image has been shown in Fig. 1. Many stone texture patches have shown in background of rubbing image because of the nature factor (all characters have been carved out of stone, ink rubbings has reproduced from that stone). The histogram has showed in three bands with a unimodal distribution, it is difficult to calculate the specific threshold.
Document image binarization is a key step from the image processing to decreases the computational load and enables the utilization of the simplified analysis methods compared to 256 levels of grey-scale or color image information, which is also a basic technique in the computer vision. It makes the high-level computer vision tasks possible, so it plays an important role in the technology of the document image processing [2]. A number of promising techniques for document image binarization were implemented at the different literatures [3]. Generally, the methods that deal with document image binarization may be broadly categorized either globally or locally. Whether it is global or local binarization, threshold choice is a sensitive function of the local reflectance map. Specially, for low contrast scanned historic document image, it is difficult to improve on a fixed threshold centered between the extreme observed values, since too low a value will swamp the difference map with spurious changes, while too high a value will suppress significant change [11].The most classical threshold algorithm that is Otsu’s method [5], which maximizes the values of class variances to get optimal threshold. Maximum entropy was firstly proposed by Pun [12]. The purpose is to divide the gray-level histogram of image into separate classes and make maximum total entropy of all kinds of classes. An automatic threshold algorithm based on an iterative threshold selection is employed in the study [6–8]. At iteration [10], a new threshold Tn is established using the average of the foreground and background class means. The iterations terminate when the changes |Tn–Tn + 1| become sufficiently small [9]. Sawaki and Hagita demonstrated another specialized binarization method for textured and reverse-video (white-on-black) Japanese headlines [13]. Their method is based on the complementary relationship between characters and their backgrounds as indicated by similarity measure for black and white runs. From Fig. 1(b), the gray-level distribution did not revealed bimodal, it is important to take into consideration the amplitude transfer function of the specific scanner, as well as the spatial and gray-scale characteristics of the image. That is to say, the pixels on an image are highly correlated, i.e. the pixels in the immediate neighborhood possess nearly the same feature data. Therefore, the spatial relationship of neighboring pixels is an important characteristic that can be of great aid in imaging segmentation. Cluster techniques have taken advantage of this spatial information for image segmentation [4].
In this paper, an adaptive foreground and background clustering (FBC) approach to document image binarization, each pixel is assigned to a foreground cluster or a background cluster. The algorithm is based on adaptive K-means algorithm, where the cluster means are updated each time a data point is assigned to a cluster. Since only one background and one foreground is assumed (K = 2), that is to say, only two clusters are considered, which makes the overall implementation easier. Following by that, a median filter has been employed for salt and pepper noise removing. Finally, a Connected components labeling technique is devised to locate possible positions of Chinese character.
2 Proposed Work
Figure 1 shows a scan of a page from a Chinese historical book that the rubbing image gets darker shows a low-contrast image with its histogram. As it can be observed from the luminance histogram, all the values gather in the left of the three bands, so it is impossible to reliably locate a local minimum between histogram valleys.
K-means clustering is one of the popular algorithms in clustering and segmentation. It treats each image pixel (with R,G,B values) as a feature point having a location in space. The basic K-means algorithm then arbitrarily locates, that number of cluster centers in multidimensional measurement space. Each point is then assigned to the cluster whose arbitrary mean vector is closest. The procedure continues until there is no significant change in the location of class mean vectors between successive iterations of the algorithms. Firstly, we use an adaptive K-means cluster based segmentation to improve the performance of threshold image.
2.1 Bradley’s Algorithm
The main idea in Bradley’s algorithm is that compute the sum of real numbers \( f(x,y) \) (for instance, pixel intensity) over a rectangular region of the image. It could be called as integral images. To compute the integral image, we store at each location, \( I(x,y) \), the sum of all \( f(x,y) \) terms to the left and above the pixel \( (x,y) \). This is accomplished in linear time using the following equation for each pixel (taking into account the border cases),
After that, compute the \( s \times s \) average using the integral image for each pixel in constant time and then perform the comparison. If the value of the current pixel is \( t \) percent less than this average then it is set to black, otherwise it is set to white. The pseudo code for Bradley’s algorithm has been showing as following:
2.2 Overview of the Binarization Technique
Because there are some amount of ‘salt & pepper’ noise exist in the document image after Bradley’s binarization, median filtering is conducted. Finally, a morphology-based technique is devised to locate possible positions of Chinese Character. The detail process is shown in Fig. 2.
3 Experiments Performed on Chinese Rubbing Images
The algorithm is implemented in MATLAB. The algorithm is tested with many scanned Chinese rubbing document images, which contained characters of different fonts and size. The parameters in the Bradley’s binarization experiments are set as follows: W = 15, H = 15, and T = 5. The median filtering using MATLAB’s medfilt2 function, we have used neighborhoods of size 3-by-3 to remove noising. There chosen two examples of our adaptive threshold result are presented in Figs. 3 and 4. In order to compare our method with other different algorithm, the results of Sauvola’s, Ostu’s and Isodata’s method are shown in same figure, also. Figures 3 and 4 illustrate a text example with a very low contrast. Our method is able to segment most of the text in the image.
We tried to look for existing methods to deal with Chinese historic image binarization problem. After many experiments have been implemented, our technique can provide near perfect segmentation despite the very low illumination in the image. Sauvola’s technique fails at the characters detection. Ostu’s global technique and isodata method keep more noising in image.
4 Conclusion
In this paper, a Chinese rubbing document image binarization algorithm is developed from low contrast Chinese rubbing images. The proposed scheme is developed based on an adaptive K-means combined Bradley operation. We tested our scheme on many different Chinese rubbing, and obtained encouraging results. Because the validation procedure should handle the variety of handwritten characters and the ambiguity in the distinction of characters, extracted them from images is not always easy and is still a topic of future research.
References
Gupta, M.R., Jacobson, N.P., Garcia, E.K.: OCR binarization and image pre-processing for searching historical documents. Pattern Recogn. 40(2), 389–397 (2007)
Sauvola, J., Pietikäinen, M.: Adaptive document image binarization. Pattern Recogn. 33(2), 225–236 (2000)
Yan, H.: Unified formulation of a class of image thresholding techniques. Pattern Recogn. 29(12), 2025–2032 (1996)
Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)
Otsu, N.: A threshold selection using gray level histograms. IEEE Trans. Syst. Man Cybernet. 9, 62–69 (1979)
Bradley, D., Roth, G.: Adaptive thresholding using the integral image. J. Graph. GPU Game Tools 12(2), 13–21 (2007)
Wellner, P.D.: Adaptive thresholding for the DigitalDesk. Xerox, EPC1993-110 (1993)
Pappas, T.N.: An adaptive clustering algorithm for image segmentation. IEEE Trans. Signal Process. 40(4), 901–914 (1992)
Chang, C.I., Du, Y., Wang, J., et al.: Survey and comparative analysis of entropy and relative entropy thresholding techniques. In: IEE Proceedings - Vision, Image and Signal Processing, IET, vol. 153(6), pp. 837–850 (2006)
Sezgin, M.: Survey over image thresholding techniques and quantitative performance evaluation. J. Electron. Imaging 13(1), 146–168 (2004)
Huang, Z.K., Chau, K.W.: A new image thresholding method based on Gaussian mixture model. Appl. Math. Comput. 205(2), 899–907 (2008)
Hayes, B., Wilson, C.: A maximum entropy model of phonotactics and phonotactic learning. Linguist. Inq. 39(3), 379–440 (2008)
Mori, M., Sawaki, M., Yamato, J.: Robust character recognition using adaptive feature extraction. In: 23rd International Conference on Image and Vision Computing New Zealand, IVCNZ 2008, pp. 1–6. IEEE (2008)
http://vc.lib.harvard.edu/vc/deliver/home?_collection=rubbings
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 61472173), the grants from the Educational Commission of Jiangxi province of China, No. GJJ151134.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Huang, ZK., Ma, YL., Lu, L., Rao, FX., Hou, LY. (2016). Chinese Historic Image Threshold Using Adaptive K-means Cluster and Bradley’s. In: Huang, DS., Han, K., Hussain, A. (eds) Intelligent Computing Methodologies. ICIC 2016. Lecture Notes in Computer Science(), vol 9773. Springer, Cham. https://doi.org/10.1007/978-3-319-42297-8_17
Download citation
DOI: https://doi.org/10.1007/978-3-319-42297-8_17
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-42296-1
Online ISBN: 978-3-319-42297-8
eBook Packages: Computer ScienceComputer Science (R0)