Abstract.
We propose a novel approach to restoring digital document images, with the aim of improving text legibility and OCR performance. Both are often compromised by artifacts in the background, which arise from many kinds of degradation, such as spots, underwritings, and show-through or bleed-through effects. So far, background removal techniques have been based on local, adaptive filters and morphological-structural operators to cope with frequent low-contrast situations. For the specific problem of bleed-through/show-through, most work has been based on comparing the front and back pages. This, however, requires a preliminary registration of the two images. Our approach treats the problem as one of separating overlapped texts and reformulates it as a blind source separation problem, which we solve with independent component analysis techniques. These methods have the advantage that no models are required for the background. In addition, we use the spectral components of the image at different bands, so that there is no need for registration. Examples of bleed-through cancellation and recovery of underwriting from palimpsests are provided.
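The idea of separating overlapped texts from spectral channels can be illustrated with a toy example. The sketch below is not the algorithm used in the paper: it assumes a linear, instantaneous mixing model in which each observed channel is a weighted sum of two independent "ink" patterns, and it replaces a full ICA algorithm with a simple two-source variant (whitening followed by a grid search for the rotation that maximizes squared excess kurtosis). The mixing weights, the ink densities, and all function names are hypothetical choices made for illustration.

```python
import math
import random


def standardize(v):
    """Shift a signal to zero mean and scale it to unit variance."""
    n = len(v)
    m = sum(v) / n
    s = math.sqrt(sum((a - m) ** 2 for a in v) / n)
    return [(a - m) / s for a in v]


def rotate(a, b, angle):
    """Rotate the pair of signals (a, b) by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return ([c * p + s * q for p, q in zip(a, b)],
            [-s * p + c * q for p, q in zip(a, b)])


def excess_kurtosis(v):
    """Excess kurtosis of a zero-mean, unit-variance signal."""
    return sum(a ** 4 for a in v) / len(v) - 3.0


def correlation(a, b):
    """Sample correlation coefficient of two signals."""
    a, b = standardize(a), standardize(b)
    return sum(p * q for p, q in zip(a, b)) / len(a)


def whiten(x0, x1):
    """Decorrelate two channels and scale each to unit variance."""
    n = len(x0)
    m0, m1 = sum(x0) / n, sum(x1) / n
    x0 = [p - m0 for p in x0]
    x1 = [q - m1 for q in x1]
    a = sum(p * p for p in x0) / n
    c = sum(q * q for q in x1) / n
    b = sum(p * q for p, q in zip(x0, x1)) / n
    # Rotating onto the principal axes of the 2x2 covariance removes correlation.
    phi = 0.5 * math.atan2(2.0 * b, a - c)
    y0, y1 = rotate(x0, x1, phi)
    return standardize(y0), standardize(y1)


def separate(x0, x1, steps=90):
    """Two-source ICA sketch: whiten, then pick the most non-Gaussian rotation."""
    z0, z1 = whiten(x0, x1)
    best = None
    for k in range(steps):
        theta = (math.pi / 2.0) * k / steps
        u0, u1 = rotate(z0, z1, theta)
        score = excess_kurtosis(u0) ** 2 + excess_kurtosis(u1) ** 2
        if best is None or score > best[0]:
            best = (score, u0, u1)
    return best[1], best[2]


# Synthetic "pixels": sparse, independent ink patterns (assumed densities).
random.seed(0)
n = 3000
main_text = [1.0 if random.random() < 0.10 else 0.0 for _ in range(n)]
bleed = [1.0 if random.random() < 0.15 else 0.0 for _ in range(n)]

# Hypothetical mixing weights: each spectral channel sees both patterns.
A = [[1.0, 0.6], [0.5, 1.0]]
chan0 = [A[0][0] * s + A[0][1] * t for s, t in zip(main_text, bleed)]
chan1 = [A[1][0] * s + A[1][1] * t for s, t in zip(main_text, bleed)]

# Separation uses only the two channels; A is never passed in.
est0, est1 = separate(chan0, chan1)
```

The estimated components are recovered only up to order, sign, and scale, which is the usual ambiguity of blind source separation. Because both observations come from the same scan, the pixels are aligned by construction, which is why this formulation needs no registration step.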
Additional information
Received: 15 April 2003, Accepted: 17 December 2003, Published online: 22 April 2004
Correspondence to: Anna Tonazzini
This work has been partially supported by the European Commission project “Isyreadet” (http://www.isyreadet.net), under contract IST-1999-57462.
About this article
Cite this article
Tonazzini, A., Bedini, L. & Salerno, E. Independent component analysis for document restoration. IJDAR 7, 17–27 (2004). https://doi.org/10.1007/s10032-004-0121-8
DOI: https://doi.org/10.1007/s10032-004-0121-8