
1 Introduction

The current scientific literature comprises a wide range of works related to image pattern recognition for scanned documents [1]. Scanning a document or book is the mechanical stage of the recognition process, and usually requires human intervention to ensure correct handling and framing for the subsequent steps [2]. Among the various scanning methods, the most common use a flatbed scanner or a handheld camera [3]. Each of these methods has a significant impact on the required preprocessing and on the final quality of the scanned document. Subsequently, the automated reading of scanned character images is performed through a technique known as optical character recognition (OCR), in which adequate handling and precise preprocessing yield improved results [1, 3, 4].

The main objective of this paper is to develop an algorithm for adequately preparing documents for digitization, aimed at achieving a low error rate, considering that these documents will be handled by blind or visually impaired people without any help or supervision [5]. Hence, the user will be able to experience reading books in real time, using their hands only to turn the pages, guided by the program. An algorithm for a camera-based device capable of recognizing the page and then aurally assisting the user during the handling and digitization stages in real time is also discussed in this study.

Likewise, this paper presents an assistive technology that gives blind and visually impaired people access to books, broadening their horizons, increasing their social inclusion and sphere of action in diverse fields, and opening possibilities both for themselves and for the wider community [6].

Digitizing whole books is known to raise the character recognition rate, due to the fact that all converted pages can be post-processed according to statistical criteria [2]. However, works that focus on digitizing whole books while the pages are being browsed in real time [3] are very rare. This capability would allow visually impaired and blind people the experience of reading a book.

The handling of a book by blind people complicates the automatic character recognition process as a whole, because blind people cannot see whether their hands or fingers are overlapping any text in the book. In addition, the way the book is held also affects page distortion [5]. During handling, the book may change position and angle with each turned page, which can compromise the OCR results. Ways of overcoming these obstacles are discussed in the following sections.

This paper is organized as follows: Sect. 2 reviews related work. Section 3 presents the specifications necessary to make this work functional. Section 4 describes the algorithm and explains its proper functioning. Section 5 presents the experimental results and discussion. Section 6 presents the conclusion and suggests future work.

2 Literature Review

Several studies focus on OCR technology for scanned documents and its inherent difficulties [1]. Compared to scanned images, pictures taken by cameras suffer from major distortions due to perspective and image warping [3]. These distortions are more common in pictures of open books and bound documents, and require page-by-page rectification [7]. Currently, two categories of techniques can perform this task: 3-D document shape reconstruction [2, 8, 9], which digitally reconstructs the shape of the document from information in the captured image but requires a larger computational effort; and 2-D document image processing [2, 10], which does not depend on prior information and works only with the current document image, without the need for auxiliary hardware [5, 11]. Technically, skew is the deviation, in degrees, of a document's text lines from the horizontal orientation [12, 13]. Projection profiles are one of several methods that can be used to detect this deviation [13]. Although the results are considered effective, the deviation angle limits accepted by this method must be considered low for the purposes of this study. Since handling by a visually impaired person is unpredictable in this regard, widening the accepted angle range greatly increases the computational effort [13]. The angles used in the proposed method are larger than those expected in real use, bearing in mind that the operational process will be carried out on the Vocalizer platform, depicted in Fig. 1, which prevents wider angle discrepancies.
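As an illustration of this family of methods, the following is a minimal sketch of projection-profile skew detection, assuming OpenCV and NumPy; the angle range, step, and variance-based score are illustrative choices, not the exact parameters of [13].

```python
import cv2
import numpy as np

def estimate_skew_projection(gray, angle_range=5.0, step=0.25):
    """Estimate page skew by maximizing the variance of the horizontal
    projection profile over a range of candidate rotation angles."""
    # Binarize so that text pixels are 1 and background is 0.
    _, binary = cv2.threshold(gray, 0, 1,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    h, w = binary.shape
    center = (w // 2, h // 2)
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-angle_range, angle_range + step, step):
        m = cv2.getRotationMatrix2D(center, angle, 1.0)
        rotated = cv2.warpAffine(binary, m, (w, h), flags=cv2.INTER_NEAREST)
        profile = rotated.sum(axis=1)   # amount of "ink" in each pixel row
        score = np.var(profile)         # sharp peaks mean aligned text lines
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

Widening angle_range directly multiplies the number of trial rotations, which is the growth in computational effort mentioned above.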

Fig. 1. Red arrows indicate the limit of the stage, against which the user must rest the top of the book.

The Vocalizer platform prototype consists of a USB ELP 75° 8 Mpx HD camera (no-distortion lens) for capturing the book images, with the lighting conditions properly leveled by a standard-model 22 W circular lamp with a color temperature of 6400 K. Figure 1 illustrates how these elements are arranged. Although not used for the tests presented in this paper, the algorithm was designed to simulate the use of a book on this platform and most of the difficulties arising from it.

Methods based on the Hough transform find the best orientation of straight text lines [14], but require a large computational effort. Another group of methods instead applies nearest-neighbor clustering to connected components, finding a more accurate direction between each component and its neighbors [13].
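For comparison, a minimal sketch of a Hough-based orientation estimate, again assuming OpenCV; the edge and line parameters are illustrative assumptions.

```python
import cv2
import numpy as np

def estimate_skew_hough(gray):
    """Estimate text skew from the dominant near-horizontal orientation
    among straight lines found by the probabilistic Hough transform."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                            minLineLength=gray.shape[1] // 4, maxLineGap=20)
    if lines is None:
        return 0.0
    angles = []
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
        if abs(angle) < 45:          # keep only near-horizontal candidates
            angles.append(angle)
    return float(np.median(angles)) if angles else 0.0
```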

A few methods can be employed to straighten text lines curved by image distortion and bound-document distortion. Ulges [15] maps each letter into a rectangular cell of a given size and position, extracting information from each pixel on the page. Region segmentation of words, on the other hand, is performed to detect the integrity of words within a page, so that each word detected as a "text area" corresponds to its respective row and column [5, 13, 16]. Text quality degradation [4, 17] can result from irregular illumination, which impairs letter recognition as a whole in uncontrolled environments. This can be mitigated with semi-controlled ambient light; in the method proposed in this study, it is achieved through a directional light attached to the camera support used for book digitization [5].

Based on the pioneering work of Chakraborty [3], this paper focuses on processing a video stream at a low frame rate and high resolution, in order to help visually impaired or blind people handle books adequately and digitize them with a higher hit rate. There is very little research on real-time book digitization; the method proposed by Chakraborty, which uses video streaming for digitization, is the research most relevant to the scope of this work.

The algorithm proposed in this paper operates on a high-resolution (13 Mpx) video stream at 1 frame per second, yielding an average of 14 images captured at 8 Mpx under semi-controlled ambient light and expanded to 13 Mpx. Chakraborty [3] instead uses a 50 frames per second, 2 Mpx (1920 × 1080) video stream in an uncontrolled environment. The frame stream is used to verify whether the page is entirely open or being turned, in which case only parts of the page appear. To detect the moments when the user is not turning a page, Chakraborty [3] relies on the statistical stability of candidate lines: the region of interest of the book is captured first and compared to its edges using a straight-line search algorithm. Once the candidate lines are set, if any lines appear within the region of interest, the algorithm concludes that a page is being turned and flags the frame as bad; if the candidate lines are unaffected, the frame is considered good [3]. Nevertheless, this operation requires the book to be properly arranged. Moreover, because the hands of the visually impaired or blind person move constantly during the process, any of the edge lines, as well as the book itself, could shift slightly in any direction from frame to frame without being noticed.

A challenge thus arises: how can frames be stabilized when not only rotation but also translation can occur?

In order to overcome this instability, the proposed method uses a video stabilization algorithm known as point feature matching [18]. To stabilize the frames, frames 1 and 2 of the same page are translated and rotated so that they correspond to one another. The user's fingers cover approximately 5% of the book area, so the difference between images 1 and 2 after a threshold operation should be around 10%. If the difference between the two frames exceeds 10%, the base of the book has moved. The algorithm must then search for and locate the regions that underwent the largest energy change, in order to determine whether or not the fingers are present in the image. This process is based on differential power stability between frames [19]. The difference between candidate-line statistical stability [3] and differential power stability is that the former must elect one frame, over time, among all frames considered good, i.e., those in which the book pages are wide open and unobstructed, so warping may still occur; this requires a larger computational effort. In the latter technique, only frames with low energy variation and the book wide open are processed, with the book stretched out, so warping is naturally reduced. This technique requires a lower computational effort and fits the needs of this work overall.
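A minimal sketch of this stabilization-plus-difference check, assuming OpenCV: corner features tracked between frames give a rotation-and-translation estimate, and the thresholded difference of the aligned frames yields the changed-area fraction. Parameter values here are illustrative, not the exact ones used by the platform.

```python
import cv2
import numpy as np

def stabilize_and_compare(frame1, frame2, diff_fraction=0.10):
    """Align frame2 to frame1 by point feature matching, then measure the
    fraction of pixels that still differ after a threshold operation."""
    g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)
    # Detect corner features in frame1 and track them into frame2.
    pts1 = cv2.goodFeaturesToTrack(g1, maxCorners=300,
                                   qualityLevel=0.01, minDistance=20)
    pts2, status, _ = cv2.calcOpticalFlowPyrLK(g1, g2, pts1, None)
    ok = status.flatten() == 1
    # Estimate the rotation + translation mapping frame2 onto frame1.
    m, _ = cv2.estimateAffinePartial2D(pts2[ok], pts1[ok])
    aligned = cv2.warpAffine(g2, m, (g1.shape[1], g1.shape[0]))
    # Threshold the absolute difference and measure the changed area.
    _, mask = cv2.threshold(cv2.absdiff(g1, aligned), 30, 255,
                            cv2.THRESH_BINARY)
    changed = np.count_nonzero(mask) / mask.size
    # Around 10% changed: only the hands moved; well above: the book moved.
    return changed > diff_fraction, mask
```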

Fig. 2. Illustration of the regions that could be blocked by the hands during the process.

Fig. 3. Example of hands blocking the text (grey color).

Fig. 4. Example of the next frame, in which the hands are not present and the text appears.

3 The Specification

The hand removal algorithm and the corresponding text substitution require two images to be taken. The pages must be stretched flat with the individual's hands; however, the hands cannot be placed over the same area in both images. The purpose of the algorithm is to replace the regions covered by the hands with the corresponding text areas, as depicted in Fig. 2.

To accomplish this substitution, an image difference algorithm is applied. This algorithm compares two images and generates a new one composed from both. The comparison is made pixel by pixel: given images A and B (the template), the elements A(x, y) and B(x, y) are compared, where x and y give the position (row and column) of each element. Where the pixels at the same position differ, they are highlighted, as shown in the following figures:

In other words, only differing pixels are retained in the composed image. Figure 5 shows the result of applying the difference algorithm to Figs. 3 and 4. To keep the text of both versions of the document, the regions where differences exist are composed so that the two images remain overlaid at a certain transparency level, instead of being entirely removed (Fig. 10).
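A minimal sketch of this difference-and-compose step, assuming OpenCV and two already-aligned frames; the difference threshold and blending weight are illustrative assumptions.

```python
import cv2

def compose_difference(img_a, img_b, diff_threshold=30, alpha=0.5):
    """Find pixels that differ between two aligned page images and, in
    those regions, overlay both versions at a transparency level so the
    text from either image is preserved."""
    diff = cv2.absdiff(img_a, img_b)
    # A pixel counts as different if any channel changed beyond the threshold.
    mask = diff.max(axis=2) > diff_threshold
    blend = cv2.addWeighted(img_a, alpha, img_b, 1.0 - alpha, 0)
    composed = img_a.copy()
    composed[mask] = blend[mask]     # keep only the differing regions blended
    return composed, mask
```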

Fig. 5. Difference detected by the algorithm.

Fig. 6. Angle detection and deskew performed by the algorithm.

Fig. 7. Example of red areas inserted in the original image to test the algorithm.

Fig. 8. Example of an angle applied to the original image to test the algorithm.

Fig. 9. Example of an image that will be run through the algorithm.

Fig. 10. Resulting image after all steps of the algorithm have succeeded.

However, before applying this algorithm, the text must be aligned. To do so, a rotation correction (deskew) algorithm is applied: it finds the existing text lines on the page and calculates their angle, which is then used as the rotation parameter. Figure 6 illustrates this process.
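The paper does not spell out the rotation implementation; a minimal sketch of the correction step, paired with an angle estimator such as the projection-profile one sketched in Sect. 2, could look as follows (function and parameter names are illustrative).

```python
import cv2

def rotate_page(image, angle_degrees):
    """Rotate the page around its center by the detected skew angle,
    replicating the border so no black corners are introduced."""
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_degrees, 1.0)
    return cv2.warpAffine(image, m, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

# Usage with the estimator sketched earlier:
#   angle = estimate_skew_projection(gray_page)
#   deskewed = rotate_page(page, angle)
```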

In summary, the hand removal algorithm proceeds through the following steps (a sketch of the full pipeline is given after the list):

(a) Adaptive threshold;
(b) Rotation (deskew) algorithm;
(c) Image difference algorithm.
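A minimal sketch of this three-step pipeline, reusing the rotate_page and estimate_skew_projection helpers sketched above; the adaptive-threshold parameters are illustrative assumptions.

```python
import cv2

def process_page_pair(frame1, frame2):
    """Hand removal pipeline sketch: (a) adaptive threshold,
    (b) rotation (deskew), (c) image difference."""
    g1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)
    # (a) Adaptive threshold: binarize robustly under uneven illumination.
    b1 = cv2.adaptiveThreshold(g1, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 10)
    b2 = cv2.adaptiveThreshold(g2, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, 31, 10)
    # (b) Rotation: align the text lines in both frames.
    b1 = rotate_page(b1, estimate_skew_projection(b1))
    b2 = rotate_page(b2, estimate_skew_projection(b2))
    # (c) Image difference: highlight the hand-covered regions.
    diff = cv2.absdiff(b1, b2)
    return b1, b2, diff
```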

4 The Algorithm

Overall, the algorithm takes one high-resolution frame per second. The captured image is used to verify whether the page is entirely open or being turned, in which case only parts of the page appear. If the page is entirely open, a beep is emitted and the book page should be stretched flat with the user's hands. Once the image of the page is taken, another beep sounds, the user should reposition their hands, and another frame is captured. The algorithm then deskews the images for proper alignment; no dewarping is necessary because the page is stretched. The areas covered by the user's hands are then detected, and the images are combined, replacing the covered areas in a new image. In the first frame, as shown in Fig. 7, the finger region is marked red; the algorithm defines it as "Finger_Region1" and cuts this region out of the image, discarding its contents. In the second frame, shown in Fig. 8, the process is repeated and the red region is defined as "Finger_Region2". The result is a new image for each frame, "CleanImage1" and "CleanImage2" respectively, in which the finger regions are empty (Fig. 10).

In frame 1, the contents at the coordinates of "Finger_Region2" are copied into "CleanImage1"; in frame 2, the contents at the coordinates of "Finger_Region1" are copied into "CleanImage2". Then "CleanImage2" is copied into the "Finger_Region1" coordinates of the first frame, and "CleanImage1" is copied into the "Finger_Region2" coordinates of the second frame. The resulting image is then binarized through a threshold process, so that real-time OCR detection can be more accurate.
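A minimal sketch of this region substitution, simplified to a single output image; the boolean hand mask is assumed to come from the red-region detection described above, and the final binarization uses Otsu's method as one plausible threshold choice.

```python
import cv2

def merge_hand_free(frame1, frame2, finger_mask1):
    """Fill the hand-covered pixels of frame1 (given by the boolean mask
    finger_mask1) with the corresponding pixels of the aligned frame2,
    then binarize the merged page for OCR."""
    merged = frame1.copy()
    merged[finger_mask1] = frame2[finger_mask1]   # copy the uncovered text in
    gray = cv2.cvtColor(merged, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```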

To evaluate the process, page 32 of the book "Tolstoi, The Biography" was used. The book was digitized with a Plustek BookReader V200 flatbed scanner [20]. The objective of this evaluation is to measure the changes the process causes in OCR character recognition precision. Thus, the images produced by the process were used as OCR input, and the precision of the recognized text was quantified. The text angle was the variable altered in the tests.

The book "Tolstoi, The Biography" was chosen for its plain, justified text, so as to isolate any text segmentation issues that might affect OCR performance. The Plustek BookReader V200 flatbed scanner was chosen because it is designed for visually impaired and blind users. Since its functionalities include OCR and text-to-speech, the images it produces are in a favorable format for OCR.

No real images were taken for the tests; in other words, images of documents actually obstructed by hands over text regions were not used. Many factors, such as lighting, document position, and other deformations, are not yet handled by this process. Hence, specially prepared images were used. To simulate hands over the document, visual obstructions were added to the images via software. These obstructions consist of red circular areas of 999 × 999 pixels, added at random positions to simulate hands, thus generating pairs of images to be used as input to the image difference algorithm.
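A minimal sketch of how such test pairs can be generated; the circle radius approximates the 999 × 999 pixel areas described above, and all names are illustrative.

```python
import random
import cv2

def add_fake_hand(image, radius=500, seed=None):
    """Paint a filled red circle at a random position to simulate a hand
    obstructing the page (assumes the page is larger than the circle)."""
    rng = random.Random(seed)
    h, w = image.shape[:2]
    center = (rng.randrange(radius, w - radius),
              rng.randrange(radius, h - radius))
    obstructed = image.copy()
    cv2.circle(obstructed, center, radius, color=(0, 0, 255), thickness=-1)
    return obstructed

# One test pair: the same page with obstructions in different places.
# pair = (add_fake_hand(page, seed=1), add_fake_hand(page, seed=2))
```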

In addition, the software ImageMagick [21] was used to rotate the images in order to test the angle correction process. Pairs of images were generated in which the first had a positive angle and the second had the same angle, but negative.

After the process was complete, the resulting images were submitted to the Tesseract OCR software [22]. To measure the accuracy of OCR character recognition, Google Diff [23] was used as the comparison algorithm. It compares two text documents and measures the differences between them: given texts A and B as parameters, Google Diff evidences (a) characters that exist in A but not in B, and (b) characters that are present in B but not in A. Thus, the text recognized by OCR from each image was compared with the exact text of that image, known as the ground truth. The ground truth was obtained by applying Tesseract OCR to each image and manually correcting the resulting text (twice), so that any identified errors could be eliminated. The OCR output was then compared with the ground truth, and the error percentage (performance) was obtained from the number of characters in the OCR output that did not match their respective ground truth characters. In other words, if the ground truth has 10 characters and the OCR output has 2 wrong characters, the error percentage is 20%. All images were 96 DPI, converted from TIFF to PNG.
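A minimal sketch of this error measure, using Python's difflib as a stand-in for Google Diff; characters of the ground truth not covered by a matching block count as errors.

```python
import difflib

def char_error_rate(ocr_text, ground_truth):
    """Fraction of ground-truth characters that the OCR output missed
    or got wrong, measured from the matching blocks of a diff."""
    matcher = difflib.SequenceMatcher(a=ground_truth, b=ocr_text,
                                      autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return (len(ground_truth) - matched) / len(ground_truth)

# 10-character ground truth, 2 wrong characters -> 20% error rate.
print(char_error_rate("helXoworYd", "helloworld"))  # 0.2
```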

5 Results

The results, as shown in Table 1, reveal no significant degradation of OCR performance from the application of the suggested process, as long as all steps occur properly.

Table 1. Test results showing the different angles used in the tests and the corresponding performance.

For instance, the OCR error rate on the unprocessed image is 1.9% over a total of 2432 characters. With the application of the image difference, the error rate dropped to 0.6%. With angle correction plus image difference, the error rate rose to 0.8%, at an angle of 45 degrees. That is, the process may even improve performance compared to the accuracy on the unprocessed image, probably because the images are thresholded (binarized) during the process (Fig. 10).

On the other hand, the process is at great risk if one of its steps fails. Figure 11 shows that, at an angle variation of 7.5 degrees, the character recognition error rate reaches 100% due to a failure in the angle correction step.

Fig. 11. Resulting image after the angle correction failed in the algorithm.

For comparison, in the best scenario Chakraborty [3], using Tesseract OCR, reaches 70% character recognition accuracy, while with the algorithm presented in this paper the OCR accuracy reaches around 99.4%. Note, however, that the experimental conditions differ: the former experiment uses whole books in an uncontrolled environment, while the latter uses a single frame in a virtually controlled environment.

The experiments were run on an Intel Core Duo T2400 processor at 1.83 GHz with 3 GB of DDR2 RAM, while Chakraborty [3] used a 1.87 GHz CPU with 2 GB of RAM.

6 Conclusion

This work proposed an algorithm capable of overcoming difficulties faced by visually impaired and blind people when handling books for digitization. The objectives were successfully achieved, as shown in Table 1. Although a few improvements are still needed, the results are promising. The rotation angles used in the tests are much wider than those expected in real use, so good results can be expected in practice thanks to the guidance of the Vocalizer platform, against which the top of the book must be aligned, preventing further angle discrepancies.

For future research, the difference algorithm should be applied only to hand-covered areas, which can be accomplished by restricting it to regions where the pixels differ greatly. The rotation algorithm needs further development to become more accurate. The next tests should be performed on real cases, with the subject's hands instead of software-simulated obstructions, and with ambient illumination also taken into account. Moreover, the quality of the composed image must be improved, considering that the text image will subsequently be recognized by OCR.