1 Introduction

The age of digitization has made it mandatory to have digital records to make the data easily available. Philip et al. [1] observed that hospitals look after digitization of medical records in order to reduce costs and also improve accessibility to records. This medical record is extremely important from patient as well as from doctor’s perspective, as it serves a basis for early diagnosis and tracking the history of the ailment. The hospitals which are equipped with the usage of digitized records showing better performance in terms of less time for searching the records and more time for patient care as compare to hospitals with paper records [1]. In such scenario, it is observed that there is a need for hospitals to move towards process of paperless environment, wherein a tool is required to convert the existing paper records into their digital form in a simple and efficient manner.

Usually, printed documents are first scanned and then saved in the form of image in order to store the information present in them. However, utilizing this available information by only reading the text within images is quite a difficult task. Optical Character Recognition (OCR) is one of the active research areas that involves number of attempts to develop an automated system that extracts and processes text from the image files. The objective of OCR is to provide an editable digital format by modifying or converting any form of text from document images such as handwritten and printed scanned texts for deep and further processing [2]. The OCR mainly focuses on the recognition of two types of characters i.e. machine printed and handwritten. The varied characteristics makes them a challenging area of research [3]. Number of challenges such as variations in font size, font colour, background colour, image resolution, etc. which the OCR faces while performing the digitization, and thus, makes it vulnerable to inaccuracies [2]. This creates the pre-processing of images a crucial step wherein the characteristics of images are enhanced. Therefore, there is a need to carry out by taking in consideration the characteristics of the OCR, as every OCR Engine responds differently to the characteristics of the image e.g. resolution, colour correction, contrast etc. Subsequently, Braille system is developed for written communication to help visually impaired persons wherein language is converted into Braille characters. This optical Braille recognizer reads pattern from images and translate into text [21]. In [22], character recognition on Offline handwritten is investigated using machine-encoded text and support vector machine (SVM) is applied for classification in forensic applications. Here, local features from the image are extracted to improve the accuracy, moreover, 74.32% of accuracy is reported for character recognition.

Even though number of pre-processing techniques developed in the past, these techniques are dependent upon the category of images and which features should be extracted from these images. Most of the available comparative studies are mainly focused on a specific characteristic of the image, i.e., from where data needs to be extracted, however, a study which compares various set of tools and techniques based on the utilization is needed. Research in the digitization of medical lab reports is still one of the unexplored areas. Study relating to the aftermath or advantages of medical data digitization is found quite abundantly but investigation regarding optimal approaches to digitize the data relating to specific medical domains is quite insufficient. In the propose work, we focus on digitizing medical lab reports which are a major resource in understanding patients' health charts and his/her medical condition changes over a period of time in response to the treatment they are receiving.

The paper is organized as follows; Sect. 2 discuss on OCR engines and various image preprocessing techniques. Literature review based on the medical report generation is presented in Sect. 3. Section 4 highlights the proposed methodology for medical report digitization. Results and discuss is found in Sect. 5, finally the conclusion and future scope is presented in Sect. 6.

2 OCR engines and image pre-processing techniques

In this paper, we have mainly focused upon three open-source OCR Engines detailed in [10, 13, 17] and six pre-processing techniques, which is mentioned in [14] to draw necessary conclusions for this work. Our attention while choosing the OCR engines is to find out open-source, easily accessible and robust engines so that the target user can develop their applications without suffering from any licensing issues.

2.1 OCR engines

Here, three open-source OCR Engines are described and there technical details are presented as follows:

Tesseract OCR: An open-source text recognition (OCR) Engine, available freely under the Apache 2.0 license. It can be run both from the command line and through GUI Interface (using third Party compatibility) [18]. It consists of a fully-featured API and can be compiled for a variety of targets including Android and the iPhone. It is also available for Linux, Windows and Mac OS X platforms. The development of tesseract is focused on line finding, features/classification methods, and the adaptive classifier for achieving the best accuracy [9].

EasyOCR: It is ready-to-use, open-source OCR with support for 80 + languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc. EasyOCR is adaptable to any state-of-the-art model plug-in [5]. In Fig. 1 the flow of the development of EasyOCR is shown [19].

Fig. 1
figure 1

Easy OCR framework

docTR: An open source OCR engine available under Apache 2.0 license. docTR is powered by TensorFlow 2 and PyTorch. The text detection and text recognition is performed using the following techniques [6].


Text detection

  • Double-Butterfly Network (DBNet): Real-time scene text detection with differentiable binarization.

  • LinkNet: Exploiting encoder representations for efficient semantic segmentation.


Text recognition

  • Convolutional recurrent neural network (CRNN): The combination of two of the most prominent neural networks, CNN (Convolutional Neural Network) followed by the RNN (Recurrent Neural Networks). Optimal neural networks model for image-based sequence recognition and its application to scene text recognition.

  • Show, Attend and Read (SAR): A simple and strong baseline for irregular text recognition.

  • Multi-Aspect Non-local Network for Scene Text Recognition (MASTER): A scene text recognizer targeting to identify characters in a sequence from image with straight text.

2.2 Pre-processing

Image binarization: Image binarization is the process of converting a document image from 3D pixels array format into 2D format, i.e. into a bi-level document image. Image pixels are separated into a dual collection of pixels, i.e. black and white. The main goal of image binarization is the segmentation of documents into foreground text and background.

Brightness transformation: Here, the brightness information is given by the Eq. (1):

$$g(x) = \alpha f(x) + \beta , \alpha >0$$
(1)

The parameters \(\alpha\) = gain and β = bias parameters are said to control contrast and brightness, respectively. The image brightness and contrast vary as we change \(\alpha\) and β.

Gamma correction or power law transform: Gamma correction carries out a non-linear operation on the source image pixels and can cause saturation of the image to be altered. Gamma correction simply can be defined by the following power-law expression:

$${V}_{out}=A{V}_{in}^{\gamma }$$
(2)

where, the non-negative real input value Vin is raised to the power γ and multiplied by the constant A to get the output value Vout. In the common case of A = 1, inputs and outputs are typically ranging from 0–1.

Sigmoid stretching: Sigmoid function is a continuous nonlinear activation function. Statisticians call this function as the logistic function.

$$g\left(x,y\right)=\frac{1}{1+{e}^{(c(th-{f}_{s}(x,y)))}}$$
(3)

Here, g(x, y): enhanced pixel value, c: contrast factor, th: threshold value and fs(x, y): original image.

The amount of lightening and darkening can be varied to control the overall contrast enhancement by adjusting the contrast factor ‘c’ and threshold value [20].

Image smoothing (bilateral filtering): Image blurring is achieved by convolving the image with a low-pass filter kernel, which helps in removal of noises. Here, a bilateral filter is a Gaussian filter in space and is highly effective in noise removal while keeping edges sharp.

Image sharpening (Kernel method): High-pass filter is used to emphasize the fine details in the image to enhance the sharpness of the image. It is opposite to that of low-pass filter and uses a different convolution kernel than a low-pass filter.

$$Kernel=\left[\begin{array}{ccc}0& -1& 0\\ -1& 5& -1\\ 0& -1& 0\end{array}\right]$$

3 Literature review

It is noted from the literature review that research in the digitization of medical lab reports is still one of the unexplored areas. Most of the available comparative studies are focused on a specific area (such as image-smoothening) in the process of digitizing printed or handwritten data, however, a collective study for a complete process model is needed. It is found out that the study relating to the advantages of medical data digitization is quite abundant, but studies regarding optimal approaches to digitize data relating to specific medical domains are quite insufficient.

Scott et al. [1] discussed how hospitals with the usage of digitized records show a decrease in time to search for records and an increase in time for direct patient care, compared to hospitals with paper records. Suter et al. [2] projected the importance of big data analytics in the digitization of medical data to enable value-based healthcare. To further knowing the nuances of the OCR techniques, [4] and [7] are helpful in understanding all phases of OCR and understanding the problems that are faced during text recognition [3]. Mello et al. [8] detailed the best value of parameters for digitization (resolution, brightness, contrast, number of colors, etc.) that offers an overview of the impact on changing parameters and methods consists of OCR and their accuracy.

In OCR, image binarization helps in representation of textual data to ease the process of character recognition. The binarization in old documents as an image is processed in [23] and enhanced performance is reported. Authors, claimed that the proposed system is better for images with degraded documents. In [24], segmentation and recognition of document image is investigated using Dijkstra’s algorithm and wavelet transform is applied for word segmentation. The accuracy of 98.1% is achieved for word segmentation and 97.6% of accuracy obtained for text-line segmentation, respectively. Subsequently, Tesseract-OCR is adopted for lot number detection in counting the stocks [25]. This system is proposed in hospital during COVID-19 to detect the stock quantity along with lot number of items that persist in hospital.

The novel pre-processing method is presented in [11] for scanned document wherein it detects and corrects the skew. The proposed method, minimizes the area of interest, independent of content and no limitation of skew angle. Subsequently, attentive generative network is employed to solve the problem of image denoising by embedding visual attention [12]. Due to visual attention, the noise region is attracted and forms the balance between removal of noise and preservation of texture. In [15], the detailed study is proposed on improving the output quality for the OCR system. Moreover, it very vital to digitized the medical information [16] to offer suitable techniques to process the medical data under different disease conditions. Table 1 present the detailed findings from the literature review performed.

Table 1 Summary of literature review

4 Method and materials

It is seen that every OCR technique responds differently to the pre-processing techniques, which are applied to the input image. Herein, we have devised an approach to perform a combinatorial and permutation analysis using the set of pre-processing techniques that outputs the most valid sets with the best accuracy. The approach is divided into the following different sub-parts.

Lab reports: The input contains the lab reports which are in Portable Network Graphics (PNG) image format. A collection of lab reports is considered for performing this comparative study.

Image pre-processing techniques: The standalone pre-processing techniques on which the combinatorial and permutation study will be performed. These techniques will then be fed to the optimizer for further calculations.

Optimizer: The optimizer receives the pre-processing techniques as an input and uses the three OCR Engines. It then calculates the accuracy of text extraction after applying each pre-processing technique separately on all OCR engines. The technique whose accuracy is greater than the threshold accuracy (average of all accuracies) is supplied to the ‘Combination Generator’ for performing further analysis. The technique which performs below the threshold accuracy is discarded in this process.

Combination generator: The Combination Generator receives the input from Optimizer. Here, 2n+1 − 1 combinations from the input set are generated, where ‘n’ is the number of pre-processing techniques. For each combination, k! permutations are generated where ‘k’ is the length of a combination. For each permutation, pre-processing techniques are applied to the image in the order specified from the permutation. The image is then fed to each OCR and accuracy is calculated for the text extraction. A dictionary is maintained with key as permutation and value as accuracy which is then utilized to produce the final results. Figure 2 shows an example of how the combination generator gets its input and after that how it generates various combinations.

Fig. 2
figure 2

Pre-processing techniques flow

Accuracy calculator: To calculate accuracy, comparison the ground truth values of text with the extracted values of text from the lab reports is performed. The scoring is on the basis of proximity of extracted word to the original word, which is calculated using the Levenshtein distance. The scoring is explained in Table 2.

Table 2 Levenshtein distance scores
$$Accuracy=\frac{\sum (score)}{Total \,No. \,of \,words \,in \,original \,image}\times 100$$
(4)

Optimized results: Using the dictionaries obtained from the ‘Combination Generator’, results are produced. The results is computed based on various parameters, presented as follows:

  1. 1.

    Accuracy without pre-processing vs accuracy of best combination of pre-processing.

  2. 2.

    Most suitable pre-processing techniques for each OCR.

  3. 3.

    Comparison of execution time of the OCR Engines.

  4. 4.

    Top five combinations of pre-processing for the best performing OCR.

Here, Fig. 3 shows how the aforementioned sub-parts form the flow of the proposed methodology.

Fig. 3
figure 3

Proposed system for identification of medical data

5 Results and discussion

For the proposed methodology, we conducted a detailed comparative analysis of OCR engines and combinations of pre-processing techniques on a dataset of 39 images/lab-reports. The system configuration for performing the experimentation are—Intel i5 8th Generation CPU with 16 GB DDR4 RAM and 4 GB Nvidia 1050 GPU. Here, Fig. 4 presents a comparison between the text extraction accuracy before any pre-processing techniques are applied and after the pre-processing techniques are adopted with the best-identified combination. Based on the OCR text extraction accuracy comparison graph, it is observed that the average accuracy of text extraction is improved by 2.88%. The average text extraction accuracy of 91.9% is reported, after applying the pre-processing techniques. Table 3 shows the most favorable pre-processing techniques for the OCRs. From this work, it is concluded that ‘Tesseract-OCR’ is 87% faster when compare to ‘Easy-OCR’ and 89% in comparison with ‘docTR-OCR’. After ‘Tesseract-OCR’, ‘Easy-OCR’ engine stands in second position in terms searching, which is found to be 12% faster than ‘docTR-OCR’. In Fig. 5, top five combinations of pre-processing technique for Tesseract OCR are presented, whereas Table 4 shows the corresponding pre-processing techniques.

Fig. 4
figure 4

Accuracy before pre-processing vs accuracy after pre-processing

Table 3 Top 2 pre-processing methods for each OCR
Fig. 5
figure 5

Top five combinations w.r.t their frequency for Tesseract-OCR

Table 4 Method combination and the corresponding pre-processing techniques order

Table 5 depicted the overall comparison between the three OCRs based on average accuracy without pre-processing, with pre-processing and execution time (per-image). When the whole dataset is considered, it is found out that ‘Tesseract-OCR’ is 87% faster than ‘Easy-OCR’ and 89% faster in comparison with ‘docTR-OCR’. The ‘Easy-OCR’ is the second fastest engine, after ‘Tesseract-OCR’. Moreover, ‘Easy-OCR’ is 12% faster, compared with ‘docTR-OCR’. In Table 6, compatibility of a given pre-processing method with a given OCR is presented.

Table 5 Overall comparison between the 3 OCRs
Table 6 OCR engine and pre-processing technique compatibility

6 Conclusion and future work

In this paper, we present the detailed study, indicating the noticeable increase in the text extraction accuracy for all the OCR Engines, when the right set of pre-processing techniques are applied before performing the text extraction. We have also identified the favorable pre-processing techniques for each OCR engine and the most optimal combinations amongst these pre-processing techniques to achieve the best text extraction accuracy. This work provides useful insights of a document with a structure similar to that of a medical lab report, where one of the pre-processing techniques can be chosen for optimal results, and that can help for further research in the field of text extraction. In addition, it supports in devising techniques for improved pre-processing, thus, improves the accuracy of text extraction. This work leverages, in optimizing the OCR engines for optimal lab report text recognition.

Due to the limited availability of processing power, experimentation is not performed on a larger dataset. However, experimenting on better computing infrastructure with larger dataset, we can observe better results which will be diverse and may offer the comparison of accuracies over the combinations between OCR engines, etc. Currently, our dataset consists of lab reports in PNG format, which can be extended to other formats. Moreover, apart from targeting on pre-processing combinations, future research could focus on the factors among the combinations that affects the accuracy for a certain OCR engine, if and only if detailed insights are available.