1 Introduction

In recent years, technological development has radically changed our relationship with audiovisual information. Images and videos now occupy an important place in the flow of information we receive every day, and the way we handle this flow inevitably affects our daily lives. Text documents (books, printed articles, patent documents, receipts, magazine pages, etc.) are one form of presenting and organizing information. Lately, taking photos of text documents instead of digitizing them with a scanner has become more and more common due to the popularity and rapid development of smartphones. Several works have been proposed for the detection and recognition of the text contained in document images. Nevertheless, most systems, generally called Optical Character Recognition (OCR) systems, have been dedicated to scanned text documents [61, 64, 65, 68], even though text document images acquired by mobile devices represent a large part of multimedia content. The processing and recognition of this last category of document images is not a trivial task and represents a major challenge [40].

In general, the processing of an image obtained by a smartphone is affected by several factors, such as perspective distortion, which takes place when the plane of the text document to be captured is not parallel to the plane of the smartphone's camera. As a result of this distortion, characters located farther away appear smaller, and the assumption that the edges of the document page form parallel lines no longer holds in the captured image. Mobile devices also have much less control over the lighting conditions of the acquisition environment. Variation in illumination is common, due to the physical environment (shadows, reflective surfaces) and the lack of controlled lighting. A complex background is another problem, constantly faced when the scene to be captured is larger than the text region. Blur is yet another problem that appears when the smartphone's camera focuses on the background rather than the text document, or when the camera moves during acquisition. The processing of images obtained by a smartphone is also affected by the variety of text properties from one document to another, such as sizes, fonts, styles and colors. Figure 1 presents some examples of the problems posed by the acquisition of a document image with a mobile device. All the aforementioned challenges motivated us to propose a system for text recognition in document images obtained by mobile terminals.

Fig. 1
figure 1

Challenges of document images acquired by a mobile terminal [30]. a: Document scanned using a scanner. b: The same document captured by a mobile terminal

Text recognition can be defined as the ability of a smartphone or a computer to transform text data contained in an image into its equivalent symbolic representation, in the form of ASCII text. Typically, a text recognition system comprises three main phases: preprocessing, feature extraction and classification [43]. The preprocessing step is generally applied to improve and enhance the quality of the document image for subsequent analysis. This step can itself be divided into smaller steps such as text detection, perspective correction, segmentation, skew correction and normalization [39]. In the preprocessing phase, a detection method takes a grayscale or color image as input and outputs an image with a rectangle surrounding the text region. The methods proposed in the literature for this task can be classified into three categories [25]: connected component-based methods [22, 62, 63], texture-based methods [16] and gradient-based methods [49, 58]. The perspective correction step transforms the geometry of the image into a perspective different from that in which it was originally captured. Many works address this step. Kim et al. [26] use a discrete representation of text-lines and text-blocks to correct the perspective of document images. Castro et al. [9] propose a method that first detects the text region in the image and then performs perspective correction using a transformation matrix. Liu et al. [31] present an approach based on the segmentation of text into individual lines, after which perspective correction is performed on each line of text [46]. With regard to segmentation, several approaches are reported in the literature, categorized into three strategies: top-down, bottom-up and hybrid [14]. Top-down strategies, such as projection profiles [4], filtering techniques [6, 55] and the Hough transform [46], take the entire text image as input and attempt to divide it into individual text-line images. Bottom-up strategies start from a small-scale level of the image (i.e., pixels) and then use techniques such as clustering [32, 69], function analysis [2] or active contours [42] to locate each text-line area. Other works [51, 56] combine the two strategies to segment the text image into lines. The purpose of the next step is to detect and correct the skew of each text-line image. Many methods have been developed to this end [15, 36]. Generally, these methods first detect the angle of inclination and then correct it by rotating the image by this angle. After skew correction, size normalization aims to minimize the variations between the shapes of the characters and words across the document images of a database. In the case of text-line images, normalization consists in forcing each line image to the same height while maintaining the width/height ratio [23].

Extraction of representative and essential features from an input image is the main key to improving the performance of a recognition system. According to the literature [33], feature extraction methods can be divided into two groups: hand-crafted feature-based methods and automatically learned feature-based methods. In the first group, we can mention the direct use of pixel intensity values as features [20, 35], the number and positions of black pixels [5] and statistical features [24, 41]. We can add other methods such as the histogram of oriented gradients (HOG) descriptor [53], projections of oriented gradients [44] and principal component analysis (PCA) [27]. However, traditional feature extraction methods are very limited, as they require prior knowledge about the position and relevance of features, and they are sensitive to variations in font size, style, color and non-uniform lighting conditions of the text document [33]. In the last few years, automatically learned features, particularly those of the deep convolutional neural network (CNN), have shown better performance than hand-crafted features in various recognition problems such as image classification [60, 67], cancer detection [59], age estimation [10] and pedestrian detection [29]. The deep convolutional neural network is an alternative that does not suffer from the disadvantages mentioned for the previous group of methods. Its deep architecture allows it to learn the most relevant features automatically, which makes it one of the most robust approaches with respect to translation, scaling and distortion. The CNN model has been adapted as a feature extraction method in several text recognition systems, such as scene text recognition [50], offline handwriting recognition [54], video text recognition [66] and historical handwriting recognition [17]. The only drawback of automatically learned features compared to hand-crafted features is that they require more time in the training phase to find the most relevant features [57].

The feature vector extracted in the previous phase is passed as input to an already trained classifier that predicts its class. Numerous approaches have been used to classify text in images. They fall into two groups: segmentation-based approaches and segmentation-free approaches. Approaches in the first group perform explicit segmentation of the text-line image into characters before employing a classifier such as an MLP or an SVM [57] to recognize the class of each individual character. However, the performance of these approaches is strongly related to the quality of the decomposition of the text line into characters: each segmentation error directly decreases the recognition rate of the entire system. Segmentation-free approaches present an alternative to overcome this limitation. They take the whole text-line image as input, as a sequence of images prepared by a sliding window, which allows the text image to be recognized without explicit segmentation. As a result, classification models based on this group of approaches are the most widely used, especially when separations between characters are difficult to determine, as with handwriting, complex backgrounds or overlapping characters. Furthermore, these approaches make it possible to use contextual information, which means that the output at each time step is computed according to the past and future contexts of a character or a word [20]. The two most frequently used models are the Hidden Markov Model (HMM) and the Recurrent Neural Network (RNN). The HMM has been used in many text recognition systems, for example for English text [45] and Arabic text [1]. In recent years, with the rapid advancement of computers and the use of GPUs for computation, the RNN has demonstrated better performance than the HMM in text classification tasks [18]. Numerous recent works have taken advantage of the power of the RNN, for example, and without limitation, the recognition of French text [36], English text [50] and Arabic text [34].

The main contribution of this work is to address the problem of text recognition in images obtained by a smartphone. This is realized through a new system that consists of three main parts. In the first, a preprocessing operation is applied to detect the text region in the document image and then segment this region into individual text-line images. This phase takes into consideration the problems mentioned above, often encountered during the acquisition of images with a mobile device. Inspired by the great success achieved by automatically learned feature-based methods [50], a deep convolutional neural network (CNN) model is used in the second part to extract a sequence of feature vectors from each text-line image. In this way, the system automatically extracts the features that best discriminate between the different classes of characters, rather than relying on a hand-crafted feature-based technique. The last part of the system is dedicated to the classification phase, where we use a bidirectional recurrent neural network (BRNN) combined with the gated recurrent unit (GRU) block [11] and the connectionist temporal classification (CTC) layer [18]. The BRNN-CTC architecture takes the sequence of feature vectors as input and produces a sequence of characters representing the text contained in the text-line image. The BRNN offers the possibility of using past and future context information in the computation, which is useful and indispensable for recognizing text image sequences. Using a CTC layer eliminates the need to segment the text-line image into images of individual characters: the architecture takes the text-line image as input and outputs the corresponding text transcription without explicit segmentation. The proposed system is evaluated on the ICDAR2015 Smartphone document OCR dataset [7]. The main contributions of the proposed system can be summarized as follows:

  • We implement and compare different CNN models to extract features from text-line images according to different widths of a sliding window.

  • We explore various combinations of an RNN or a BRNN with an LSTM [21] or GRU [11] memory block to classify the sequence of feature vectors extracted by the CNN model. As a result, the BRNN-GRU architecture outperforms the BRNN-LSTM architecture and provides a better recognition rate.

  • The experimental tests of the proposed system on the ICDAR2015 Smartphone document OCR dataset [7] show promising results compared to state-of-the-art systems.

The remainder of this paper is structured as follows: Section 2 describes the details of each step of the proposed text recognition system, whereas Section 3 presents the experimental results. Finally, Section 4 presents the conclusion and discusses future work.

2 Proposed system

As shown in Fig. 2, the general architecture of the proposed system includes five major steps: preprocessing, baseline correction, normalization, feature extraction and classification. This section details each step of the proposed system.

Fig. 2
figure 2

Flowchart of the proposed text recognition system

2.1 Preprocessing

The preprocessing phase is essential for detecting the text, eliminating noise and segmenting the text into lines. Figure 3 shows all the operations applied during the preprocessing phase; they are explained in detail in the following parts.

Fig. 3
figure 3

Preprocessing operations

2.1.1 Text detection

In this part, we present the method used to detect the text region. It consists of several steps, the first of which converts the original image into a grayscale image (see Fig. 4b). The second step detects all the contours of the image using Canny's algorithm [8]. Figure 4c shows the result of this algorithm.

Fig. 4
figure 4

The steps for detecting the text region. a: Initial image. b: Gray scale image. c: Contour detection. d: The dilated image. e: The filled image. f: Text region

The goal of the next step is to distinguish the text region from the background of the image. To this end, we apply a dilation technique, whose main idea is to connect and enlarge groups of nearby pixels in an image. In our case, this merges the characters and fills the spaces between the words to form bands corresponding to each line of text (see Fig. 4d). At this point, the lines of text are clearly visible, but there are still black areas between the lines of text and the outlines of the text region. The next step therefore fills these areas with white pixels: if a black region is surrounded by white outlines, the pixels in that region are transformed into white pixels. Figure 4e shows the regions filled after applying this operation.

The last step determines the text region from the regions obtained in the previous step. To achieve this, we rely on the fact that the document page is the main object in all images. Thus, we count the number of pixels in each region of the image and choose the region that contains the largest number of pixels. Figure 4f shows the result obtained after this last step.
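The following Python sketch chains these steps using OpenCV; the Canny thresholds, dilation kernel size and number of iterations are illustrative assumptions and do not correspond exactly to the settings of our implementation.

```python
import cv2
import numpy as np

def detect_text_region(bgr_image):
    """Sketch of the text-region detection pipeline (assumed parameters)."""
    # 1. Convert to grayscale (Fig. 4b)
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)

    # 2. Detect contours with the Canny algorithm (Fig. 4c); thresholds are illustrative
    edges = cv2.Canny(gray, 50, 150)

    # 3. Dilate to merge characters into bands corresponding to text lines (Fig. 4d)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    dilated = cv2.dilate(edges, kernel, iterations=2)

    # 4. Fill black regions enclosed by white outlines (Fig. 4e)
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    filled = np.zeros_like(dilated)
    cv2.drawContours(filled, contours, -1, 255, thickness=cv2.FILLED)

    # 5. Keep the region with the largest number of pixels (Fig. 4f)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(filled)
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])  # skip background label 0
    return (labels == largest).astype(np.uint8) * 255
```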

2.1.2 Perspective correction

During the acquisition of an image, effects like perspective distortion are inevitable. This effect can damage the image and decrease recognition performance. Therefore, perspective correction is a mandatory task to improve the image and reduce its complexity. To perform this task, we use our previous work [12]. The proposed approach is designed for quadrilateral shapes, especially the shape of a document page. It is divided into two phases: detection of the corners of the document page, followed by bilinear interpolation to correct the perspective. Figure 5 shows an example of a document image before and after perspective correction.
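As an illustration only, the sketch below shows a generic homography-based warp given the four detected page corners; it is not the exact interpolation scheme of [12], and the corner ordering is an assumption of this sketch.

```python
import cv2
import numpy as np

def correct_perspective(image, corners):
    """Generic perspective correction from four detected page corners.

    `corners` is assumed to be a (4, 2) array ordered as top-left, top-right,
    bottom-right, bottom-left. This standard homography warp stands in for
    the bilinear interpolation described in [12].
    """
    tl, tr, br, bl = corners
    # Target width and height estimated from the detected quadrilateral
    width = int(max(np.linalg.norm(br - bl), np.linalg.norm(tr - tl)))
    height = int(max(np.linalg.norm(tr - br), np.linalg.norm(tl - bl)))
    dst = np.array([[0, 0], [width - 1, 0],
                    [width - 1, height - 1], [0, height - 1]], dtype=np.float32)
    matrix = cv2.getPerspectiveTransform(corners.astype(np.float32), dst)
    return cv2.warpPerspective(image, matrix, (width, height))
```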

Fig. 5
figure 5

The perspective correction of a document image. a Initial image. b Result

2.1.3 Noise removal

After the text extraction and perspective correction phases, noise appears in certain areas of the image, especially at the edges of the text page. This can cause instability and errors during the segmentation and classification phases. In this work, we use two operations to solve this problem: binarization and marginal noise suppression. Binarization is performed using the Sauvola algorithm. Then, projection histograms are computed to detect the noise of the horizontal (top and bottom) and vertical (left and right) margins. For the horizontal margins, we plot the horizontal projection by counting the number of black pixels along each row of the binary image. From the obtained curve, the locations of the upper and lower marginal noise zones are determined by the locations of the first and last peaks respectively. We then look for the location where the top horizontal cut is made; after several experiments, we chose the location between the first and the second peak. In the same way, we chose the location between the last and the penultimate peak for the horizontal cut at the bottom. To remove noise from the vertical margins, we proceed in the same way, this time counting the number of black pixels along each column. Figure 6 illustrates the results of binarization and marginal noise suppression.
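A minimal sketch of this step is given below, using scikit-image's Sauvola threshold and SciPy's peak detection; the window size, the prominence threshold and the fallback when fewer than three peaks are found are assumptions of this sketch, and the peak selection is a simplification of the procedure described above.

```python
import numpy as np
from skimage.filters import threshold_sauvola
from scipy.signal import find_peaks

def binarize_and_crop_margins(gray, window_size=25):
    """Sketch: Sauvola binarization followed by marginal-noise cropping."""
    # Sauvola binarization: text becomes black (0), background white (255)
    binary = (gray > threshold_sauvola(gray, window_size=window_size)).astype(np.uint8) * 255

    def cut_positions(profile):
        # Cut between the first/second peak and between the penultimate/last peak
        peaks, _ = find_peaks(profile, prominence=profile.max() * 0.2)
        if len(peaks) < 3:                      # assume no marginal noise detected
            return 0, len(profile)
        start = (peaks[0] + peaks[1]) // 2
        end = (peaks[-2] + peaks[-1]) // 2
        return start, end

    black = (binary == 0).astype(np.int32)
    top, bottom = cut_positions(black.sum(axis=1))   # horizontal projection (rows)
    left, right = cut_positions(black.sum(axis=0))   # vertical projection (columns)
    return binary[top:bottom, left:right]
```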

Fig. 6
figure 6

Removing the marginal noise. a Initial image. b Binarized image. c Result

2.1.4 Segmentation

In this part, we address the task of segmenting the text of the image into individual line images. The main problem encountered in this step is the inclination of the text lines with respect to the horizontal axis of the image, as shown in Fig. 7a.

Fig. 7
figure 7

Segmentation of a text image. a Initial image. b Dilated image. c Extraction of connected components. d The final result

To solve this problem, we use an algorithm based on the idea that the text in an image can be grouped into connected components separated by white spaces along each text line [3]. The algorithm takes as input the document image, a threshold S that designates the minimum size of a connected component, and a parameter T that corresponds to the average distance used to decide whether two connected components belong to the same line. First, morphological dilation is applied to extend and reinforce the connected components of the text. This is very useful because it groups the words within each line of text (see Fig. 7b). After this operation, all the connected components (CCs) of the image are extracted (see Fig. 7c). These CCs are grouped into large components R and small components SR (the size of a CC is determined by the number of pixels it contains). Afterwards, all the components of the group R are sorted from top to bottom, and the group SR is initially set aside, since small components are usually far from the main body of the words or characters, which makes the segmentation process more difficult.

In the second part of the algorithm, we select the first connected component R1, located in the highest part of the image, and assign it to line 1. After that, we identify the CC R2 nearest to CC R1. This is done as follows: assuming that (x1, y1) and (x2, y2) are the coordinates of the centroids of R1 and R2 respectively, R2 belongs to the same line as R1 only if the condition |y2 − y1| < T is satisfied, where T is the distance threshold defined above. The same process is repeated until all the connected components are assigned to text lines. In the last part of the algorithm, we assign each small component SRi to the line Li to which it has the shortest distance. Figure 7d illustrates the result obtained using this algorithm (a simplified code sketch is given below).
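The following Python sketch gives a simplified version of this grouping, assuming white text on a black background; the dilation kernel and the default values of S and T are illustrative, and ties and line-ordering details of the full algorithm are omitted.

```python
import cv2
import numpy as np

def segment_lines(binary, size_threshold_S=50, distance_T=20):
    """Simplified sketch of the CC-based line segmentation.

    Returns a list of lines, each line being a list of component labels.
    """
    # Dilation to reinforce and merge the components of each line (Fig. 7b)
    dilated = cv2.dilate(binary, np.ones((3, 15), np.uint8))

    n, labels, stats, centroids = cv2.connectedComponentsWithStats(dilated)
    comps = [(centroids[i][1], i) for i in range(1, n)]                     # (centroid y, label)
    large = [(y, i) for y, i in comps if stats[i, cv2.CC_STAT_AREA] >= size_threshold_S]
    small = [(y, i) for y, i in comps if stats[i, cv2.CC_STAT_AREA] < size_threshold_S]

    # Assign large components to lines, top to bottom, by centroid distance
    lines = []
    for y, i in sorted(large):
        if lines and abs(y - lines[-1]["y"]) < distance_T:
            lines[-1]["labels"].append(i)
            lines[-1]["y"] = np.mean([centroids[j][1] for j in lines[-1]["labels"]])
        else:
            lines.append({"y": y, "labels": [i]})

    # Attach each small component to the closest line
    for y, i in small:
        closest = min(lines, key=lambda line: abs(y - line["y"]))
        closest["labels"].append(i)
    return [line["labels"] for line in lines]
```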

2.2 Baseline correction

The text-line inclination problem often appears after the segmentation phase (see Fig. 8a below). This problem causes difficulties in the classification phase, in which we use a sliding window to extract features: inclined writing can lead to overlapping and unstable characters and words within the sliding window. As a result, correcting the angle of inclination of each line of text is a necessary step before moving on to the next steps.

The inclination correction of a line of text (also called skew correction) consists in straightening the inclined writing and setting it horizontally. In this work, we adopt the technique described in [37] to correct the angle of inclination. This technique converts the text-line image into an image composed of black rectangular strips along the text line, and then uses these strips to identify candidate points for approximating the baseline (see Fig. 8b). Next, a fourth-degree polynomial is used to draw the approximate baseline by fitting a curve through the extracted candidate points. Figure 8c and d below illustrate the baseline plot obtained with the curve-fitting operation.

Fig. 8
figure 8

Baseline correction of a text-line image. a: Initial image. b: Painted strips image. c: Baseline estimate. d: Baseline tracing on the initial image. e Estimation of the angle of inclination of the word “accepted”. f: Correction of inclination of the word “accepted”. g: Image results after correction of the angle of inclination

After the baseline is estimated, the average inclination angle of each block is computed using the locations of the candidate points (pixels) and the intersection points of the baseline with the block. Each block of the text-line image is then rotated by the corresponding inclination angle in order to set the baseline of the image horizontally (see Fig. 8e and f). Figure 8g shows the result of the technique used to correct the inclination angle of the text-line image.
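A minimal sketch of the baseline-fitting and per-block angle estimation is shown below, assuming the candidate points have already been extracted from the painted strips as in [37]; the function names are illustrative.

```python
import numpy as np

def fit_baseline(candidate_points, degree=4):
    """Fit a fourth-degree polynomial baseline through the candidate points.

    `candidate_points` is an (N, 2) array of (x, y) pixel coordinates.
    """
    x, y = candidate_points[:, 0], candidate_points[:, 1]
    coeffs = np.polyfit(x, y, deg=degree)
    return np.poly1d(coeffs)          # call baseline(x) to get the estimated baseline y

def block_skew_angle(baseline, x_left, x_right):
    """Approximate inclination angle (degrees) of the baseline over one block."""
    dy = baseline(x_right) - baseline(x_left)
    return np.degrees(np.arctan2(dy, x_right - x_left))
```

Each block can then be rotated by the negative of its estimated angle (e.g. with an affine rotation) so that the baseline becomes horizontal.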

2.3 Normalization

The size of a text image may vary from one writer to another, from one font to another, and within the same font after enlargement or reduction, which can destabilize recognition performance. Normalization is therefore an important task that aims to minimize variations between the shapes of characters and words in different images. In the case of text-line images, the normalization phase forces line images of arbitrary dimensions to have the same height. The width of the resized image is determined relative to the height of the original image, so this operation is performed without affecting the height-to-width ratio of the text-line image. We observed that normalizing all text-line images to a fixed height of 32 pixels does not have much influence on system performance. The results obtained after normalizing the size of some text-line images are shown in Fig. 9.
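A short sketch of this normalization (fixed height of 32 pixels, width scaled to preserve the aspect ratio) is given below; the interpolation mode is an assumption.

```python
import cv2

def normalize_height(line_image, target_height=32):
    """Resize a text-line image to a fixed height while keeping the aspect ratio."""
    h, w = line_image.shape[:2]
    new_width = max(1, int(round(w * target_height / h)))
    return cv2.resize(line_image, (new_width, target_height),
                      interpolation=cv2.INTER_AREA)
```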

Fig. 9
figure 9

Normalization of text-line images size to a height of 32 pixels while maintaining the aspect ratio

2.4 Features extraction

Feature extraction is the process of taking a dataset of images and building explanatory variables (features). These features are then used to train a machine learning or deep learning model for a classification problem. In particular, the feature extraction step aims to obtain the most relevant features of a text-line image, while avoiding the loss of important and significant information, and subsequently to represent this information as a one-dimensional vector. Inspired by the system proposed in [50], we adapt the CNN model in the feature extraction step in order to build a text-line classifier. The model automatically learns the features that best discriminate between the different classes of characters, using a large training set of text-line images. In general, a deep CNN architecture consists of convolution layers, nonlinear (ReLU) layers, pooling layers and fully connected layers [28]. The rest of this section describes the concept of each layer and the methodology for adapting the CNN to our work.

2.4.1 Convolutional layers

The convolution layer is the key component of a CNN, and is always the first layer. Its purpose is to locate the presence of a set of features in the input image. This is achieved through a succession of filters, also called kernels, which transform the initial image into a set of images called feature maps (or activation maps). Each filter Fj of size \((W_{F_{j}},H_{F_{j}})\) starts at the top-left corner and traverses the entire input image \(I(W_{I},H_{I})\). At every location, an element-wise multiplication is performed between the image patch and the filter; the sum of these products constitutes one pixel of the feature map. In addition to the number and size of filters, there are two common hyperparameters for the convolution layer: stride and padding. The stride indicates the number of pixels by which the filter shifts horizontally or vertically over the input image. For example, a stride of one pixel means moving the filter one pixel at each step to compute the next convolution output. The padding (zero-padding) indicates how many rows and columns of zeros are added around the input image, which makes it possible to control the output size; it is sometimes desirable to keep the same size as the input image. To better understand the convolution layer, Fig. 10 illustrates an example of the convolution operation. In the example, an image I of size (5, 5) (see Fig. 10a) is used with a padding of 1 pixel (see Fig. 10b). The image is convolved with a filter F of size (3, 3) (see Fig. 10c). Figure 10d shows the convolution product computed between the filter F and each portion of the image I. The filter traverses the entire image with a stride of 1 pixel; the result at each step is one of the elements of the feature map matrix.

Fig. 10
figure 10

Example of the convolution operation. a: Initial image. b: Image with padding of 1 pixel. c: Filter. d: The convolution between the filter F and each part of the image with a stride of 1 pixel
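To make the sliding-filter computation explicit, the naive NumPy sketch below reproduces the operation of Fig. 10 (zero-padding followed by a sliding element-wise product and sum); the image and filter values are illustrative, not those of the figure.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=1):
    """Naive 2D convolution: zero-padding, then a sliding dot product."""
    padded = np.pad(image, padding, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    out_h = (padded.shape[0] - kh) // stride + 1
    out_w = (padded.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = padded[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
    return feature_map

# Same dimensions as the example of Fig. 10: 5x5 image, 3x3 filter, stride 1, padding 1
image = np.arange(25).reshape(5, 5)
kernel = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]])
print(conv2d(image, kernel).shape)  # (5, 5): the padding of 1 preserves the input size
```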

2.4.2 Nonlinear layer

After each convolution layer, a Rectified Linear Units (ReLU) layer is added. Its goal is to introduce non-linearity into the CNN by applying the activation function f(x) = max(0, x) to each pixel of the feature map. The concept of this function is fairly simple: every negative pixel value is replaced by 0. ReLU has been widely used due to its ability to learn fast, and it works better than traditional nonlinear functions (tanh or sigmoid) [38]. An example of using the ReLU function is shown in Fig. 11.

Fig. 11
figure 11

Functionality of the Rectified Linear Units layer. a: Feature map. b: Rectified feature Map

2.4.3 Max-pooling layers

Another very powerful tool used by CNNs is pooling. It is a layer often placed between two convolution layers, designed to take a large image and reduce its size while preserving the most important information it contains. This serves two purposes: first, it makes the CNN less sensitive to the position of features and more robust against noise and distortion; second, it improves the efficiency of the CNN and helps avoid overfitting [28]. There are different ways to perform pooling, but in practice the most effective is max pooling [47]. It consists of passing a kernel over the rectified feature map as in convolution, but this time taking the maximum value of every image patch the kernel covers. In practice, a kernel of (2 × 2) pixels is often used with a stride of 2 pixels and no padding. Figure 12 illustrates an example of max pooling applied to the rectified feature map obtained in the previous layer.

Fig. 12
figure 12

Example of a max-pooling operation with a kernel of (2 × 2) pixels, a stride of 2 pixels and without any padding. a: Rectified feature Map. b: Result
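The short NumPy sketch below implements this max-pooling operation with a (2 × 2) kernel and a stride of 2; the input values are illustrative and are not taken from Fig. 12.

```python
import numpy as np

def max_pool2d(feature_map, kernel=2, stride=2):
    """Max pooling: keep the maximum of each kernel x kernel patch."""
    h, w = feature_map.shape
    out_h, out_w = (h - kernel) // stride + 1, (w - kernel) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i * stride:i * stride + kernel,
                                j * stride:j * stride + kernel]
            pooled[i, j] = patch.max()
    return pooled

rectified = np.array([[1, 0, 2, 3],
                      [4, 6, 6, 8],
                      [3, 1, 1, 0],
                      [1, 2, 2, 4]])
print(max_pool2d(rectified))  # [[6. 8.] [3. 4.]]
```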

2.4.4 Fully connected layers

After several convolution and max-pooling layers, the last layer is the fully connected layer. Each neuron in this layer is connected to all the neurons of the previous layer. This layer gathers all the information processed separately in the previous layers and provides the final result (a vector of probabilities) using a sigmoid or softmax activation function. The size of the result vector is equal to the number of classes.

2.4.5 Methodology adaptation

In this paper, we extract the features directly from the text-line image using deep convolutional neural networks. Before that, a sliding window scans the text-line image from left to right using a fixed-length window; this divides the image into sub-images, called frames. This approach requires adjusting two parameters: the shift s and the window size (W × H) (see Fig. 13). We use a shift of 4 pixels, and the height H of the window corresponds to the height of the image (32 pixels). After decomposing the image into a set of overlapping frames, the features of each frame (2D) are extracted using a CNN model to form the corresponding feature vector (1D). A visualization of the sliding window approach and the CNN model is given in Fig. 13.

Fig. 13
figure 13

Representation of the sliding window approach and the CNN architecture that includes five convolutional layers and five max pooling layers

In this work, we implement and compare three different CNN models applied to different frame widths. These models are built according to three frame sizes, 4 × 32, 8 × 32 and 16 × 32, and are referred to hereinafter as CNN4×32, CNN8×32 and CNN16×32. Table 1 presents the architecture of each of these models. The three CNNs are composed of 5 convolutional layers and 5 max-pooling layers (the fully connected layer is removed). Every convolution layer has a filter size of 3 × 3 pixels with a stride of 1 × 1 pixels and a zero padding of 1 × 1 pixels. The numbers of feature maps in these layers are 64, 128, 256, 512 and 512. To introduce non-linearity, a ReLU activation function is applied after each convolution layer. A rectangular kernel of 1 × 2 pixels is applied at the 1st, 2nd and 3rd max-pooling layers for the CNN4×32 model, at the 1st and 2nd max-pooling layers for the CNN8×32 model, and only at the 1st max-pooling layer for the CNN16×32 model. A square kernel of 2 × 2 pixels is used for the remaining max-pooling layers. All the max-pooling layers use a stride of 1 × 2 or 2 × 2 pixels without padding. Finally, the CNN model generates 1 × 1 × 512 feature maps, which are concatenated into a single vector containing the features extracted from the input frame. To avoid the training problems of deep convolutional layers, the batch normalization technique [61] is used: two batch normalization layers are added after the 3rd and 4th convolutional layers respectively. This technique speeds up the training process and reduces overfitting. The operation is repeated for the remaining frames. At the end of this step, each text-line image is represented by a sequence of feature vectors, which is the input of the recurrent neural network employed in the classification step.
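As an illustration, the PyTorch sketch below reproduces our understanding of the CNN8×32 column of Table 1 (our experiments use Torch7, so this is not the actual implementation); the placement of batch normalization before the ReLU is an assumption.

```python
import torch
import torch.nn as nn

class CNN8x32(nn.Module):
    """Sketch of the CNN8x32 feature extractor: one 8x32 frame -> one 512-d vector."""

    def __init__(self):
        super().__init__()
        def conv(in_c, out_c, bn=False):
            layers = [nn.Conv2d(in_c, out_c, kernel_size=3, stride=1, padding=1)]
            if bn:
                layers.append(nn.BatchNorm2d(out_c))
            layers.append(nn.ReLU(inplace=True))
            return layers

        # MaxPool2d kernels are given as (height, width); (2, 1) is the 1x2 rectangular kernel
        self.features = nn.Sequential(
            *conv(1, 64),             nn.MaxPool2d(kernel_size=(2, 1)),  # 32x8 -> 16x8
            *conv(64, 128),           nn.MaxPool2d(kernel_size=(2, 1)),  # 16x8 ->  8x8
            *conv(128, 256, bn=True), nn.MaxPool2d(kernel_size=2),       #  8x8 ->  4x4
            *conv(256, 512, bn=True), nn.MaxPool2d(kernel_size=2),       #  4x4 ->  2x2
            *conv(512, 512),          nn.MaxPool2d(kernel_size=2),       #  2x2 ->  1x1
        )

    def forward(self, frames):
        # frames: (batch, 1, 32, 8) grayscale frames cut by the sliding window
        return self.features(frames).flatten(1)  # (batch, 512)

model = CNN8x32()
frames = torch.randn(10, 1, 32, 8)   # e.g. 10 overlapping frames from one text-line image
sequence = model(frames)             # (10, 512): one feature vector per frame
```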

Table 1 The structure of each CNN model. Output, maps, F, K, S and P denote respectively: the feature map size, the number of feature maps, the filter size, the kernel size, the stride and the padding

2.5 Classification

The classification and recognition phase is often the last step; it involves assigning a label, or class, to the feature vector. Generally, in recognition systems the classification process requires a preliminary training (learning) stage, where the classifier learns from the training samples, allowing it to predict the classes of new samples in the classification phase. In this work, the sequence of feature vectors obtained in the previous step is used in the classification process, which assigns the appropriate transcription to the sequence of vectors. We propose an architecture that consists of a bidirectional recurrent neural network (BRNN) with the gated recurrent unit (GRU) block [11] and the connectionist temporal classification (CTC) layer [18]. This architecture makes it possible to build a model that learns how to classify each sequence of vectors.

The Recurrent Neural Network (RNN) is a deep learning architecture built from artificial neural networks, originally proposed by J.L. Elman [13]. The RNN differs from the classical neural network because it includes cyclic connections, which can dynamically model the contextual information of a sequence. The output of an RNN is calculated using both the input at the current instant t and the output of the hidden layer at instant t − 1. Specifically, consider an RNN with X input neurons, H hidden neurons and Y output neurons. At time t, the state of the hidden layer ht−1 at time t − 1 and the input xt at time t are passed to an activation function in order to calculate the state of the hidden layer ht at time t. The following equation formalizes this function:

$$ \begin{array}{@{}rcl@{}} h_{t} = f(w_{hh}h_{t-1}+w_{xh}x_{t}+b) \end{array} $$
(1)

Where f(·) is a nonlinear activation function (sigmoid, tanh, ReLU, ...), whh is the weight matrix that links the hidden layer at time t − 1 to the hidden layer at time t, wxh is the weight matrix that links the input layer to the hidden layer at time t, and b is the bias vector of the hidden layer. Then, the output yt at time t is calculated as follows:

$$ \begin{array}{@{}rcl@{}} y_{t} = f(w_{hy}h_{t}+b) \end{array} $$
(2)

Where why corresponds to the hidden-to-output weight matrix. One of the ideas introduced in the literature to increase the amount of context information is to exploit, at time t, the time dependencies of both the past and the future. This is done using an RNN model with two hidden layers: one traverses the input sequence from left to right, while the other runs through it in the opposite direction. This model is known as the bidirectional recurrent neural network (BRNN) [48]. The states of the hidden layers and of the output layer in this model are calculated as follows:

$$ \begin{array}{@{}rcl@{}} {h_{t}^{f}} &=& f(w_{h^{f}h^{f}}h_{t-1}^{f}+w_{xh^{f}}x_{t}+b^{f}) \end{array} $$
(3)
$$ \begin{array}{@{}rcl@{}} {h_{t}^{b}} &=& f(w_{h^{b}h^{b}}h_{t+1}^{b}+w_{xh^{b}}x_{t}+b^{b}) \end{array} $$
(4)
$$ \begin{array}{@{}rcl@{}} y_{t} &=& f(w_{h^{f}y}{h_{t}^{f}}+w_{h^{b}y}{h_{t}^{b}}+b_{y}) \end{array} $$
(5)

Where \({h_{t}^{f}}\) and \({h_{t}^{b}}\) respectively represent the hidden layers in the forward and backward directions.

The BRNN is trained in the same way as a classical neural network, using the backpropagation algorithm. However, whatever the learning algorithm used, the main disadvantage of a BRNN is the vanishing gradient problem [19]: the gradient decreases very rapidly during backpropagation, which makes learning long-term dependencies very difficult. One solution proposed to avoid this problem is to add a memory block mechanism to the BRNN. This block is positioned at the hidden layer and includes one or more memory units that give the network the ability to memorize or forget long-term or short-term information. Two types of block are found in the literature: the LSTM (Long Short-Term Memory) [21] and the GRU (Gated Recurrent Unit) [11].

The LSTM block consists of a memory cell for storing information and three control gates: an input gate, an output gate and a forget gate (see Fig. 14a). The input gate controls the importance of the current input state; a state is considered relevant if this gate gives a value close to 1. The importance of the previous state on the current state of the memory cell is controlled by the forget gate. The output gate controls the importance of the current state on the rest of the network (higher layers and subsequent time steps). Generally, the LSTM introduces a linear dependence between the memory cell ct and its previous state ct−1 at each hidden layer in order to control the information flow in the network. The operations of an RNN with an LSTM block are written as follows:

Fig. 14
figure 14

Visualization of Long short-term memory (LSTM) and Gated Recurrent Unit (GRU) architecture [11]. a: LSTM gates: input (I), forget (f) and output (o). b: GRU gates: update (z) and reset (r)

$$ \begin{array}{@{}rcl@{}} i_{t} &=& \sigma(w_{hi}h_{t-1}+w_{xi}x_{t}+b_{i}) \end{array} $$
(6)
$$ \begin{array}{@{}rcl@{}} f_{t} &=& \sigma(w_{hf}h_{t-1}+w_{xf}x_{t}+b_{f}) \end{array} $$
(7)
$$ \begin{array}{@{}rcl@{}} o_{t} &=& \sigma(w_{ho}h_{t-1}+w_{xo}x_{t}+b_{o}) \end{array} $$
(8)
$$ \begin{array}{@{}rcl@{}} g_{t} &=& \tanh(w_{xg}x_{t}+w_{hg}h_{t-1}+b_{g}) \end{array} $$
(9)
$$ \begin{array}{@{}rcl@{}} c_{t} &=& f_{t} \odot c_{t-1} + i_{t} \odot g_{t} \end{array} $$
(10)
$$ \begin{array}{@{}rcl@{}} h_{t} &=& o_{t} \odot \tanh(c_{t}) \end{array} $$
(11)

Where it, ft, ot, gt, ct and ht respectively denote the input gate, the forget gate, the output gate, the candidate cell input, the memory cell state and the hidden layer output at time t. The symbol ⊙ represents the element-wise product of vectors. The w denote the weight matrices connecting the different gates and layers, and the b denote the corresponding bias vectors.

The GRU block [11] has recently been proposed as a simpler alternative to the LSTM unit, sharing the same objective of avoiding long-term dependency problems. It contains two gates that estimate the flow of information inside the block, but no separate memory cell. The gates of a GRU are the update gate and the reset gate (see Fig. 14b). The role of the update gate is similar to that of the LSTM's forget gate: it controls the importance of the previous state on the current state of the network, while the reset gate allows the GRU block to ignore information that is not relevant for the next time steps. The GRU block is usually defined by the following equations:

$$ \begin{array}{@{}rcl@{}} z_{t} &=& \sigma(w_{hz}h_{t-1}+w_{xz}x_{t}+b_{z}) \end{array} $$
(12)
$$ \begin{array}{@{}rcl@{}} r_{t} &=& \sigma(w_{hr}h_{t-1}+w_{xr}x_{t}+b_{r}) \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} \overset{\sim}{h}_{t} &=& \tanh(w_{xh}x_{t}+w_{hh}(r_{t} \odot h_{t-1})+b_{h}) \end{array} $$
(14)
$$ \begin{array}{@{}rcl@{}} h_{t} &=& z_{t} \odot h_{t-1} + (1-z_{t}) \odot \overset{\sim}{h}_{t} \end{array} $$
(15)

Where zt, rt, \(\overset{\sim}{h}_{t}\) and ht are respectively the update gate, the reset gate, the current candidate state and the hidden layer output.
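To make Eqs. (12)-(15) concrete, the minimal NumPy sketch below computes one GRU step; the dictionary layout of the weights and biases is an assumption of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, w, b):
    """One GRU step implementing Eqs. (12)-(15).

    `w` holds the weight matrices, e.g. w = {"hz", "xz", "hr", "xr", "hh", "xh"},
    and `b` the bias vectors {"z", "r", "h"}.
    """
    z_t = sigmoid(w["hz"] @ h_prev + w["xz"] @ x_t + b["z"])               # update gate (12)
    r_t = sigmoid(w["hr"] @ h_prev + w["xr"] @ x_t + b["r"])               # reset gate  (13)
    h_tilde = np.tanh(w["xh"] @ x_t + w["hh"] @ (r_t * h_prev) + b["h"])   # candidate   (14)
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde                             # new state   (15)
    return h_t
```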

A typical RNN contains a very large number of parameters, and poor adjustment of these parameters can lead to overfitting: the network works effectively on the training data but does not generalize well on the test data. To remedy this problem, the dropout layer was introduced [52]. This layer is used during learning: at each iteration, a certain percentage of the neurons of a layer are randomly disabled to artificially reduce the number of parameters. This allows the RNN to learn more general parameters that do not focus on the details of the training dataset. Once the learning is complete, all the neurons are reactivated.

Even if an RNN with a GRU or LSTM memory block is able to model long-term dependencies, it requires pre-segmented training data so that the network can learn to provide the correct output at every time step. Therefore, the input feature vector xt at each instant t has to be assigned to the corresponding output target (the corresponding character class). However, in text recognition, no character segmentation is performed, and the feature sequences extracted with a sliding window are not separated. Graves et al. [18] provided a solution to this problem by introducing a specific layer, termed the connectionist temporal classification (CTC) layer, which extends the use of an RNN to unsegmented data sequences. The essential role of the CTC layer is to compute, at each time step, the posterior probability of an output (character class) for an unsegmented input sequence. It only requires presenting the sequence of feature vectors together with the target sequence of characters during the training phase.
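The PyTorch sketch below illustrates how the classification stage (bidirectional GRU, linear projection and CTC loss) can be assembled; the hidden size of 200 and the single recurrent layer reflect the parameter study of Section 3.2, while the number of character classes and PyTorch's nn.GRU/nn.CTCLoss (in place of our Torch7 implementation) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class BRNNGRUClassifier(nn.Module):
    """Sketch of the classification stage: bidirectional GRU + linear layer + CTC."""

    def __init__(self, feature_size=512, hidden_size=200, num_classes=100):
        super().__init__()
        self.brnn = nn.GRU(feature_size, hidden_size, num_layers=1,
                           bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, num_classes)   # num_classes includes the CTC blank

    def forward(self, feature_sequence):
        # feature_sequence: (T, batch, 512), one CNN vector per sliding-window frame
        outputs, _ = self.brnn(feature_sequence)
        return self.fc(outputs).log_softmax(dim=2)          # (T, batch, num_classes)

# Training with the CTC loss: no per-frame character alignment is needed
model = BRNNGRUClassifier()
ctc_loss = nn.CTCLoss(blank=0)
features = torch.randn(75, 4, 512)                   # 75 frames, batch of 4 text lines
log_probs = model(features)
targets = torch.randint(1, 100, (4, 30))              # ground-truth character indices
input_lengths = torch.full((4,), 75, dtype=torch.long)
target_lengths = torch.full((4,), 30, dtype=torch.long)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```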

Figure 15 shows the general architecture used in the classification step. It takes the text-line image as input and outputs the corresponding text transcription without any explicit segmentation.

Fig. 15
figure 15

General overview of the CNN-BRNN architecture. The architecture includes three main parts: 1) the CNN model, which extracts a sequence of feature vectors from the input text-line image; 2) the BRNN network (input, hidden and output layers), which predicts the corresponding class of each feature vector at each time step; 3) the CTC layer, which transforms the sequence of output predictions into the final sequence of characters

3 Experimental results

This section is devoted to presenting the results. We start by giving an overview of the database and the evaluation measures used. Next, we study the impact of different parameters and configurations on the results of the recognition. Finally, we compare the results obtained with those obtained by other existing systems on the same database.

3.1 Datasets and evaluation measures

In order to evaluate the performance of the proposed system, we perform the experiments on the ICDAR2015 Smartphone document OCR dataset [7]. This dataset contains 12100 images, separated into two subsets: the sample set (3630 images) and the test set (8470 images). In each image, the text is formatted as a single column with a particular font and size. The images were acquired by two smartphones (Samsung Galaxy S4 and Nokia Lumia 920) under diverse conditions of lighting, blur and perspective. In our study, the training phase is carried out on the sample set, which gives a total of 162186 lines after segmentation, while the testing phase is performed on the test set (335740 lines).

All our experiments were performed on an Ubuntu 16.04 LTS (64-bit) operating system installed on a computer with an Intel Core i7-4770 CPU, 16 GB of RAM and an NVIDIA GeForce GTX 980 with 4 GB of memory. The MATLAB R2014 environment was used to implement the preprocessing, baseline correction and normalization steps. For the feature extraction and classification steps, we used the deep learning framework Torch7.

The character recognition accuracy percentage (CRA %) is used to evaluate the performance of the proposed system. The CRA is determined using the edit distance (Levenshtein distance), estimated by counting the minimum number of character insertions, deletions and substitutions needed to transform the text output by the recognition phase into the corresponding ground-truth text. The CRA is therefore calculated as follows:

$$ \begin{array}{@{}rcl@{}} CRA &=&\frac{n-ED}{n} \times 100 \end{array} $$
(16)
$$ \begin{array}{@{}rcl@{}} Edit \ distance &=& ED = I + D + S \end{array} $$
(17)

Where n represents the total number of characters in the ground-truth text, and I, D and S are respectively the numbers of characters inserted, deleted and substituted in the output text sequence of the proposed system.
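The short Python sketch below computes the edit distance by dynamic programming and the CRA of Eq. (16); the example strings are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[n][m]

def character_recognition_accuracy(ground_truth, recognized):
    """CRA as in Eq. (16): (n - ED) / n * 100."""
    n = len(ground_truth)
    return (n - edit_distance(ground_truth, recognized)) / n * 100.0

print(character_recognition_accuracy("recognition", "recogniti0n"))  # one substitution
```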

3.2 Parameter Setup

The optimal performance of a system requires a rigorous selection of parameters. In our case, four parameters have to be tuned. The first refers to the CNN model used in the feature extraction step. The second is related to the architecture of the RNN (unidirectional or bidirectional) used in the classification step. The other two concern the type of memory block (GRU or LSTM) and the number of memory units in the hidden layer of the RNN.

The first evaluation focuses on the first two parameters. We evaluated and compared three CNN models: CNN4×32, CNN8×32 and CNN16×32. The three models are used in the feature extraction phase, and each was combined in the classification stage with either a unidirectional recurrent neural network (RNN) or a bidirectional recurrent neural network (BRNN). This gives 6 architectures, which are evaluated on the training and test datasets. The results obtained by the different architectures are presented in Table 2. Based on these results, we notice that the CNN8×32 model achieves the best performance compared to CNN4×32 and CNN16×32. More precisely, the CNN4×32 model combined with the RNN or BRNN achieves a CRA of 78.16% and 84.96% respectively; CNN8×32 combined with the RNN or BRNN reaches a CRA of 88.03% and 94.49% respectively; and CNN16×32 combined with the RNN or BRNN achieves a CRA of 87.21% and 92.13% respectively. These results clearly show that applying a deep convolutional neural network to a frame of 4 × 32 pixels does not perform well. This is because such a frame is small and poor in content, which generates features that do not help to discriminate correctly between the classes of characters in the classification step. We also note that increasing the frame size does not necessarily improve the CRA significantly and can demand a large amount of computation. Regarding the classification step, the results show that the unidirectional RNN performs worse than the bidirectional BRNN. This confirms that using the past context together with the future context is important to predict the appropriate transcription of the input sequence of feature vectors. Therefore, we combine the CNN8×32 model with the BRNN architecture in the following experiments.

Table 2 Comparison of character recognition accuracy with different CNN models and RNN architectures

The main purpose of the second experiment is to study the influence of the choice of memory block (GRU or LSTM) and of the number of memory units in the hidden layer (hidden layer size) on both the character recognition accuracy and the training time. To this end, we conducted a series of four tests with two BRNN models: one containing the GRU memory block and the other the LSTM. For each model, we tested several hidden layer sizes: 100, 150, 200 and 250. These sizes were chosen, on the one hand, to minimize the recognition error rate of the BRNN-GRU or BRNN-LSTM and, on the other hand, to find the optimal size.

Figure 16 compares the recognition error rate curves of the BRNN-GRU and BRNN-LSTM models for the different hidden layer sizes. The training time after 20000 iterations with respect to the size of the hidden layer is shown in Fig. 17.

Fig. 16
figure 16

Recognition error curves as a function of the type of memory block and the size of hidden layer

Fig. 17
figure 17

Training time as a function of the type of memory block and the size of hidden layer

From the results, it can be noted that the type and number of memory blocks in the hidden layer affect the recognition accuracy in the different tests, and that accuracy improves as the size of the hidden layer increases. From Figs. 16 and 17 we can draw two remarks: first, the BRNN-GRU model outperforms the BRNN-LSTM model and achieves a better recognition rate; second, the GRU block converges faster than the LSTM block and reaches a high recognition rate with fewer iterations. In addition, increasing the size of the hidden layer decreases the error rate, but it also increases the time required for the training phase.

Table 3 shows the results of all experiments on the sample dataset after 20000 iterations. The best result is achieved by the BRNN-GRU model with a hidden layer of 250 units, with a CRA of 97.28%, but the improvement over 200 units is not significant. Therefore, we choose the GRU memory block and a hidden layer size of 200 units for the rest of this work.

Table 3 Character recognition accuracy on training and test dataset using CNN8x32 BRNN-LSTM or CNN8x32 BRNN-GRU architecture with different hidden layer sizes

3.3 Comparison with Existing Systems

In this experiment, we evaluate the capacity and robustness of the proposed system by comparing it to the results of the groups that participated in the ICDAR2015 smartphone document OCR competition [7]. We conducted this comparison using the same CNN8×32 BRNN-GRU system evaluated in the previous experiments. The character accuracies obtained by the proposed system as well as by the systems that participated in the ICDAR2015 challenge are presented in Table 4.

Table 4 Comparison of the proposed architecture and existing methods in terms of character accuracy
Table 5 Successful examples. For each example, we show the input text-line image from the original document image (input), the result after the preprocessing stage (Pre-p.), the output after the classification stage (output) and the ground truth (G.T.)
Table 6 Failure examples. For each example, we show the input text-line image from the original document image (input), the result after the preprocessing stage (Pre-p.), the output after the classification stage (output) and the ground truth (G.T.)

The CCC system uses an RNN-LSTM model trained on both sharp and blurry text-line images. An RNN-LSTM model trained only on binarized text-line images is employed in the A2iA method. The methods proposed by LRDE and Digiform use the ABBYY FineReader Engine, whereas CartPerk uses the Tesseract OCR engine to perform the recognition. All the above-mentioned methods perform the recognition task after applying a preprocessing stage, while the FineReader method carries out the recognition with ABBYY FineReader Engine 11 without applying any pre- or post-processing. All of these methods use non-learned features extracted with hand-crafted feature-based methods.

From the results of all participating methods, we notice that the CCC method provides the best character accuracy with 99.93%, compared to our method, which reaches 96.76%. The high accuracy of the CCC method can be explained by the fact that the RNN-LSTM employed is trained with a data augmentation technique, which means that the classifier is trained on the images both before and after the preprocessing step. However, as mentioned earlier, a high computational cost is required to train the LSTM; in addition, training it on two types of images makes it computationally very demanding. The proposed method is more accurate than the other methods that use OCR engine tools (LRDE, Digiform, CartPerk, FineReader) and also than the A2iA method, which performs the recognition with an RNN-LSTM model. These results also show the advantage of using learned features extracted by a CNN model: thanks to its deep architecture, the CNN extracts the most relevant features automatically and efficiently, and achieves better performance than the other methods, which rely on hand-crafted, non-learned features.

Tables 5 and 6 illustrate some examples of preprocessing and recognition results when the proposed system is applied to various input text-line images. In Table 5, the text contained in the line images has been successfully preprocessed and recognized despite different sizes, colors, styles, lighting conditions, perspective distortions and the presence of skewed or slanted text. In Table 6, the input images are affected by focus or motion blur caused by an unfocused or moving camera; consequently, recognition errors occur due to the degraded text quality.

4 Conclusion

In this paper, we have presented a novel system based on a combination of a deep Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) to build a text recognition architecture for document images obtained by a smartphone. The system begins with a preprocessing step that detects and prepares the text-line images. Then, different CNN models were explored to extract discriminative features from each text-line image. Moreover, we used a BRNN integrating a GRU or LSTM memory block and compared the character recognition accuracy obtained with various hidden layer sizes. The experiments indicate that the CNN8×32 model performs best when combined with the BRNN-GRU architecture and achieves a high recognition rate with fewer iterations and less computation time.

In future work, we would like to improve the recognition results by adding more efficient methods to address the problem of blur distortion. We also plan to propose a language model and a dictionary, to be integrated into a post-processing phase, which should further increase the recognition rate.