1 Introduction

Arabic is a very important international language. The number of Arabic speakers exceeds half a billion (native and non-native) around the world. Arabic is the second most used after the Latin script [67].

Arabic has 28 alphabetical letters, all representing consonants, and are written from right to left [62]. Twenty-two of these letters are those of the Semitic alphabet from which it descended [44]. They are modified only in the letterform. The remaining six letters represent sounds not used in the languages written in the earlier alphabet [100].

The shape of each Arabic letter depends on its position in a word: initial, middle, end, or isolated as shown in Table 1. There are no uppercase nor lowercase distinctions as it is in the Latin characters [125]. Arabic is written cursively both typed and printed [85]. This means, intuitively, the Arabic recognition system needs letter segmentation [43].

Table 1 Arabic alphabet characters with different positions

Recognizing Arabic writings opens the area of recognizing similar languages that are written with the same script. These similar languages are fourteen including Persian, Urdu, and Pashto [14].

In recent years, there is much interest in the area of handwritten document recognition, especially in Arabic. Among the handwritten and printed documents, the automatic handwritten document recognition is more challenging [35].

Handwritten characters that are written by different persons are not identical and vary in both sizes and shapes. These numerous variations in the writing styles of the individual characters make the recognition process more difficult and challenging [7].

The similarities among distinct character shapes, the overlaps, the ligatures [38], and the interconnections of the neighboring characters [103] further complicate that recognition task.

Little research progress has been achieved compared to what has been done with Latin and Chinese languages [3, 5]. Research in Arabic optical character recognition is still a wide-open area [73].

Arabic texts and their segmentation and recognition challenges [19] can be summarized as follows:

  • Arabic text is written typed and printed cursively in blocks of interconnected characters. The cursive nature of the Arabic characters and different forms of Arabic characters makes it more difficult to achieve high accuracy in character classification and recognition.

  • One or more blocks can be used to form an Arabic word. This leads to the necessity of an adaptive Arabic text segmentation algorithm.

  • Arabic characters in addition to their isolated form can take different shapes depending on their position inside the text: initial, middle, end, or isolated as shown for the letter “Ayn” in Table 1.

  • Some characters can be written vertically or horizontally which may lead the character blocks to have more than one baseline or different baselines’ locations as shown in Fig. 1. This prevents the simple adaptations of the published baseline detection methods related to the English or Chinese languages.

  • Arabic has diacritical marks that must be dealt with [55].

  • Arabic uses different types of external objects such as dots (single, double, or triple), “Hamza”, and “Madda” [39].

  • The same character can have substantial length variation in each different text as shown in Fig. 2.

  • Handwritten characters vary in both sizes and shapes and some combinations of the Arabic characters have unique forms.

  • There are not enough public and available Arabic handwritten text and characters datasets to work on compared to the Latin language.

Fig. 1
figure 1

Four shapes of five arabic letters “Alif”, “Baa”, “Taa”, “Tha”, and “Jim” with baselines

Fig. 2
figure 2

Three different shapes of the word “Mohammed ()”

Pattern recognition is a branch that covers various fields such as face recognition, fingerprint recognition, image recognition, citywide traffic crowd flows, character recognition, and numerals recognition [10, 11, 61].

Arabic handwritten character recognition (AHCR) system is an intelligent system that can classify and recognize handwritten characters similar to humans [94]. The input to the AHCR system is the text required to be recognized and the output is the corresponding recognized text in the digital form. Figure 3 shows sample input and output of an AHCR system.

Fig. 3
figure 3

Sample input and output of the AHCR system

There have been different machine learning methods and techniques that are used for offline handwritten characters classification and recognition [8] such as support vector machines [16, 110] and K-nearest neighbor [66]. The traditional machine learning methods extract the features manually using feature engineering [63]. After that, classification algorithms are used to classify the characters based on the extracted features.

Deep learning (DL) is a trending research machine learning branch [46] that includes different algorithms and architectures such as: Deep Neural Networks [30], Convolutional Neural Networks (CNNs) [70], Deep Belief Networks [56], Autoencoders [21], and Recurrent Neural Networks [108]. They are capable of extracting the features automatically [78].

Deep neural networks consist of an input layer, multiple non-linear hidden layers, and an output layer. The number of connections and trainable parameters can become very large and the network does not fit in the device memory especially with images and videos [83]. The deep neural network needs also a very large dataset to avoid overfitting [31].

A CNN has a comparatively smaller set of parameters (i.e. weights) and easier to train [77]. It is a multi-layer feed-forward neural network that extracts salient features and properties from the input data automatically. It uses the neural network back-propagation algorithm [57] in its training.

A CNN adds the new dimension for image classification systems and recognizing visual patterns directly from pixel images with minimal pre-processing [4]. Besides, It automatically provides some degree of translation invariance [29]. It can learn from high-dimensional complex inputs, non-linear mappings from a very large number of records [80].

Another major advantage of CNN is the usage of the shared weights in the convolutional layers. This means that the same filter is used for each input in the layer [76]. The shared weights reduce the parameter numbers and improve the overall performance [18]. Recently, CNN is found more efficient for handwritten character recognition due to its distinct features [112].

Instead of learning the CNN from scratch, the transfer learning (TF) approach can be used. Transfer learning is a machine learning approach that focuses on the gained knowledge and experience storage while working on a task or a problem [93]. It can be applied to different related problems.

This TF approach is engaged in deep learning by working on these stored pre-trained models as the starting points. This allows rapid progress and performance improvement [79].

The pre-trained model can be achieved by two common paths (1) choosing a related predictive modeling task with an abundance of data where there are relationships in the input and output data or (2) selecting a pre-trained source model from the available ones released by companies, research institutions, universities, etc. [137].

There are many available pre-trained CNN models such as VGG16 [113], ResNet [52], MobileNet [58], Xception [27], NASNet [138], and DenseNet [59]. These pre-trained models were trained on the ImageNet [33]. It is an very large scale hierarchical image database (over 14 million images). It was designed for visual object recognition software research and computer vision tasks [102].

Selecting between the different hyperparameters (e.g. batch size, parameters initializers, and activation functions) used in the training of CNN models can lead to a non-deterministic polynomial-time hard (NP-hard) problem [49].

Determining the optimal set of training hyperparameters to achieve the maximum performance can be considered as an optimization problem that can be solved using soft computing algorithms such as genetic algorithms (GAs) and fuzzy logic systems [99].

Hence, the current study presents an AHCR system for text segmentation and recognition. The recognition uses CNNs and TF for the parameters optimizations and uses GA for the hyperparameters optimizations.

The main contributions of the current study can be summarized as follows:

  • Discussing the Arabic handwritten character recognition system.

  • Presenting the text segmentation challenges and suggesting a set of solutions for each challenge.

  • Suggesting different native CNN architectures for the problem of AHCR.

  • Applying transfer learning for the AHCR problem using different pre-trained CNN models.

  • Proposing a TF and genetic algorithm approach, HMB-AHCR-DLGA, to optimize the AHCR training parameters and hyperparameters.

  • Performing different experiments and reporting the best architectures using different performance metrics. Reporting the effectiveness of the best ones.

The rest of the paper is organized as follows: Section 2 presents the related works. Section 3 presents the problem formulation, discusses the text segmentation task, and outlines the different challenges with suggested solutions, presents the classification task using native and transfer learning approaches, and proposes the TF and GA approach to optimize the AHCR training parameters and hyperparameters. Section 4 presents the experiments, results, and discussions, and finally, Section 5 concludes the paper and presents future work.

2 Related work

According to the available and public publications, many researchers addressed the classification and recognition of letters including Arabic. On the other hand, there are a smaller number of methods and approaches used for the Arabic language that had been explored for recognizing the individual characters. Hence, Arabic optical character recognition is less common compared to English, French, and Chinese [12].

Various methods and approaches were presented and high recognition rates were achieved and reported for the handwritten recognition of English [18, 132], Chinese [51, 126, 131, 135, 136], Hangul [40], Malayalam [87], Devanagari [2, 114], and Telugu [116] using CNN.

Shams et al. [111] presented an algorithm for recognizing Arabic characters based on deep CNN and support vector machine. They used the correct classification rate (CRR) and error classification rate (ECR). Their proposed system used a K-means clustering algorithm to check the Arabic characters’ similarity. They proposed a sample dataset. They applied 840 tested images from 3 persons where each one wrote each character 10 times. Their reported rates were 95.07% CRR and 4.93% ECR.

Altwaijry et al. [13] presented a new dataset (named “Hijja”) of Arabic letters written that were compiled by children aged 7 to 12. It contained 47,434 characters written by 591 participants. They proposed an automatic handwriting recognition architecture based on CNN. They trained their model on “Hijja” and “AHCD” [36] datasets. They reported accuracies of 97% and 88% on the “AHCD” and “Hijja” datasets respectively.

Kef et al. [68] presented a Fuzzy ARTMAP [48] neural network. The fuzzy ARTMAP neural network is an incremental supervised learning classifier. Five Fuzzy ARTMAP neural networks were used to classify characters using the “IFN/ENIT” database [95]. The “IFN/ENIT” database contains 3,840 handwritten character images (2,304 for training and 1,536 for testing). The reported recognition rate was 93.80%.

ElAdel et al. [37] presented a Neural Network architecture based on the Fast Wavelet Transform, Multi-Resolution Analysis, and Adaboost algorithm. The learning and testing of the Arabic handwriting character classification system were conducted using the “IESK-arDB” [41] dataset which includes 6,000 segmented characters. The classification rate for the different groups of characters was 93.92%.

Elleuch et al. [40] introduced an Arabic handwritten character’s recognition using Deep Belief Neural Networks. The approach was tested on the “HACDB” database [74]. The “HACDB” database contains 6,600 shapes of handwritten characters written by 50 persons. The dataset is divided into a training set of 5,280 images and a test set of 1,320 images. The classification result was promising with an error rate of 2.1% (97.90% accuracy).

Lawgali et al. [75] proposed a framework for the recognition of handwritten Arabic words based on segmentation. It involved two phases: training and testing phases. In the training phase, They trained Arabic handwritten characters to be recognized in the next phase, while in the testing phase, they segmented words into characters for recognition. They used the IFN/ENIT database [95]. Their reported accuracy was 90.73% based on their used segmentation method.

Mozaffari et al. [87] proposed a method for the isolated Farsi/Arabic handwritten numerals recognition. They used quad tree-based fractal representation and iterated function systems. The reported recognition rate was 92.60% and was obtained on their numeral database. Their database contained 480 samples per digit and was gathered from nearly 200 people. They had different ages and educational backgrounds.

Al-Shaher et al. [6] presented training point distribution models using the expectation-maximization algorithm. They designed an optical character recognition system and trained their system using eight Arabic letters. They used a small database tested on 7-character classes. They used 100 samples of each character and reported a successful recognition rate of 98.30%.

Amin et al. [15] used conventional machine learning methods. They used rule-based stroke types to extract the relationships automatically. Their inductive learning program generated first-order Horn clauses for characters. The system was tested on their sample handwritten characters that were compiled from several individuals. It contained 120 characters of 40 samples (30 for training and 10 for testing). Their reported accuracy was 86.65%.

Abuhaiba et al. [1] presented a text recognition system that was capable of recognizing offline handwritten Arabic cursive text. Their system was tested against the handwritings of 20 subjects. Their reported accuracies were 55.40% and 51.10% for sub-word and character recognition rates respectively.

Table 2 summarizes the presented related works and studies. The table is ordered from the latest to the oldest.

Table 2 Character recognition systems related works and studies summarization

3 Problem formulation and suggested approach

Arabic handwritten character recognition (AHCR) system accepts the input scanned physical image that contains the Arabic text and outputs the corresponding digital form of the text. The AHCR system consists of (1) image acquisition phase, (2) image pre-processing phase, (3) image segmentation phase, (4) classification phase, and (5) output phase [9] as shown in Fig. 4.

Fig. 4
figure 4

The phases of an AHCR system

The image acquisition phase converts the image from the physical form into a digital one. There are different image acquisition tools and equipment such as scanners and cameras.

The image pre-processing phase targets to refine the image globally and improves its quality for the next phases [42, 117]. There are many pre-processing techniques such as noise removal [86], grayscale conversion, and slant and skew correction [26, 115].

The image segmentation phase performs three internal sub-phases. They are (1) page-to-lines segmentation, (2) line-to-words segmentation, and (3) word-to-characters segmentation [32, 92].

An Arabic handwritten recognition system can depend only on the first two sub-phases if the classifier (in the fourth phase) works on the words instead of the characters.

The classification phase accepts each character’s sub-image and outputs the corresponding digital format of that character by passing it through the trained model.

Finally, the output phase presents the result to the user by merging the characters into words, words into lines, and lines into a page concerning their order from the image segmentation phase.

3.1 Pre-processing phase

The pre-processing phase is essential in enhancing the quality of the images (in our case). They are numerous techniques that can be used including grayscale conversion, binary conversion, noise removal, slant and skew correction, and histogram equalization.

The grayscale conversion is a technique that converts from the color format into the grayscale format. There are different methods such as average and weighted method (i.e. luminosity method) [105].

The binary conversion is a technique that converts the image into a monochromatic black and white image (i.e. bi-level or two-level) [72].

The slant and skew correction technique is used to deskew the image to return it to its original orientation. Several studies addressed different solutions and reviews that target that problem [17, 65, 69].

The noise removal (i.e. image denoising) technique targets the removal or reduction of the unrequired elements from the image. There are numerous methods such as linear, min, max, median, Wiener, Gaussian, and Guided filtering [130].

3.2 Text segmentation phase

The text segmentation phase accepts a textual image and extracts the characters from it. The text segmentation phase passes through three sub-phases: (1) text-to-lines segmentation, (2) lines-to-words segmentation, and finally, (3) words-to-characters segmentation. The output of this phase should be the Arabic letters that should be passed after that to the recognition phase.

3.2.1 Segmentation challenges

There are different challenges associated with the text segmentation phase. Some of them can be handled and refined in the pre-processing phase such as noise removal and the others require additional processing [90]. They can be presented as follows:

  • Differing Acquisition Methods: The image file format is not specified and the phase is required to be able to handle the different image formats from different sources such as scanners, digital cameras, and mobile cameras.

  • Different Encodings: The phase should not be affected by the information and lost details by the different lossy and lossless encoding methods [124].

  • Varying Lighting: The image acquisition environment can vary in lighting conditions.

  • Varying Orientations: The text can be skewed or rotated and the text segmentation phase should detect and correct that.

  • Varying Text Color: The text required to be detected and segmented can include multiple colors in the same image.

  • Varying Text Font: The text is not specified to be computer printed or handwritten and the text segmentation phase should handle both of them.

  • Varying Text Size: The text size can vary in the same image.

  • Varying Relative Size: The relative size of the text to the image size is not specified and the module should be able to handle any variations.

  • Multiple Objects on the Same Image: Non-textual element can be part of the textual image.

  • Non-Clear Background: Text can be written on a writing line, not on a blank white paper.

  • Non-Uniform Letter Box: Since handwritten text segmentation is targeting the bounding box, the letter can be in a non-rectangular form and can overlap with other letters boxes [128].

  • Limited Processing Time: The text segmentation phase is a part of a bigger system or a framework. Hence, it should be functioning in a near real-time manner.

The main challenge in the current study is segmenting Arabic connected cursive text as the rest of the challenges can be avoided or were solved previously in the previous studies.

3.2.2 Segmentation suggested solution

A standard optical character recognition module design, a multi-stage module, is followed. The task is divided into multiple sub-phases (i.e. steps). It makes the design simpler, however, the errors accumulate over the steps. The steps in the suggested solution are as follows:

  • Step 1: Page Layout Analysis: This step detects the text and non-text regions of the textual image.

  • Step 2: Image Normalization: To make processing time acceptable, the image is normalized by scaling it to an acceptable size in the current step.

  • Step 3: Binarization and Noise Removal: This step should remove the non-textual parts from the textual image and making the text in a solid black color.

  • Step 4: Skew Correction and Rotation: The image skew is detected and corrected by rotating it.

  • Step 5: Finding Writing Lines: This step targets the baselines on the different writing lines.

  • Step 6: Segmenting Connected Components: The connected components can be a character or multiple characters connected together (i.e. sub-words).

  • Step 7: Generating Multiple Possible Final Segmentations: These segmentations will be passed to the recognition module to be evaluated.

  • Step 8: Final Segmentation Choice: Based on the score returned from the recognition module for each segmentation, this step will decide which segmentation is the best to be the final one.

Table 3 summarizes the challenges with their solutions.

Table 3 Segmentation challenges and solutions summarization

3.2.3 Main challenge discussion

This is mainly handled in steps 7 and 8. In step 7, the segments are generated based on the features extracted from the connected component. A set of rules are then used to form these segments based on the features. These features can be structural features such as loops, endpoints, X-cross points, and T-cross points, or statistical features such as moments and Fourier descriptors [54].

In step 8, the tag scores are used to determine how likely the segmentation is valid. A suggested function would be based on Bayes’ rule [121] by treating each score as probability and trying to find the minimum noise probability.

The following example summarizes the discussion (the numbers and tags here are manually generated to give an illustration and not the actual results). For the 7th step, Fig. 5 shows one possible connected component box and Fig. 6 shows multiple possible generated segmentations. The multiple segmentations are passed to the recognition module and the module assigns tags and numbers corresponding to each tag as shown in Table 4.

Fig. 5
figure 5

One possible connected component box for the word “

Fig. 6
figure 6

Multiple possible segmentations for the word “

Table 4 Example on the segmentation tags and their corresponding numbers

For the 8th step, the goal of this function is to maximize the letter score and minimize the noise score in the overall segmentation. A function is designed to determine the best segmentation based on the tags and scores.

Some suggested functions: (1) Minimizing the Sum of Noise: The sum of noise score is calculated for each possible segmentation and the one having the least noise sum is selected to be the accepted segmentation element and (2) Minimizing the Root of the Mean Squares Summation (RMSS) as shown in (1). They require no information other than the segmentation and the noise scores.

$$ RMSS = \sqrt{\sum{\left( \frac{Noise Score}{Number of Letter Boxes}\right)^ 2}} $$
(1)

3.3 Classification phase

As mentioned earlier, the classification phase accepts each character’s sub-image and outputs the corresponding digital format of that character by passing it through the trained model. The model can be trained from scratch or using the transfer learning approach with the help of the pre-trained CNN models.

This subsection is divided into three inner subsections: (1) designing CNN models from scratch, (2) using pre-trained models (transfer learning approach), and (3) proposing a TF and genetic algorithm (GA) optimization approach for the best training parameters and hyperparameters selection.

3.3.1 Designing CNN models

A Convolutional Neural Network (CNN) is a multi-layer neural network designed to analyze the visual inputs and perform tasks. The input data in our case is the images X as shown in (2).

$$ X = [x_{1}, x_{2}, ..., x_{j}, ..., x_{n}] $$
(2)

where X is the whole dataset, xj is a dataset entity (i.e. element) at a position j, and n is the number of elements in X. Formally, the input to a convolutional layer is an image with a shape shown in (3).

$$ shape(x_{j}) = M \times M \times C $$
(3)

where M is the size of the image (i.e. width and height) and C is the number of channels per pixel. For a grayscale image, it has one channel and hence C = 1 but Red-Green-Blue (RGB) images have three channels and hence C = 3.

CNN is composed of several kinds of layers:

Convolutional layers

The convolutional layer has K filters (i.e. kernels) each filter has a shape that can be computed from (4).

$$ shape(Conv Layer) = N \times N \times R $$
(4)

where N is the size of the filter (i.e. width and height) and R is the number of channels. R should be equal to or less than C. Convolutional layer creates K feature maps of size MN + 1. They are used to predict the class probabilities for each feature by applying a filter that scans the whole image, a few pixels at a time.

Figure 7 illustrates the convolution process on an input image of size M × M with a kernel of size N × N. If the input is 32 × 32 and the kernel is 5 × 5, then the output of the convolution is 28 × 28.

Fig. 7
figure 7

Illustration of the convolution process on an input image of size M × M with a kernel of size N × N

Pooling layers

Each feature map is then pooled typically with mean-, min- or max-pooling over a Q × Q where Q is a range between 2 to 5 for large inputs.

Figure 8 illustrates the pooling process on the feature map that produces a new feature map of size (MN + 1)/2. If the input is 28 × 28, then the output of the pooling layer is 14 × 14.

Fig. 8
figure 8

Illustration of the pooling process on the feature map

Pooling layers are used to scale down the amount of information the convolutional layer generates for each feature and maintains the most essential information. Notice that, the process of the convolutional and pooling layers usually are repeated several times [106].

Fully-connected layers

After the convolutional and pooling layers, one or more fully-connected layers exist(s). The high-level reasoning in the neural network is done via the fully-connected layers.

A fully-connected layer accepts all neurons from the previous layer and connects them to every single neuron it has. Fully-connected layers are not spatially located anymore, so there can be no convolutional layers after a fully connected layer.

Activation functions

Activation functions are non-linear transformations used to determine whether a neuron should be activated or not. There are several activation functions such as Sigmoid, ReLU (Rectified Linear Unit), and Leaky ReLU [91].

The activation function that is used in the current study for the hidden layers of the designed CNN models is ReLU [89].

ReLU is a piece-wise linear function that will output the input directly if it is positive, otherwise, it will output zero. It has become the default activation function for neural network types because the architecture that uses it is easier to train and often achieves better performance than others [50].

The SoftMax classifier function is a generalization of the binary form of Logistic regression [34]. It is commonly used for multi-class classification problems. The mapping function is defined such that it takes an input dataset X and maps them to the output class labels via a simple linear dot product of the data X and weights W.

Learning weights optimization

Optimization algorithms and techniques help us to minimize (or maximize) an objective function (i.e. the error function). An error function is simply a mathematical function that is dependent on the model’s internal learnable parameters. It is used to compare the target values of Y and model predictions.

The gradient descent algorithm [101] updates the parameters to minimize the error function. It takes small steps in the direction of the negative gradient of the loss function as shown in (5).

$$ P_{i+1} = P_{i} - \alpha \times \nabla \times E(P_{i}) $$
(5)

where i refers to the iteration number, ∇ is the learning rate, E is the error function, P is the parameter vector and Pi is the loss function. The gradient of the loss function, ∇× E(Pi), is evaluated using the entire training dataset. The standard gradient descent algorithm uses the entire dataset at once.

The learning rate is a hyperparameter that controls how much we are adjusting the trainable weights of the network concerning the calculated loss gradient. The lower the learning rate value, the slower we move along the downward slope and the more time it takes.

Adam (Adaptive Moment Optimization Algorithm) [71] combines the heuristics of both the Momentum [129] and RMSProp [22] optimization algorithms. They take contrasting approaches. The Momentum optimizer accelerates the search in the direction of the minima while the RMSProp optimizer impedes the search in the direction of oscillations.

There are other weights optimizers such as Nadam [119], AdaGrad [88, 122], AdaDelta [133], AdaMax [60], and FTRL [82].

The authors prepared 14 CNN architectures after try-and-error trials where 3 of them achieved the best results as shown later in the experiments and results section. The 14 CNN architectures are labeled from 1 to 14, so that, the first one is CNN-1, the second one is CNN-2, and so on.

Figures 910, and 11 show the top-3 suggested CNN architectures CNN-5, CNN-12, and CNN-11 respectively for Arabic handwritten character recognition. The layers’ sizes and types are labeled on the figures.

Fig. 9
figure 9

A suggested CNN architecture, CNN-5, to classify the arabic handwritten characters

Fig. 10
figure 10

A suggested CNN architecture, CNN-12, to classify the arabic handwritten characters

Fig. 11
figure 11

A suggested CNN architecture, CNN-11, to classify the arabic handwritten characters

All of the suggested CNN architectures have the last output fully-connected layer with 115 classes. This number refers to the number of classes of the used dataset. The used dataset is described later in the experiments and results section. The used activation functions in the hidden and output layers are ReLU and SoftMax respectively.

Figure 9, CNN-5, is described as follows: INPUTCONVRELUPOOLCONVRELUPOOLCONVRELUPOOLFCRELUFC. The input layer accepts images with a size of 32 × 32. The second layer, C1, is a convolutional layer with a size of 30 × 30 and 64 parallel filters. The third layer, S2, is a max-pooling layer with a size of 15 × 15 and 64 parallel kernels.

The fourth layer, C3, is a convolutional layer with a size of 13 × 13 and 128 parallel filters. The fifth layer, S4, is a max-pooling layer with a size of 6 × 6 and 128 parallel kernels. The sixth layer, C5, is a convolutional layer with a size of 4 × 4 and 256 parallel filters.

The seventh layer, S6, is a max-pooling layer with a size of 2 × 2 and 256 parallel kernels. The eighth layer, FC7, is a fully-connected layer with a number of neurons of 1024. The last layer, FC8, is another fully-connected layer with a number of neurons of 115.

Figure 10, CNN-12, is described as follows: INPUTCONVRELUPOOLCONVRELUPOOLCONVRELUPOOLFCRELUFCRELUFC. The input layer accepts images with a size of 32 × 32. The second layer, C1, is a convolutional layer with a size of 30 × 30 and 64 parallel filters. The third layer, S2, is a max-pooling layer with a size of 15 × 15 and 64 parallel kernels.

The fourth layer, C3, is a convolutional layer with a size of 13 × 13 and 128 parallel filters. The fifth layer, S4, is a max-pooling layer with a size of 6 × 6 and 128 parallel kernels. The sixth layer, C5, is a convolutional layer with a size of 4 × 4 and 256 parallel filters. The seventh layer, S6, is a max-pooling layer with a size of 2 × 2 and 256 parallel kernels.

The eighth layer, FC7, is a fully-connected layer with a number of neurons of 100. The ninth layer, FC8, is another fully-connected layer with a number of neurons of 100. The last layer, FC9, is also another fully-connected layer with a number of neurons of 115.

Figure 11, CNN-11, is described as follows: INPUTCONVRELUPOOLCONVRELUPOOLCONVRELUPOOLFCRELUFC. The input layer accepts images with a size of 32 × 32. The second layer, C1, is a convolutional layer with a size of 30 × 30 and 64 parallel filters. The third layer, S2, is a max-pooling layer with a size of 15 × 15 and 64 parallel kernels.

The fourth layer, C3, is a convolutional layer with a size of 13 × 13 and 128 parallel filters. The fifth layer, S4, is a max-pooling layer with a size of 6 × 6 and 128 parallel kernels. The sixth layer, C5, is a convolutional layer with a size of 4 × 4 and 128 parallel filters. The seventh layer, S6, is a max-pooling layer with a size of 2 × 2 and 128 parallel kernels.

The eighth layer, FC7, is a fully-connected layer with a number of neurons of 50. The last layer, FC8, is also another fully-connected layer with a number of neurons of 115.

3.3.2 Transfer learning approach

VGG16 [113], VGG19 [113], and MobileNetV2 [104] pre-trained CNN models are used in the current study as the base models. The ImageNet [33] weights are used with them to initiate their weights.

Each pre-trained model consists of a set of layers internally and we can control which layers to train and update their weights. For example, freezing all the pre-trained model layers unless the last two layers. In the training (in the current study), the last two layers’ weights will be updated while the rest of the layers will remain with their initial weights.

By decreasing the number of trainable layers, the number of trainable weights will decrease and hence the training time. This may lead to a noticeable decrease in the accuracy and hence selecting a suitable learning percent for the pre-trained model layers is necessary.

A global average pooling layer [28], dropout layer [118, 127], and a fully-connected layer are added after the pre-trained model’s last layer. Similar to max-pooling layers, the global average pooling layers are used to reduce the spatial dimensions (i.e. dimensionality reduction) but in an extreme manner where the feature maps layer with dimensions MN + 1 × MN + 1 × R is reduced in size to have dimensions of 1 × 1 × R as shown in Fig. 12.

Fig. 12
figure 12

Illustration of the global average pooling process on the feature maps

Dropout layer

Dropout is a regularization technique that refers to dropping some of the neurons of the layer that precedes this layer as shown in Fig. 13. Dropout layers are important in training the CNNs as they prevent overfitting [134] on the training data.

Fig. 13
figure 13

Illustration of the dropout process

Selecting a suitable dropout ratio is necessary as selecting a very high ratio will decrease the number of trainable neurons and may lead to a decrease in accuracy.

3.3.3 Proposed TF and GA parameters and hyperparameters optimization approach (HMB-AHCR-DLGA)

From the previous subsections, different DL training hyperparameters are required to be optimized. They are the (1) DL weights optimizers, (2) dropout ratios, (3) batch sizes, and (4) transfer learning model learning ratios in the current study.

Depending on the grid search [53] only, it will lead to a × b × c × d tries where a, b, c, and d are the numbers of DL weights optimizers, dropout ratios, batch sizes, and TF model learning ratios respectively. Hence, a suitable optimization approach for these hyperparameters is required to be applied.

A genetic algorithm (GA) [123] approach is used to solve this issue and to achieve the best combinations faster than the grid search or native searching approach.

GA has different important features [84] such as (1) it combines different solutions to boost the best one, (2) it is used to solve many problems effectively, (3) simpler to implement and deploy, and (4) requires fewer computations compared to other state-of-the-art approaches [45].

The proposed genetic algorithm approach named HMB-AHCR-DLGA is described in detail in this subsection. Initially, all chromosomes are initiated randomly concerning the given ranges (i.e. DL weights optimizers, dropout ratios, batch sizes, and TF model learning ratios). They are constructed as a matrix where each row is a chromosome.

A GA is applied to the whole population. The algorithm steps include the objective function calculation, sorting, selection, crossover, and mutation. After a set of iterations, the final best chromosome that includes the combination with the highest objective function value is achieved upon completion. The steps are described comprehensively in the following paragraphs.

Initial poulation

The initial population is created as a matrix where each row represents a chromosome. The size of the population is initiated as Np. Each chromosome (i.e. row) has a set of genes (i.e. columns). The size of each chromosome is set to 4.

They are (1) the first column is for a DL weights optimizer, (2) the second column is for a dropout ratio, (3) the third column is for batch size, and (4) the fourth column is for a TF model learning ratio as shown in Fig. 14.

Fig. 14
figure 14

Graphical representation of the GA population

The used DL weights optimizers are Adam, Nadam [119], AdaGrad [88, 122], AdaDelta [133], AdaMax [60], RMSProp, SGD, and FTRL [82]. The TF model learning ratio is ranged from 0% to 60% with a step of 5% and hence there are 13 learning ratios.

The batch size is ranged from 32 to 128 with a step of 32 and hence there are 3 sizes. The dropout ratio is ranged from 0% to 60% with a step of 1% and hence there are 61 dropout ratios.

Objective function

Each chromosome is applied to the objective function to determine the corresponding score. The more the value, the better the chromosome. The used objective function is a multi-objective function consisting of the deep learning performance metrics [64].

They are accuracy, loss, F1-score, recall, the area under curve (AUC), and precision. The more the accuracy, F1-score, recall, and precision, the better the deep learning model. The lower the loss, the better the deep learning model. To solve the loss value issue to maximize it, the value of 1/loss is computed instead of the loss value.

The used objective function is designed in a way to maximize and integrate the objectives. To solve the multi-objective optimization problem, the weighted sum approach (WSA) [81] is used. It multiplies each objective by a weight w and finds the sum of them as shown in (6).

$$ \begin{array}{ll} Value = w_{1} \times {Accuracy} + w_{2} \times {\frac{1}{Loss}}~+ \\ w_{3} \times {Recall} + w_{4} \times {F1}~+ \\ w_{5} \times {Precision} + w_{6} \times {AUC} \end{array} $$
(6)

where w1, w2, w3, w4, w5, and w6 are the weight values and their summation equals 1. The accuracy is most intuitive performance metric [23] and hence its weight w1 is set to be the highest. The used values for w1, w2, w3, w4, w5, and w6 in the current study are set to 0.5, 0.1, 0.1, 0.1, 0.1, and 0.1 respectively.

The values of accuracy, recall, F1-score, precision, and AUC are computed after training the transfer learning pre-trained model on the chormosome columns’ values using (7), (8), (9), and (10) [109].

$$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$
(7)
$$ Recall = \frac{TP}{TP + FN} $$
(8)
$$ F1 = \frac{2 * Precision * Recall}{Precision + Recall} $$
(9)
$$ Precision = \frac{TP}{TP + FP} $$
(10)

where TP, TN, FP, and FN are the true positive, true negative, false positive, and false negative respectively.

Sorting

After computing the objective function for each chromosome in the population set, they are sorted in a descending order concerning the objective function values.

Selection

The new population is created in this phase by performing three steps. The first step is to divide the chromosomes into two halves where the first half survives and remains in the new population while the second half passes to the second step.

The second step is the crossover [98]. It is the operation of exchanging between two random parents.

To perform a crossover between two parents and generate a new chromosome, the first parent extracts its first two genes (DL weights optimizer and dropout ratio) while the second parent extracts its last two genes (batch size and TF model learning ratio) as shown in Fig. 15. The generated chromosome is passed to the third step.

Fig. 15
figure 15

Graphical representation of the GA crossover process

The third step is the mutation [107]. It is the operation of providing diversity in the search space. It applies a small change in the mutated chromosome. It is preferred to be a low rate to avoid randomness in the search space.

The used mutation rate in the current study is set to 25% which means that only a single gene from the four genes in the mutated chromosome is changed randomly from the available options.

figure f

Iterations

The objective function calculation, sorting, selection, crossover, and mutation are computed iteratively for Ns times. Algorithm 1 summarizes the proposed approach.

4 Experiments, results, and discussion

The experiments in this section are divided into two types. The first type is experiments related to the proposed native CNN models while the second type is experiments related to the proposed transfer learning and genetic algorithm, HMB-AHCR-DLGA, approach.

Table 5 summarizes the experiments’ configurations. They with their corresponding results will be discussed in the following subsections.

Table 5 Experiments configurations summarization

4.1 Dataset

A large and public dataset named “HMBD” [20] is used in the current study. It is composed of 54,115 images in which each is of size 32 × 32 pixels. It is partitioned into 115 classes and used for both training and testing. It can be downloaded and used from https://github.com/hossambalaha/hmbd-v1 or https://www.kaggle.com/hossammbalaha/hmbd-v1-arabic-handwritten-characters-dataset

Data augmentation is applied to the dataset to increase the accuracy and reduce the overfitting. The data augmentation used methods are image rotation, shifting, zooming, and shearing [97].

4.2 CNN architectures

The objective is to choose the best model that fits the dataset well. As mentioned, many try-and-error trials in the network configuration tuning mechanism are performed. This resulted in 14 different CNN architectures, CNN-1 to CNN-14, which are compiled and trained after that.

The main difference between them is the hierarchy of building the architecture layers. The best CNN architecture is selected from them. Then, different combinations of weight initializers and optimizers are applied to the best-selected architectures. The later process aims to report the most suitable weight initializers and optimizers. The training process is performed on Google Colab [24] with its Graphical Processing Unit (GPU) environment.

To select the best and leading architectures, the 14 suggested architectures, CNN-1 to CNN-14, are trained and the best three are selected using two scenarios (1) without data augmentation and (2) with data augmentation.

The number of epochs and batch size are 50 and 32 respectively. The default optimizer and weight initializer are Adam and Glorot Uniform [96] respectively.

Tables 6 and 7 show the reported results for the 14 different CNN architectures without data augmentation and with data augmentation respectively.

Table 6 Results of the 14 CNN architectures after training on HMBD without data augmentation
Table 7 Results of the 14 CNN architectures after training on HMBD with data augmentation

The CNN architectures CNN-12, CNN-14, and CNN-11 reported the top-3 testing accuracy architectures for the first scenario (without data augmentation), and CNN-12, CNN-11, and CNN-5 reported the top-3 testing accuracy architectures for the second scenario (with data augmentation).

Architecture CNN-9 failed to reach valuable accuracy after applying data augmentation on it. This abnormal result means that the model fell in the local minima and could not get out from it.

Figure 16 shows the curve of the 14 architectures in the two scenarios. The blue curve is for the first scenario while the orange curve is for the second scenario. The x-axis represents the 14 architectures and the y-axis represents the testing accuracies.

Fig. 16
figure 16

Testing accuracies before and after data augmentation of the 14 architectures

The top-3 CNN architectures in the second scenario are trained after that on sets of different optimizers and weight initalizers to report the best combinations. The used DL optimizers are Adam, NAdam [119], AdaGrad [88, 122], AdaDelta [133], AdaMax [60], RMSProp, and SGD.

The used weight initalizers are: Glorot Uniform [96], Random Normal, Random Uniform, Truncated Normal, Variance Scaling [47], Orthogonal LeCun Uniform, Glorot Normal [120], and He Uniform.

Tables 89, and 10 report the top-10 training experiments (concerning the testing accuracies) performed on the CNN architectures CNN-12, CNN-11, and CNN-5 respectively using different weight initializers and optimizers applied on the augmented version of the HMBD dataset. The tables are ordered in descending order according to the testing accuracies.

Table 8 Training the CNN architecture no. 12 using different optimizers and weights initializers on HMBD with data augmentation
Table 9 Training the CNN architecture No. 11 using different optimizers and weights initializers on HMBD with data augmentation
Table 10 Training the CNN architecture no. 5 using different optimizers and weights initializers on HMBD with data augmentation

The CNN architecture, CNN-12, reported the highest testing accuracy value of 88.17% with a testing loss value of 0.38 using AdaMAx optimizer and Glorot Normal weight initializer. Figure 17 shows a graphical representation of the testing and training accuracies for CNN-12. The graph shows that the testing accuracy precedes the training accuracy and hence there is an underfitting with a maximum difference of 5.96%.

Fig. 17
figure 17

Graphical representation of the testing and training accuracies for CNN-12

The CNN architecture, CNN-11, reported the highest testing accuracy value of 87.70% with a testing loss value of 0.46 using Adam optimizer and He Uniform weight initializer. Figure 18 shows a graphical representation of the testing and training accuracies for CNN-11. The graph shows that the training accuracy precedes the testing accuracy and hence there is an overfitting with a maximum difference of 5.30%.

Fig. 18
figure 18

Graphical representation of the testing and training accuracies for CNN-11

The CNN architecture, CNN-5, reported the highest testing accuracy value of 91.96% with a testing loss value of 0.23 using SGD optimizer and LeCun Uniform weight initializer. Figure 19 shows a graphical representation of the testing and training accuracies for CNN-5. The graph shows that the testing accuracy precedes the training accuracy and hence there is an underfitting with a maximum difference of 1.94%.

Fig. 19
figure 19

Graphical representation of the testing and training accuracies for CNN-5

This reports the CNN architecture, CNN-5, as the best architecture among others as it achieved the highest testing accuracy of 91.96%.

4.3 The HMB-AHCR-DLGA

The objective is to optimize the hyperparameters to evaluate the best combination set. VGG16 [113], VGG19 [113], and MobileNetV2 [104] pre-trained CNN models with the ImageNet [33] weights. The used DL optimizers are Adam, NAdam [119], AdaGrad [88, 122], AdaDelta [133], AdaMax [60], RMSProp, SGD, and FTRL [82].

The images are used in the RGB format with the size of 32 × 32 × 3. The number of epochs and number of GA iterations are set to 64 and 10 respectively. The training process is performed on Google Colab [24] with its GPU environment and Toshiba Qosmio X70-A with 32 GB of RAM and NVIDIA GeForce GTX 770M GPU. The used batch sizes are [32,64,128]. The used TF model learn ratios are [0,5,10,15,20,25,30,35,40,45,50,55,60]. The used dropout ratios are [1,2,...,60].

Five experiments are performed as follows three experiments are performed on VGG16, one on VGG19, and one on MobileNetV2. Table 11 reports the best combinations with the corresponding hyperparameters for each experiment.

Table 11 Experiments results using the proposed HMB-AHCR-DLGA approach

AdaMax is reported to be the best DL weight optimizer in three experiments. The batch size of 64 is reported to be the best in three experiments. The dropout ratio of 10% is reported to be the best in three experiments.

The TF model learning ratio 60% is reported to be the best in three experiments. Two experiments using VGG16 (first and third) report testing accuracies better than the highest testing accuracy in the native CNN models experiments (91.96%).

The lowest testing loss is 0.374 by the first VGG16 experiment. The highest testing accuracy is 92.88% by the third VGG16 experiment. The highest testing F1-score is 93.21% by the third VGG16 experiment. The highest testing precision is 93.71% by the third VGG16 experiment.

The highest testing recall is 92.74% by the third VGG16 experiment. The highest testing AUC is 99.24% by the second VGG16 experiment.

As mentioned, the highest testing accuracy is 92.88% which is achieved by the third VGG16 experiment using AdaMax weights optimizer, 64 batch size value, 0% dropout ratio, and 60% TF model learning ratio. It can be considered as the best combination using the VGG16 pre-trained TF model for the HMBD dataset.

It worth mentioning that, the HMB-AHCR-DLGA approach reported higher testing accuracy by the third VGG16 (92.74%) than the suggested abstract CNN model, CNN-5 (91.96%). However, the CNN-5 architecture is less complex than the third VGG16 concerning the number of parameters and neurons. Hence, it requires less memory, processing, and time in the prediction phase.

As the difference between them is less than 1%, we can select the third VGG16 if we are targeting the prediction accuracy and we can select the CNN-5 architecture if we are targeting the prediction speed.

5 Conclusions and future work

Handwritten Character Recognition for Arabic characters is an active research area that always needs an improvement in results. The study presented different AHCR systems concerning the related work. In this study, CNNs are used for the recognition of Arabic handwritten characters problem using the 28 Arabic characters. CNN is a well-known state-of-the-art algorithm for classifying images and big data.

The deep learning-based AHCR system is developed. The study presented the different challenges related to text segmentation by describing a set of solutions that can be further improved in future studies.

The segmented elements should be entered into the recognition phase to be converted from handwritten characters format into the printed ones. The system can be divided into two main phases. The first one is to divide the given handwritten text into segments. The second is to convert the segments into printed ones.

From the trials, it is very difficult to use a general formula to produce text segments due to the cursive nature and ligatures of the Arabic language. Handwriting varies from a person to another, different lighting, and many other presented reasons. The solution for each of these challenges has been suggested.

In the second phase, the classification phase, different experiments in two paths were performed and the results are reported.

For the first experimental path, 14 CNN architectures were suggested after lots of trials. The training was performed on the HMBD dataset. Two scenarios (1) without data augmentation and (2) with data augmentation were applied to the architectures.

The testing accuracy was used as the performance metric to judge the best architecture. Different optimization methods were also applied to increase the CNN performance. The top-3 in each scenario were reported. The CNN architectures CNN-12, CNN-14, and CNN-11 reported the top-3 testing accuracies for the first scenario, and CNN-12, CNN-11, and CNN-5 reported the top-3 testing accuracies for the second scenario.

The top-3 CNN architectures were trained after that on sets of different optimizers and weight initializers. The CNN architecture, CNN-12, reported the 88.17% as the highest testing accuracy value with a testing loss value of 0.38 using AdaMAx optimizer and Glorot Normal weight initializer. The CNN architecture, CNN-11, reported 87.70% as the highest testing accuracy value with a testing loss value of 0.46 using Adam optimizer and He Uniform weight initializer.

The CNN architecture, CNN-5, reported 91.96% as the highest testing accuracy value with a testing loss value of 0.23 using SGD optimizer and LeCun Uniform weight initializer. The CNN architecture, CNN-5, reported the best architecture among others as it achieved the highest testing accuracy value 91.96%

For the second experimental path, a TF and GA approach (named “HMB-AHCR-DLGA”) was proposed to optimize the training parameters and hyperparameters. The HMBD dataset with data augmentation was used in the training. Accuracy, recall, F1-score, AUC, and precision were used as the performance metrics to judge the best pre-trained CNN architecture.

The highest testing accuracy using the HMB-AHCR-DLGA approach was 92.88% which was achieved by the third VGG16 experiment using AdaMax weights optimizer, 64 batch size value, 0% dropout ratio, and 60% TF model learning ratio. It could be considered as the best combination using the VGG16 pre-trained CNN model.

Even though there is a lot of work for the recognition of handwritten characters for English, Chinese, French, and some other Indian languages, only a little work was done for the Arabic language. Due to the lack of research work in the area, there was a big challenge to get a dataset for the Arabic language.

In future studies, the plan is to (1) work on improving the performance of the handwritten text segmentation phase, (2) work on Arabic handwritten word recognition using deep learning architectures, (3) apply the system on different available datasets, and (4) try different machines learning or deep learning approaches.