1 Introduction

Automatic handwritten character recognition (AHCR) is important for a wide range of applications. Handwriting originates from a multitude of sources, such as images, paper documents, and touch screens [1], and there is growing demand for accurate handwriting recognition that works across these source types. AHCR refers to a system’s capability to recognize handwritten input images [2]. It uses character recognition technology to convert handwritten characters into their corresponding digital characters, thereby providing a method for automatically recognizing text in images. AHCR is considered a challenging task because handwriting varies considerably from person to person. Moreover, an individual writer’s handwriting can change significantly over time [3].

Over the last few decades, AHCR has been an active area of research, and many AHCR methods have been developed for different languages, most commonly Chinese [4, 5], English [6, 7], and French [8]. Arabic is among the most widely spoken languages worldwide, with more than 315 million native speakers, and Arabic character recognition has recently received research attention [9]. Recognizing Arabic characters poses a significant challenge in computer vision and pattern recognition because of the unique characteristics of the Arabic language, such as its distinct spelling, grammar, and pronunciation, compared to other languages [10]. Arabic comprises 28 characters and is typically written in a semi-cursive style from right to left, with the letters interconnected in a continuous flow. Arabic characters can take four distinct forms depending on their position within a word: beginning, middle, end, or standalone. Furthermore, the similarity in shape among Arabic letters presents an additional difficulty [11]. Table 1 presents the variations in Arabic letters depending on their position in words. For instance, the letters “ba,” “ta,” and “tha” remain noticeably similar across all four positional forms. Given this variability in character shape with context, automated handwriting recognition of Arabic characters is considerably more complex than that of many other languages.

Table 1 Twenty-eight different Arabic alphabet shapes

Advancements in deep learning have enabled convolutional neural networks (CNNs) to demonstrate an outstanding ability to identify handwritten characters in various languages, such as Latin, Chinese, Devanagari, and Malayalam [12, 13]. Researchers have enhanced CNN architectures to improve the recognition performance of handwritten characters [13, 14]. This enhancement typically involves fine-tuning CNN hyperparameters, selecting appropriate optimization algorithms [15], and utilizing a substantial training dataset [16, 17]. In this study, a new deep CNN model called DeepAHR was developed to recognize handwritten Arabic characters. The proposed DeepAHR was thoroughly tested on two public benchmark datasets: the Arabic handwritten characters dataset (AHCD) [18] and Hijaa [2]. The results and comparisons show that the method outperforms state-of-the-art methods. The main contributions of this study are as follows:

  • Reviewing state-of-the-art research in Arabic handwritten character recognition.

  • Developing an effective Arabic handwritten character recognition model based on a CNN.

  • Investigating and analyzing the impact of different regularization techniques and hyperparameters on the performance of the proposed CNN method.

  • Providing a comprehensive evaluation of the proposed method on two benchmark datasets and comparing it with state-of-the-art methods.

The remainder of this paper is organized as follows: Sect. 2 presents the related work, and Sect. 3 details the proposed method. Section 4 presents the experimental details, results, and discussion. The conclusions and future work are presented in Sect. 5.

2 Related work

Recently, researchers have developed various techniques to improve Arabic handwritten character recognition results based on CNNs. El-Sawy et al. [18] found that CNN methods outperformed other approaches for feature extraction and classification, particularly with large datasets. However, available handwritten Arabic datasets included only a limited number of images; therefore, the authors released AHCD, which was collected from 60 participants aged 19–40 years. They proposed a CNN-based model that achieved 94.9% accuracy on AHCD. Similarly, Altwaijry and Al-Turaiki [2] released the Hijaa dataset, containing samples produced by children aged 7–12 years. They introduced a CNN-based system for Arabic handwriting recognition and compared its performance with that of El-Sawy et al. [18]. The empirical results revealed that Altwaijry and Al-Turaiki’s model achieved accuracies of 97% and 88% on the AHCD and Hijaa datasets, respectively, outperforming the El-Sawy et al. model [18]. Balaha et al. [19] created a complex and extensive Arabic handwriting dataset known as HMBD. They proposed two CNN approaches, HMB1 and HMB2, employing different optimization, regularization, and dropout methods. HMB1 and HMB2 were evaluated on three datasets (AIA9k, CMATER, and HMBD) in 16 experiments; the best results were 98.4%, 97.3%, and 90.7% for AIA9k, CMATER, and HMBD, respectively. Furthermore, the study revealed that data augmentation helped reduce overfitting and increased accuracy.

Ahmed et al. [20] designed a CNN that employed dropout regularization and batch normalization layers to extract optimal features. To assess its effectiveness, the authors evaluated the model on six benchmark datasets: MADBase (digits), SUST-ALT (digits), CMATERDB (digits), SUST-ALT (characters), HACDB (characters), and SUST-ALT (names). The model achieved 99% accuracy; however, it was not evaluated on AHCD. Younis [21] built a CNN with three convolutional layers and a fully connected layer, with regularization against overfitting. Experimental results revealed that this approach achieved accuracies of 94.8% and 94.7% on the AIA9K and AHCD datasets, respectively. AlJarrah et al. [22] constructed a CNN model and examined the impact of data-augmentation techniques on its performance. Their results revealed that model accuracy on AHCD increased from 97.2% to 97.7% after applying data augmentation. Elleuch et al. [23] proposed a deep belief neural network (DBNN) for recognizing handwritten Arabic characters and words, achieving 97.9% accuracy on the HACDB dataset. Elagamy et al. [24] designed a customized CNN approach for handwritten Arabic character recognition, which achieved an accuracy of 98.54% on AHCD. Momeni and BabaAli [25] introduced two distinct transformer architectures, namely the transducer and the standard sequence-to-sequence model, and assessed their speed and accuracy on the KFUPM handwritten Arabic text (KHATT) dataset [26]. Similarly, in [27], a light encoder–decoder transformer approach was presented for handwritten text recognition, and in [28], an end-to-end method utilizing pre-trained image and text transformers was proposed for word-level text recognition.

Several studies have investigated transfer learning for Arabic handwritten character recognition. Alyahya et al. [29] studied the effect of the ResNet-18 architecture, training and evaluating the model on AHCD. The best accuracy, 98.3%, was achieved with a standard ResNet-18 model; variants combining ResNet-18 with one or two additional fully connected layers achieved 98.03% and 98.00%, respectively. Mudhsh et al. [30] proposed a VGG-16-based CNN, trained and evaluated on two benchmark datasets: HACDB for character recognition and MADBase for digit recognition. The model achieved accuracies of 97.32% on HACDB and 99.66% on MADBase. Al-Tani et al. [31] adopted the ResNet architecture for handwritten Arabic character recognition, achieving accuracies of 99.55%, 99.05%, and 99.8% on AHCD, AIA9K, and MADBase, respectively. Korichi et al. [32] performed various experiments with different CNN architectures, such as VGG-16 and ResNet, combined with regularization techniques such as data augmentation and dropout. According to their findings, handcrafted features were less effective than CNN-based methods.

3 Materials and methods

This section discusses the techniques and methods used to build the DeepAHR system for recognizing handwritten Arabic characters.

3.1 Dataset

Two recent and publicly available datasets were used in this study: AHCD [18] and Hijaa [2]. AHCD contains 16,800 handwritten letters gathered from 60 participants aged 19 to 40 years, 90% of whom were right-handed. AHCD has 28 Arabic class labels (i.e., from the letter “alef” to “yaa”), and each participant wrote each of the twenty-eight letters ten times. A sample of letters in AHCD is shown in Fig. 1. The dataset was partitioned into two sets: 80% of the characters formed the training set (13,440 images, 480 per class), and the remaining 20% formed the test set (3,360 images, 120 per class).

Fig. 1 Samples of Arabic characters in the training set for AHCD

The second dataset was the Hijaa dataset, the largest existing dataset for Arabic character recognition. It was collected from Arabic-speaking children aged 7 to 12 years and consists of 47,434 characters written by 591 participants. The dataset is partitioned into 29 files corresponding to the 28 Arabic letters (from “alef” to “yaa”) plus one file for the hamza. The letters were written in both isolated and connected forms, depending on their position: at the beginning, middle, or end of a word. A sample of letters from the Hijaa dataset is shown in Fig. 2.

Fig. 2 Samples of Arabic characters in the training set of the Hijaa dataset

3.2 Dataset preprocessing

Data preprocessing is a vital step in preparing data to best fit a machine learning model [33]. In this study, the images in AHCD were transposed, as all images were stored flipped. Figure 3a shows a subset of AHCD images without any modification, and Fig. 3b shows the same images after transposing. The images in both datasets were normalized by dividing pixel values by 255 and were converted into NumPy arrays to reduce memory usage and increase training speed. Subsequently, data augmentation techniques, such as zooming and rotation, were applied to increase the dataset size, mitigate overfitting, and make the model more robust [34]. The data augmentation parameters are listed in Table 2, and a sketch of the pipeline is given after Table 2.

Fig. 3 Sample AHCD letters: a before transposing, b after transposing

Table 2 Data augmentation techniques with parameter values
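The following is a minimal sketch of this preprocessing and augmentation pipeline, using placeholder arrays in place of the raw images and labels; the rotation and zoom values below are illustrative assumptions, and the actual values are those in Table 2.

```python
# Sketch of the AHCD preprocessing and augmentation pipeline (assumed values).
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

x_raw = np.random.randint(0, 256, size=(100, 32, 32))  # placeholder raw images (0-255)
y = np.eye(28)[np.random.randint(0, 28, size=100)]     # placeholder one-hot labels

x = np.transpose(x_raw, (0, 2, 1))                     # transpose the flipped AHCD images
x = (x / 255.0).astype("float32")                      # normalize pixel values to [0, 1]
x = x.reshape(-1, 32, 32, 1)                           # add the channel dimension

# Rotation and zoom parameters here are assumptions; see Table 2 for the
# values actually used.
augmenter = ImageDataGenerator(rotation_range=10, zoom_range=0.1)
train_flow = augmenter.flow(x, y, batch_size=32)       # yields augmented batches
```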

3.3 Proposed DeepAHR model

CNNs have proven to be powerful models for automatic feature extraction and have become the state of the art in various image classification problems owing to their high performance in recognizing image patterns. CNNs are a type of deep learning model specifically tailored for analyzing data with a grid-like structure, such as images. They draw inspiration from the structure of the visual cortex in animals [35] and are designed to autonomously learn hierarchical spatial features, progressing from basic to more complex patterns. CNNs are essentially mathematical frameworks comprising three key types of layers: convolutional, pooling, and fully connected layers, as well as an output layer. The convolutional and pooling layers extract features, whereas the fully connected layers translate these extracted features into a final output, such as a classification [36, 37]. In CNNs, an image is convolved with filters in the convolutional layers to produce feature maps, which are forwarded to succeeding layers to extract increasingly complex features from the input image.

This study proposes a new CNN model called DeepAHR, which is composed of five convolution layers and two fully connected layers. Furthermore, there are activation, pooling, and batch normalization layers between the convolutional and fully connected layers, as shown in Fig. 4. In this section, the proposed DeepAHR method is discussed.

Fig. 4 Proposed DeepAHR model

3.3.1 Input layer

The input layer of a CNN takes an \(H \times W \times D\) image, where H is the height, W the width, and D the depth (number of channels). Our model’s input was a \(32 \times 32 \times 1\) gray-scale image representing an Arabic character. In a CNN, the input layer only specifies the shape of the image and performs no feature extraction; it simply feeds the images into the hidden layers.

3.3.2 Hidden layers

In a CNN, the hidden layers comprise convolutional, pooling, and fully connected layers. The convolutional layers perform feature extraction on the input image, identifying significant information that assists classification, such as edges, corners, and endpoints. Our model comprises five convolutional layers, each using a leaky rectified linear unit (LeakyReLU) as the activation function. LeakyReLU is based on the popular nonlinear ReLU activation function [38]; however, it adopts a small slope for negative inputs instead of ReLU’s flat (zero) slope [39].
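LeakyReLU can be written as follows, where \(\alpha\) is a small positive slope applied to negative inputs (the exact value is an implementation choice not specified here; 0.01 is common, and Keras defaults to 0.3):

$$\begin{aligned} \mathrm {LeakyReLU}(x)= {\left\{ \begin{array}{ll} x, &{} x \ge 0\\ \alpha x, &{} x < 0 \end{array}\right. } \end{aligned}$$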

Each convolutional layer used a small \(3 \times 3\) kernel, which suited the \(32 \times 32 \times 1\) input size for this classification task. All convolutional layers employ multiple kernels to generate feature maps that capture low- and high-level features, such as edges, endpoints, and vertices, from the input image. Furthermore, we used zero padding in each convolutional layer to prevent the loss of information around image borders and to avoid shrinking the feature maps. We also set the stride of the convolution along both the height and width of the image to one.

The first convolutional layer used 32 filters, a stride of \(s = 1\), “same” zero padding (of size 1 for a \(3 \times 3\) kernel), and a \(3 \times 3\) kernel, yielding an output shape of \(32 \times 32 \times 32\); the spatial output size follows the standard convolution output formula given below, and the output depth equals the number of filters. The activation size of a layer is the product of its output dimensions, giving the first layer an activation size of 32,768 elements. Table 3 lists the activation sizes and output structures of all the layers.
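$$\begin{aligned} O=\frac{W+2P-K}{S}+1=\frac{32+2\times 1-3}{1}+1=32, \end{aligned}$$

where \(W\) is the input width (or height), \(P\) the padding, \(K\) the kernel size, and \(S\) the stride.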

Table 3 DeepAHR output structure, size, and trainable parameters of the layers

The next four layers are 2D convolutional layers, each followed by a max-pooling layer and a batch normalization layer. To preserve the network’s representational capacity as it deepened, we increased the number of feature maps after each pooling layer: the four subsequent 2D convolutional layers used 64, 128, 256, and 512 filters, respectively. Max-pooling outputs the maximum value from each patch of the feature map covered by the kernel. After each convolutional operation, max-pooling with a \(2\times 2\times 1\) window was employed to reduce the feature-map size, which lowered the network’s dimensionality; eliminating insignificant parameters also helped prevent overfitting and decreased computational complexity. Batch normalization standardizes the inputs to a layer by rescaling and recentering them, which stabilizes and accelerates training while decreasing the number of epochs required. As a result, all convolutional layers other than the first were followed by max-pooling and batch normalization layers. The output of the fifth convolutional layer was fed to a global average pooling layer, which averaged each feature map, and then into the fully connected (dense) layers.

The final step in the hidden layers of the proposed CNN consisted of two fully connected layers with 256 and 512 neurons, respectively, with all neurons connected to the activation units of the subsequent layer. The fully connected layers were followed by a dropout layer with a rate of 40%, selected experimentally, to reduce overfitting. A minimal sketch of the full architecture is given below.
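The following Keras sketch summarizes the DeepAHR architecture described above, including the softmax output layer detailed in the next subsection. The filter counts, kernel size, pooling, dense sizes, and 40% dropout follow the text; the exact placement of LeakyReLU relative to pooling and batch normalization, and the LeakyReLU slope, are assumptions, as they are not fully specified here.

```python
# A minimal sketch of the DeepAHR architecture (ordering details assumed).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_deepahr(num_classes: int = 28) -> tf.keras.Model:
    inputs = layers.Input(shape=(32, 32, 1))           # 32x32 gray-scale input
    x = layers.Conv2D(32, 3, padding="same")(inputs)   # first conv block (no pooling/BN)
    x = layers.LeakyReLU()(x)
    for filters in (64, 128, 256, 512):                # four conv blocks with pooling + BN
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.LeakyReLU()(x)
        x = layers.MaxPooling2D(pool_size=2)(x)        # halve spatial dimensions
        x = layers.BatchNormalization()(x)             # standardize layer inputs
    x = layers.GlobalAveragePooling2D()(x)             # average each feature map
    x = layers.Dense(256)(x)                           # two fully connected layers
    x = layers.LeakyReLU()(x)
    x = layers.Dense(512)(x)
    x = layers.LeakyReLU()(x)
    x = layers.Dropout(0.4)(x)                         # 40% dropout against overfitting
    outputs = layers.Dense(num_classes, activation="softmax")(x)  # 28 (AHCD) or 29 (Hijaa)
    return models.Model(inputs, outputs, name="DeepAHR")

model = build_deepahr(num_classes=28)
model.summary()  # layer shapes can be compared with Table 3
```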

3.3.3 Output layer

The output layer employs softmax as its activation function, which converts the extracted features into probabilities over the required classes. For AHCD, the output layer comprises 28 neurons, whereas for the Hijaa dataset it comprises 29 neurons.
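For a vector of raw output scores \(z\) over \(C\) classes (\(C=28\) for AHCD and \(C=29\) for Hijaa), softmax produces the class probabilities as

$$\begin{aligned} \mathrm {softmax}(z)_j=\frac{e^{z_j}}{\sum _{k=1}^{C} e^{z_k}}, \quad j=1,\ldots ,C. \end{aligned}$$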

4 Experimental results

4.1 Experimental setup

The implementation and evaluation of the proposed DeepAHR model were conducted using Keras deep learning environments with a TensorFlow backend and a GPU accelerator on Google Colab Pro.

4.1.1 Performance measures

The performance of our proposed model was evaluated using the following measures:

  • Accuracy: The ratio of correctly classified images to the total number of predicted images [40]. Equation (1) shows the formula used to compute accuracy:

    $$\begin{aligned} accuracy=\frac{TP+TN}{TP+TN+FP+FN}. \end{aligned}$$
    (1)
  • Recall: The proportion of correctly classified images among all images in class x [33], computed using Eq. (2):

    $$\begin{aligned} Recall=\frac{TP}{TP+FN}. \end{aligned}$$
    (2)
  • Precision: The proportion of images correctly classified as class x among all images classified as class x [40], computed using Eq. (3):

    $$\begin{aligned} Precision=\frac{TP}{TP+FP}. \end{aligned}$$
    (3)
  • F1-score: The harmonic mean of recall and precision [33], computed using Eq. (4):

    $$\begin{aligned} \text {F1-score}=\frac{2 \times Precision \times Recall}{Precision+Recall}. \end{aligned}$$
    (4)

where true positive (TP) denotes the number of images correctly classified as belonging to class x, false positive (FP) the number of images incorrectly classified as belonging to class x, false negative (FN) the number of images incorrectly classified as not belonging to class x, and true negative (TN) the number of images correctly classified as not belonging to class x.
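In practice, these measures can be computed directly from the true and predicted labels; the following is a minimal sketch using scikit-learn, with toy integer labels for illustration only.

```python
# Computing accuracy, precision, recall, and F1-score with scikit-learn.
from sklearn.metrics import accuracy_score, classification_report

y_true = [0, 1, 2, 2, 1]   # ground-truth class labels (toy example)
y_pred = [0, 1, 2, 1, 1]   # predicted class labels (toy example)

print("accuracy:", accuracy_score(y_true, y_pred))
# Per-class precision, recall, and F1-score, plus macro/weighted averages
print(classification_report(y_true, y_pred, digits=4))
```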

4.1.2 Training and parameters optimizations

Several attempts were made to tune the network configuration and select the model that best fits both the AHCD and Hijaa datasets. The optimized parameter values employed to enhance the performance of the CNN are listed in Table 4. Categorical cross-entropy, which is widely used to measure loss in multiclass label prediction, was employed as the loss function and is defined below. The model was tested with various numbers of epochs, and the optimal number was set to 100. A small batch size of 32 was used, which yielded suitable generalization.
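For a one-hot ground-truth vector \(y\) and a predicted probability vector \(\hat{y}\) over \(C\) classes, categorical cross-entropy is defined as

$$\begin{aligned} L=-\sum _{c=1}^{C} y_c \log \left( \hat{y}_c\right) . \end{aligned}$$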

Table 4 Values of parameters employed in the proposed framework during training

One of the key hyperparameters is the optimization algorithm. To determine the optimizer that best fits both datasets, five optimizers were tested: Adam, AdamW, Adagrad, Nadam, and RMSprop, each with three learning rates (lr): 0.001, 0.0001, and 0.00001. This yielded 15 experiments per dataset and 30 experiments overall; a sketch of this sweep is given below. The detailed results for the different optimization algorithms are listed in Table 5 and show that the best accuracy is achieved with the “Nadam” optimizer and a learning rate of 0.001 for both datasets. On AHCD, the proposed model achieved an overall test-set accuracy of 98.66%, recall of 98.66%, precision of 98.68%, and F1-score of 98.66%. On the Hijaa dataset, our model achieved an overall test accuracy of 88.24%, recall of 91.4%, precision of 91.4%, and F1-score of 91.5%.
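The following is a hedged sketch of this optimizer and learning-rate sweep, reusing the hypothetical build_deepahr() from the architecture sketch above; x_train, y_train, x_val, and y_val are assumed to be the preprocessed dataset arrays, and AdamW requires TensorFlow 2.11 or later.

```python
# Sweep over five optimizers and three learning rates (15 runs per dataset).
import tensorflow as tf

optimizers = {
    "Adam": tf.keras.optimizers.Adam,
    "AdamW": tf.keras.optimizers.AdamW,      # available in TF >= 2.11
    "Adagrad": tf.keras.optimizers.Adagrad,
    "Nadam": tf.keras.optimizers.Nadam,
    "RMSprop": tf.keras.optimizers.RMSprop,
}

results = {}
for name, opt_cls in optimizers.items():
    for lr in (1e-3, 1e-4, 1e-5):
        model = build_deepahr(num_classes=28)          # 29 for the Hijaa dataset
        model.compile(optimizer=opt_cls(learning_rate=lr),
                      loss="categorical_crossentropy", # multiclass loss
                      metrics=["accuracy"])
        history = model.fit(x_train, y_train, batch_size=32, epochs=100,
                            validation_data=(x_val, y_val), verbose=0)
        results[(name, lr)] = max(history.history["val_accuracy"])

print(max(results, key=results.get))                   # best (optimizer, lr) pair
```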

Table 5 Experimental results using different optimizers and learning rates

The proposed model was trained for 100 epochs. On AHCD, the model achieved over 99.05% training accuracy and 98.21% validation accuracy by the 35th epoch. On the Hijaa dataset, the model reached 94.6% training accuracy by the 62nd epoch, and the overall validation accuracy was 91.3%.

Fig. 5 Training progress for AHCD: a training and validation accuracy (higher is better), and b training and validation loss (lower is better)

Fig. 6 Training progress for the Hijaa dataset: a training and validation accuracy (higher is better), and b training and validation loss (lower is better)

Figures 5 and 6 show the training and validation accuracies with respect to epochs on the AHCD and Hijaa datasets, respectively. Figures 5a and 6a show that no overfitting was observed during training. The loss curves show that the loss drops sharply on AHCD (Fig. 5b), whereas some fluctuations occur on the Hijaa dataset (Fig. 6b).

4.2 Results and discussion

The classification report, which includes the overall performance measures and individual character values (Table 6), shows promising results for the DeepAHR model. Note that class numbers 0 to 27 correspond to the alphabet from “alef” (أ) to “yaa” (ي) for both datasets, whereas class number 28 corresponds to the “hamza” (ء) in the Hijaa dataset.

Table 6 Classification reports for the AHCD and Hijaa datasets

The overall accuracy of the proposed model for AHCD is 98.66%. The model was evaluated in terms of precision, recall, and F1-score. The average precision, recall, and F1-score were 98.68%, 98.66%, and 98.66%, respectively. For the Hijaa dataset, the overall model accuracy was 88.24%, and the average precision, recall, and F1-score were 91.4%, 91.4%, and 91.5%, respectively.

The results obtained from DeepAHR differed by class for both datasets. Characters 7 (dal “د”) and 8 (thal “ذ”) are more difficult to recognize in the Hijaa dataset than in AHCD. Figure 7 shows various forms that characters 7 (dal “د”) and 8 (thal “ذ”) can take, as many people write them very similarly to characters 5 (haa “ح”) and 6 (kha “خ”). Figure 7a shows how character 7 (dal “د”), when positioned at the disconnected end of a word, can be written similarly to character 5 (haa “ح”) at the beginning of an Arabic word, as shown in Fig. 7b. Figure 7c and d shows how characters 8 (thal “ذ”) and 6 (kha “خ”) can be written similarly.

Fig. 7 Different forms of letters (د), (ح), (ذ), and (خ) written similarly

Furthermore, characters 18 (gayn “غ”) and 19 (fa “ف”) are written similarly in the middle of Arabic words, as shown in Fig. 8. Additionally, Fig. 9 shows how characters 23 (mim “م”) and 17 (ayn “ع”) can be written similarly when positioned in the middle of a word.

Fig. 8 Letters (غ) and (ف) written similarly

Fig. 9 Letters (م) and (ع) written similarly

Character 24 (non “ن”) is also written similarly to characters 8 (thal “ذ”), 6 (kha “خ”), and 10 (zay “ز”) when positioned at the beginning or end of a word, as shown in Fig. 10. This is reflected in its metrics: non “ن” has an F1-score of 0.97 in AHCD, compared with 0.82 in the Hijaa dataset.

Fig. 10 Different ways of writing the letter non

4.3 Comparison with existing works

We evaluated our proposed methodology by comparing it with state-of-the-art approaches for recognizing handwritten Arabic characters on the AHCD and Hijaa datasets, as listed in Table 7. On AHCD, the DeepAHR model outperformed the models of El-Sawy et al. [18], Younis [21], Alyahya et al. [29], and Alheraki et al. [41]. On the Hijaa dataset, DeepAHR achieved better accuracy than the models of El-Sawy et al. [18], Younis [21], and Alyahya et al. [29], but not the model of Alheraki et al. [41]; however, DeepAHR outperformed the Alheraki et al. [41] model in terms of recall, precision, and F1-score. Notably, all methods performed considerably worse on the Hijaa dataset than on AHCD. This suggests that the Hijaa dataset is more challenging: it includes different forms of each character, both connected and isolated, which increases the similarity between character classes, and it was collected from children, whereas the Arabic characters in AHCD are isolated forms collected from adults.

Table 7 Comparison between our proposed model and state-of-the-art methods
Fig. 11 “DeepAHR” evaluation of Arabic letter images from AHCD

Fig. 12 “DeepAHR” evaluation of Arabic letter images from the Hijaa dataset

Furthermore, DeepAHR was tested on unseen letters from the test sets, producing remarkable results. In this step, eight images were randomly selected from each of the AHCD and Hijaa test sets, and DeepAHR printed the actual and predicted labels for the selected alphabet images, as shown in Figs. 11 and 12. A minimal sketch of this step is given below.
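The following sketch illustrates this evaluation step; model, x_test, and y_test (one-hot labels) are assumed to be the trained model and preprocessed test arrays from the earlier sketches.

```python
# Predict eight random test images and print actual vs. predicted labels.
import numpy as np

idx = np.random.choice(len(x_test), size=8, replace=False)  # 8 random test images
probs = model.predict(x_test[idx])
for actual, predicted in zip(np.argmax(y_test[idx], axis=1),
                             np.argmax(probs, axis=1)):
    print(f"actual: {actual}  predicted: {predicted}")
```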

5 Conclusion

In this study, we proposed a novel model, “DeepAHR,” for Arabic handwritten character recognition. The “DeepAHR” model is based on a CNN consisting of five convolutional layers and two fully connected layers. LeakyReLU was adopted as the activation function throughout the model, and batch normalization was used to stabilize and accelerate training. To determine the best optimizer, five optimizers were tested with three learning rates each, across 30 experiments on two public datasets: AHCD and Hijaa. The results show that the “Nadam” optimizer with a learning rate of 0.001 yields the best accuracy for both datasets. We applied data augmentation to address the scarcity of handwritten Arabic data and improve model generalization. DeepAHR achieved accuracies of 98.66% and 88.24% on AHCD and Hijaa, respectively.

An interesting future direction would be to evaluate the outcomes of alternative augmentation techniques such as generative adversarial networks and adversarial training when applied to an Arabic handwritten letter recognition dataset. In addition, it would be beneficial to create new datasets of different Arabic handwriting styles, such as Naskh, Reqaa, and Kufi. The DeepAHR model could be integrated into various applications, such as digital document processing and automated translation services, to enhance their efficiency in handling Arabic handwritten texts.