1 Introduction

Arabic script is an alphabet written from right to left that uses two types of symbols to write words: letters and diacritics. Each letter consists of two parts, a letter form and a letter mark. The letter form is the essential component of each letter, and there are 19 letter forms in total. The letter mark may be dots, a short Kaf, or a Hamza. Hamza serves both as a letter form and as a letter mark that appears with other letter forms, and the Madda letter mark is a variant of Hamza; Fig. 1 shows the Arabic script. Diacritics, the second type of symbol, are not essential in writing, unlike the letters. There are three types of diacritics: vowels, which are Fatha, Damma, and Kasra, together with Sukun, which marks the absence of a vowel; nunation, which doubles the corresponding short vowel to give two Fathas, two Dammas, or two Kasras; and Shadda, which marks consonant doubling. Figure 2 shows the types of Arabic diacritics [1,2,3]. Handwritten Arabic script poses many challenges because of its unique nature: it is cursive, adjacent letters overlap, each character has more than one shape, and there are other difficulties as well [4]. To build a recognition system for handwritten Arabic words, researchers need a real and substantial database [5, 6]; hence, this work contributes an Arabic database to help researchers in this field and to overcome the limitations of previous databases. The databases are developed using an algorithm that captures strokes to facilitate the recognition of Arabic characters [7]. The proposed databases (AOLAH) follow the typical format of online handwritten data, which is a sequence of coordinate points of the moving pen tip. The connected parts of the pen trace, during which the pen tip touches the writing surface, are called strokes [8,9,10].

Fig. 1. Arabic script.

The remainder of this paper is organized as follows. Existing Arabic databases are discussed in Sect. 2. Section 3 presents the proposed AOLAH databases for Arabic online handwriting letters and strokes. Sections 4 and 5 describe the proposed Arabic online handwriting recognition algorithm and the optimum proposed recognition model. Conclusions are given in Sect. 6.

2 Present Databases for Handwritten Arabic Letters

This section describes the main databases used in online Arabic handwriting recognition research. Table 1 summarizes these databases.

2.1 LMCA (2008) [11, 12]

LMCA is a dual on/off-line Arabic handwriting database; the abbreviation comes from the French phrase Lettres, Mots et Chiffres Arabe. The database contains 30,000 digits, 100,000 Arabic letters, and 500 Arabic words contributed by 55 participants. It was developed by the REGIM laboratory (REsearch Group on Intelligent Machines). Both online and offline handwritten characters and words were considered. The LMCA database is limited to a small set of words, and the letters were collected in isolation, that is, they were not segmented from cursive text.

2.2 OHASD (2010) [13]

This database is considered the first online Arabic sentence database handwritten on a tablet PC. The final version of the dataset is composed of 154 paragraphs selected from public daily news and written by 48 writers, with a total of 3,825 words and 19,467 characters after excluding erratic or illegible handwriting. This database has a limited lexicon, limited data, and a limited number of writers.

Table 1. Main present online databases.
Fig. 2. Arabic diacritics types.

2.3 ADAB (2011) [14, 15]

This database was developed by the Institut fuer Nachrichtentechnik and the Research Group on Intelligent Machines (REGIM). It contains online samples of 937 Tunisian city names, amounting to 33,164 Arabic words and 174,690 characters written by approximately 166 writers. It has been used in competitions. The data are available only as isolated word samples, which do not represent natural Arabic online handwriting, and no segmentation of the words into letters is provided.

2.4 ALTEC (2014) [16]

This database was produced by the Arabic Language Technology Center (ALTEC) for online Arabic text with a large lexicon. It consists of 152,680 samples of 39,945 unique words, including 325,477 samples of 14,740 unique parts of a word. The database was collected from approximately 1,000 writers; the samples are complete sentences that include digits and punctuation marks, and the collected data are available at the sentence, word, and character levels. The main drawback of this database is that the data were collected with a device that digitally captures and stores everything written or drawn with ink on ordinary paper.

2.5 QHW (2014) [17]

The Quranic handwritten words (QHW) database covers the most commonly used words in the Holy Quran; the handwritten words were chosen as the words most frequently repeated in the Holy Quran. The initial version of the QHW database includes 120 handwritten words, divided equally into two sets and written by 200 writers in total. The QHW database contains 12,000 samples, including more than 42,800 characters and 23,300 sub-words. It is a closed-vocabulary database and has samples of only a limited number of words.

2.6 Online-KHATT (2018) [18]

The Online-KHATT database contains more than 80,000 Arabic words written by 623 writers, with approximately 801,421 characters, using a source text that covers several domains to ensure a wide range of topics. Online-KHATT may be considered the largest Arabic online text database in terms of the number of lines written with electronic pens using natural Arabic text; however, it does not treat characters on the basis of their strokes.

3 Proposed AOLAH Databases for Arabic Online Handwriting Letters and Strokes

Because of the drawbacks of the previous databases, there is an essential need for databases that overcome them, and this work is a first step in that direction. The proposed Arabic online handwriting recognition algorithm used in collecting the databases depends mainly on the idea of collecting strokes as separate units, since the stroke is the basic building block of any word. To capture strokes, we developed an algorithm written in MATLAB. This algorithm provides a GUI that displays the data collected from pen movements. These pen movements were simulated with a mouse: pen down is simulated by pressing the left mouse button, pen movement by holding the left button while writing, and pen up by releasing the left button. The input pen movements are collected as a sequence of points and then stored in a text file. Storing the text file retains the original pen movements, which are required at later stages of recognition, beginning with preprocessing [19, 20]. Furthermore, these text files may also be used to verify the shape of the input stroke with any application that can visualize data, such as Microsoft Excel. Figure 3 shows the graphical user interface of the application developed to collect the databases, with the Arabic character zha written in three strokes. A screenshot of the data stored in the text file for this character is shown in Fig. 4, where the beginning and end of each stroke are marked in the table.
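The pen-down, pen-move, and pen-up convention described above can be illustrated with a short sketch. The original tool was written in MATLAB; Python is used here only for illustration, and the event names, the tuple layout, and the `events_to_strokes` helper are assumptions, not the format of the AOLAH text files.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Event = Tuple[str, float, float]  # (kind, x, y); kind is "down", "move", or "up" (assumed names)


def events_to_strokes(events: List[Event]) -> List[List[Point]]:
    """Group a flat log of pen events into strokes.

    A stroke starts at a "down" event, grows with every "move" event
    received while the pen is down, and ends at the next "up" event.
    Events that arrive while the pen is up are ignored.
    """
    strokes: List[List[Point]] = []
    current: List[Point] = []
    pen_is_down = False
    for kind, x, y in events:
        if kind == "down":
            current = [(x, y)]
            pen_is_down = True
        elif kind == "move" and pen_is_down:
            current.append((x, y))
        elif kind == "up" and pen_is_down:
            current.append((x, y))
            strokes.append(current)
            current = []
            pen_is_down = False
    return strokes


# Hypothetical event log with two strokes and one stray move between them.
log = [
    ("down", 0, 0), ("move", 1, 1), ("up", 2, 2),
    ("move", 5, 5),
    ("down", 3, 3), ("up", 3, 4),
]
strokes = events_to_strokes(log)
```

A character written in several strokes, such as the three-stroke zha mentioned above, would then appear as a list of three point sequences.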

The proposed AOLAH databases are contributions from the Faculty of Engineering, Aswan University, to help researchers in the field of online handwriting recognition build a powerful system for recognizing Arabic handwritten script. AOLAH stands for Aswan On-Line Arabic Handwritten, where Aswan is a small, beautiful city located in the south of Egypt. The word On-Line in the name means that the data are collected at the same time as they are written. The word Arabic is used because these databases are collected only for Arabic characters, and the word Handwritten is used because they are written by the natural human hand.

Fig. 3. GUI for the proposed data collection.

Fig. 4. Sample of data collected in csv file.

To collect the data, we were helped by volunteer students of the Faculty of Engineering, Aswan University, aged from 18 to 20 years.

To make data collection easier for the volunteers, we prepared a collection form listing all the steps the students needed to follow, and we did not impose any constraints on writing style. The instructions included creating a folder for each volunteer and writing the 28 characters of the Arabic script using the GUI. A total of 97 volunteers participated. All files were reviewed to determine which were acceptable for the database. A total of 2,520 files were accepted from the 97 volunteers, representing 90 files for each character after excluding unaccepted files. A second database was derived from the accepted database by extracting strokes from the characters. Seventeen strokes were separated from the 28 characters, and a database of 1,710 files representing strokes was created; the selected stroke shapes and their IDs are shown in Table 2. Because we used Aswan University resources to collect the databases, we have asked the university to make these databases available for free.

Fig. 5. Stages of online recognition system.

Fig. 6. System used in verifying the databases: (a) training and validation phase, (b) testing phase.

Table 2. Arabic characters strokes with IDs.

To verify the collected databases, we use them to build a recognition system. Most online recognition systems follow the typical structure of pattern recognition systems, which basically consists of five major stages: data collection, preprocessing, feature extraction, recognition, and postprocessing, as illustrated in Fig. 5 [21,22,23,24]. Other researchers describe the recognition system as comprising two stages, training and testing. In the training stage, the data are refined, their distinctive features are extracted, similar symbols are merged (clustering), and representatives of their features are stored as training samples; in the test stage, the features of a test sample are matched against the stored features for classification. This recognition system is used to verify our databases, and its block diagram is shown in Fig. 6 [25].

4 Proposed Arabic Online Handwriting Recognition Algorithm

4.1 Preprocessing Stage

According to the model, the first stage in the proposed recognition system is preprocessing. Preprocessing is needed to remove variations present in a stroke captured by a tablet or smartphone. These variations appear mainly as differences in size and slant, unwanted sharp edges, missing points, and so on, so a preprocessing stage is needed after data collection. The proposed algorithm applies five preprocessing phases in sequence after data collection: resizing and centering, interpolation of missing points, smoothing, slant correction, and resampling of points [26,27,28].

4.1.1 Resizing and Centering

Resizing and centering the stroke is necessary before the stroke can be recognized. This is done by assuming a frame of fixed size, resizing the stroke to fit that frame, and then moving the stroke to the center point of the frame.
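A minimal sketch of this phase is given below, assuming a square frame and uniform scaling that preserves the aspect ratio of the stroke. The frame size of 100 units and the function name `resize_and_center` are illustrative assumptions; the source does not state the frame dimensions. Python is used for illustration, whereas the original implementation is in MATLAB.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def resize_and_center(points: List[Point], frame_size: float = 100.0) -> List[Point]:
    """Scale a stroke to fit a square frame and move it to the frame center.

    The larger side of the bounding box is scaled to `frame_size`, and the
    center of the bounding box is moved to the center of the frame.
    """
    if not points:
        return []
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    extent = max(max_x - min_x, max_y - min_y)
    # A stroke with zero extent (a single dot) is only translated, not scaled.
    scale = frame_size / extent if extent > 0 else 1.0
    center_x = (min_x + max_x) / 2.0
    center_y = (min_y + max_y) / 2.0
    half = frame_size / 2.0
    return [((x - center_x) * scale + half, (y - center_y) * scale + half)
            for x, y in points]


# Hypothetical stroke: 10 units wide and 5 units tall.
resized = resize_and_center([(0.0, 0.0), (10.0, 0.0), (10.0, 5.0)])
```

Uniform scaling is chosen here so that the shape of the stroke is not distorted; scaling each axis independently would be an alternative design.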

4.1.2 Interpolation

The interpolation phase is needed because a stroke written at high speed may have missing points. These missing points can be calculated using various interpolation techniques such as Bezier and B-spline. We chose piecewise Bezier interpolation because it interpolates points among a fixed number of points. In the piecewise technique, a set of four consecutive points is used to obtain one Bezier curve, and the next set of four points gives the next Bezier curve [29]. The pseudocode of the interpolation phase is shown in Fig. 7.
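The piecewise idea can be sketched as follows. This is not the pseudocode of Fig. 7; it is an illustration under stated assumptions: each group of four consecutive points is treated as the control points of a cubic Bezier curve, consecutive groups share their end point, any leftover points are appended unchanged, and the number of samples per curve is an arbitrary parameter. Python is used for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve with control points p0..p3 at t in [0, 1]."""
    u = 1.0 - t
    b0, b1, b2, b3 = u * u * u, 3 * u * u * t, 3 * u * t * t, t * t * t
    return (b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0],
            b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1])


def piecewise_bezier(points: List[Point], samples_per_segment: int = 8) -> List[Point]:
    """Densify a stroke with one cubic Bezier curve per group of four points.

    Groups are taken at indices 0-3, 3-6, 6-9, ... so that neighbouring
    curves meet at a shared point (an assumption of this sketch).
    """
    if len(points) < 4:
        return list(points)
    out = [points[0]]
    i = 0
    while i + 3 < len(points):
        p0, p1, p2, p3 = points[i:i + 4]
        for s in range(1, samples_per_segment + 1):
            out.append(cubic_bezier(p0, p1, p2, p3, s / samples_per_segment))
        i += 3
    # Points that do not fill a complete group of four are kept as they are.
    out.extend(points[i + 1:])
    return out


# Four collinear, evenly spaced points give a straight Bezier curve.
dense = piecewise_bezier([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
                         samples_per_segment=3)
```

Note that a Bezier curve passes through the first and last control points of each group but in general only approaches the two inner points.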

4.1.3 Smoothing

Flickers exist in handwriting because of individual handwriting style and the hardware used. They can be removed by replacing each point in the list with the mean value of its k neighbors and the angle subtended at that position from each end; this phase is called smoothing.
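The neighbor-averaging part of this phase can be sketched as a moving average. This is a simplification: the angle term mentioned above is omitted, the end points are left unchanged so that the start and end of the stroke are preserved, and the window half-width `k` is an assumed parameter. Python is used for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def smooth_stroke(points: List[Point], k: int = 2) -> List[Point]:
    """Replace each interior point by the mean of itself and up to k
    neighbours on each side; the first and last points are kept as they are.
    """
    n = len(points)
    if n < 3 or k < 1:
        return list(points)
    out = [points[0]]
    for i in range(1, n - 1):
        # The window is clipped at the ends of the stroke.
        window = points[max(0, i - k):min(n, i + k + 1)]
        mean_x = sum(p[0] for p in window) / len(window)
        mean_y = sum(p[1] for p in window) / len(window)
        out.append((mean_x, mean_y))
    out.append(points[-1])
    return out


# A single spike at the middle point is flattened toward its neighbours.
smoothed = smooth_stroke([(0.0, 0.0), (1.0, 3.0), (2.0, 0.0)], k=1)
```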

4.1.4 Slant Correction

Slant correction is required to correct the shape of the input handwritten character, as most writers' handwriting is slanted to the left or right. Slant correction for a stroke is complex because no baseline can be assumed; in the case of a single stroke, no bottom-line marks can be made. For this reason, the chain code estimation method by Yimei [30] has been applied for slant correction of Arabic strokes.

4.1.5 Resampling

Because of variations in writing speed, the acquired points are not distributed evenly along the stroke trajectory. Resampling is used to obtain a sequence of points that are almost equidistant. Besides removing these variations, this step is essential because it reduces the number of points in a stroke to a fixed value. After resampling, the data are significantly reduced, and the irregularly placed data points that create jitter on the stroke trajectory are removed. This makes the resampling step very useful for noise elimination as well as data reduction. In this phase, new data points are calculated on the basis of the original points in the list. After this phase, only 64 equidistant points remain in the stroke; these 64 points are of great importance in the next step of the recognition system, feature extraction. Figure 8 illustrates the five phases of preprocessing after data collection.
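One common way to obtain equidistant points is linear interpolation along the cumulative arc length of the stroke. The sketch below follows that approach and returns 64 points by default, as in the text; the choice of arc-length interpolation and the function name `resample` are assumptions, since the source does not specify how the new points are computed. Python is used for illustration only.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def resample(points: List[Point], n: int = 64) -> List[Point]:
    """Return n points spaced equally along the polyline through `points`."""
    if not points or n < 1:
        return []
    # Cumulative arc length at every original point.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    if total == 0.0 or n == 1:
        return [points[0]] * n
    out = []
    seg = 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = (target - cum[seg]) / seg_len if seg_len > 0 else 0.0
        x0, y0 = points[seg]
        x1, y1 = points[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


# Unevenly spaced input on a straight line, resampled to 11 points.
even = resample([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)], n=11)
```

Because every output point lies on the original polyline, the first and last points of the stroke are preserved.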

Fig. 7. Algorithm for interpolation.

Fig. 8. Phases of preprocessing of an Arabic stroke.

4.2 Feature Extraction Stage

The feature extraction stage is one of the most important stages in online handwritten character recognition, and selecting a feature extraction technique is an important task because the efficiency of any online handwriting recognition system relies heavily on the features given as input to the classifier. There is no standard strategy for extracting features: features that provide good results for one script may not provide good results for other scripts [31,32,33].

In the present study, we use two different techniques for feature extraction. The first simply rearranges the preprocessed points without applying any transformation, as shown in Fig. 9. The second applies the two-dimensional discrete Fourier transform (2D-DFT), Eq. 1, to the rearranged preprocessed points of the input stroke [34].

$$F\left[k,l\right]=\frac{1}{MN}\sum_{x=0}^{M-1}\sum_{y=0}^{N-1}f\left[x,y\right]\,{e}^{-2\pi j\left(\frac{kx}{M}+\frac{ly}{N}\right)}$$
(1)

To reduce the number of operations, we used the fast Fourier transform (FFT) to compute the DFT. Applying the 2D-DFT yields complex numbers as output. We ran experiments using the real-part coefficients and the imaginary-part coefficients of these complex numbers as features; the features are stored in a file, called the feature file, which is taken as input to the classifier.
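Both feature extraction techniques can be sketched in a few lines. Several points here are assumptions: the 64 preprocessed points are flattened in interleaved order (x1, y1, x2, y2, ...), which may differ from the arrangement in Fig. 9, and, because the input is then a single row (M = 1), Eq. 1 reduces to a one-dimensional DFT along that row. A direct DFT is used so that the example needs no external library; an FFT produces the same coefficients faster. Python is used for illustration only.

```python
import cmath
from typing import List, Tuple

Point = Tuple[float, float]


def flatten_points(points: List[Point]) -> List[float]:
    """Rearrange points into a single row: x1, y1, x2, y2, ... (assumed order)."""
    row: List[float] = []
    for x, y in points:
        row.extend((x, y))
    return row


def dft_row(row: List[float]) -> List[complex]:
    """Direct DFT of one row with the 1/N scaling of Eq. 1 (case M = 1)."""
    n = len(row)
    coeffs = []
    for l in range(n):
        acc = 0j
        for y, value in enumerate(row):
            acc += value * cmath.exp(-2j * cmath.pi * l * y / n)
        coeffs.append(acc / n)
    return coeffs


def fft_features(points: List[Point]) -> Tuple[List[float], List[float]]:
    """Return (real-part features, imaginary-part features) of a stroke."""
    coeffs = dft_row(flatten_points(points))
    return [c.real for c in coeffs], [c.imag for c in coeffs]


# A hypothetical stroke of 64 identical points: only the l = 0 coefficient is non-zero.
stroke = [(1.0, 1.0)] * 64
features_plain = flatten_points(stroke)        # first technique: 128 raw features
features_real, features_imag = fft_features(stroke)  # second technique
```

With 64 points per stroke, each of the three feature vectors has 128 entries.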

4.3 Classification Stage

In machine learning and statistics, classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known [35, 36].

Fig. 9. Preprocessed points rearranged in a single row.

To evaluate the model after classification, k-fold cross-validation is used: the data are divided into k parts, of which k-1 parts are used for training and the remaining part is used for testing. Each observation in the data sample is assigned to one group and stays in that group for the duration of the procedure. This means that each sample is used in the hold-out set once and used to train the model k-1 times [37, 38].
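The partitioning behind k-fold cross-validation can be sketched as an index split. The value k = 5 in the example is arbitrary, since the source does not state which k was used, and the helper name `k_fold_indices` is hypothetical. The sketch does not shuffle the samples; in practice the indices are usually shuffled first so that every fold contains all classes. Python is used for illustration only.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split sample indices into k folds.

    Returns a list of (train_indices, held_out_indices) pairs. Every index
    appears in exactly one held-out set and in k-1 training sets.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # The first n_samples % k folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, held_out))
        start += size
    return folds


# Ten samples split into five folds of two held-out samples each.
folds = k_fold_indices(10, 5)
```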

The MATLAB Classification Learner application was used to train models to classify the data. We used this application to perform automated training to search for the best classification model type, including decision trees, support vector machines, nearest neighbors, and ensemble classification. We performed supervised machine learning by supplying a known set of input data, which is our collected database, and known responses to the data, which are the character stroke IDs. We used the data to train a model that generates predictions for the response to new data [39,40,41].

Seven experiments were conducted to find the optimum accuracy, training time, and prediction speed:

  • Without applying FFT.

  • Real-part coefficients obtained after applying FFT used as features, with the feature file taken as input to the MATLAB Classification Learner app.

  • Imaginary-part coefficients as features.

  • Real-part coefficients normalized to 15.

  • Imaginary-part coefficients normalized to 15.

  • Real-part coefficients normalized to 100.

  • Imaginary-part coefficients normalized to 100.

5 Optimum Proposed Recognition Model

Here we compare all the experiments in order to decide which model to use in our recognition system. The comparison is made in terms of accuracy and prediction speed because these are the parameters that matter in our recognition system; training time is less important because training is done only once and is not needed during recognition. First, the six experiments that applied FFT are compared with one another, since they share the same transformation of the preprocessed points; the best of them is then compared against the remaining experiment. The first comparison indicates that, for all experiments, tree classifiers give high prediction speeds with lower accuracies, SVM classifiers give lower accuracies with medium prediction speeds, KNN classifiers give low accuracies with medium prediction speeds, and Ensemble classifiers give higher accuracies with medium or low prediction speeds.

The best results were mostly obtained from experiment 2, which uses the real-part coefficients of the FFT. Within experiment 2, the best classifiers are the SVM and Ensemble classifiers: the Ensemble (Subspace KNN) classifier gives the highest accuracy (75.6%) but a low prediction speed (360 obs/sec), whereas the Quadratic SVM gives a similar accuracy (74.4%) with a better prediction speed (1900 obs/sec). The second comparison, between experiment 2 (real-part coefficients of the FFT) and experiment 1 (without FFT), is shown in Table 3. This comparison indicates that experiment 2 has better prediction speeds for almost all classifiers, whereas experiment 1 gives better accuracy. The highest accuracy in experiment 1 is obtained by the Quadratic SVM classifier (86.4%), with a prediction speed of 1600 obs/sec. Note that if another PC is used for the experiments, the accuracy of the models remains nearly the same, but the prediction speed and training time differ according to the PC specifications. For example, on a PC with an AMD A8-3870 CPU at 3.00 GHz and 12.0 GB of installed memory (RAM), the Quadratic SVM classifier gave an accuracy of 86.6% with a prediction speed of 530 obs/sec. From these comparisons, the optimum recognition model is the Quadratic SVM classifier from experiment 1, with an accuracy of 86.4%.

5.1 Testing of the Optimum Model

After creating classification models interactively in Classification Learner, we can export the optimum model to the workspace or build a standalone application, which can then be used to make predictions on new data. The application follows the stages used in the training phase: it collects data using the GUI, preprocesses the collected points and outputs a total of 64 points, extracts features by simply rearranging the preprocessed points into a single row of 128 features representing the entered stroke, and then uses the trained model structure to predict the ID of the entered stroke. Figure 10 shows the recognition of the Arabic handwritten cursive word ( ), pronounced “Mohamad”, which is written online letter by letter while the proposed recognition model predicts its characters. The word Mohamad in Arabic consists of four letters. In the stylus pen simulator, an Arabic stroke ( ) is written by hand, as shown on the left side of Fig. 10(a); the output of the code using the Quadratic SVM model was ID22, with a prediction time of 0.896426 s. We export the output ID to a text file by first converting it to the Arabic character with the identical shape. Thus, stroke ID22 is converted to the “meem” character ( ), as shown on the right side of Fig. 10(a). The same sequence is then repeated for the three remaining letters of the word Mohamad, as shown in Fig. 10(b) to Fig. 10(d). These show the prediction of the three Arabic strokes “hha” with ID13, “meem” with ID22, and “dal” with ID14, respectively.

6 Conclusion

This paper presented novel databases of Arabic handwritten characters and strokes. These databases focus only on Arabic handwritten characters in the Naskh style. Much work is still needed from researchers to supply the Arabic community with stroke databases of this kind; Ruqaa, Thuluth, and Diwani are some Arabic writing styles that need to be part of future databases. Furthermore, databases of the shapes of Arabic characters according to their position in the word are much needed. Moreover, databases of diacritics will also be of great importance for more advanced character recognition. More volunteers of different ages are needed to make a powerful database.

Table 3. Comparison between FFT based feature extraction and without applying FFT.

Fig. 10. The recognition of the Arabic handwritten cursive word ( ) “Mohamad” by the proposed model. In panels (a) to (d), the left side shows the GUI used to enter a stroke, and the right side shows the character predicted by the proposed recognition model and written to the text file.

Our study was based on machine learning with supervised learning: the database was collected with known stroke IDs, which is why classification was used; each stroke was given an ID, and the database was collected according to these IDs. The recognition workflow was to collect the data, preprocess the data, derive features from the preprocessed data, train models on the derived features, iterate to find the best model, and then integrate the optimum trained model into the recognition system.