Online Recognition of Arabic Handwritten Words System Based on Alignments Matching Algorithm

Abuzaraida, Mustafa Ali; Zeki, Akram M.; Zeki, Ahmed M.

doi:10.1007/978-981-10-2772-7_5

Abstract

Arabic is the primary language in most parts of the Arab world. It is spoken as a first language by more than 280 million people and as a second language by more than 250 million. In the pattern recognition field, several studies have addressed the Arabic language through text- or voice-based methods. This paper presents an online handwritten Arabic text recognition system based on an alignment matching algorithm. The proposed system treats each handwritten word as one block instead of segmenting it into characters or strokes. The work started with collecting a dataset of 120 common Quranic words. These words then pass through three phases: preprocessing, feature extraction, and recognition. In the first phase, the words go through several steps that standardize them. In the second phase, the features of each word are extracted and saved in the system database. In the third phase, the system uses a matching technique to search the system database for the test word. The system was tested and reached an accuracy of up to 97 %, which compares favorably with previous work evaluated under the same criteria.

1 Introduction

Arabic is the primary language in most parts of the Arab world. It is spoken as a first language by more than 280 million people and as a second language by more than 250 million, which makes it one of the most widely spoken languages in the world. In 2010, Arabic was ranked among the five most commonly spoken languages worldwide [1]. In addition, many other languages around the world are similar to Arabic [2]. These languages follow Arabic in writing style and also in the way they are spoken. Many of them are the main language of Islamic countries, such as Persian in Iran; Jawi in Indonesia, Malaysia, and Brunei; Urdu in Pakistan; Pashto in Afghanistan; Bengali in Bangladesh; and others [3].

Arabic text differs from text in other languages in its formatting and in the way it is written. The written form of Arabic can be summarized as follows. The 28 Arabic characters are written in different shapes, and the position of a character within the word determines its shape. Each character has four shapes, defined as the starting, middle, end, and isolated shape [4].

An Arabic word must be written cursively, with its characters connected horizontally, to produce readable text [5, 6], as shown in Fig. 5.1.

The general objective of this research is to design an online handwritten Arabic text recognition system by using an alignment matching theory for recognizing handwritten Arabic words.

2 Architecture of the Proposed System

The proposed system follows the typical pattern recognition architecture, which contains four main phases: text acquisition, preprocessing, feature extraction, and recognition, as illustrated in Fig. 5.2 [8–10]. However, no segmentation step is included, and each handwritten word is processed as one block. This segmentation-free strategy can reduce processing time, avoids the problem of segmenting overlapped characters, and can improve recognition accuracy. Each phase of the system has one or more objectives that contribute to the goal of the system and to the overall recognition accuracy.

2.1 Data Collection Stage

The data collection stage is the initial step of any pattern recognition system and aims to obtain the raw data that are later used for training and testing [11]. In this stage, the text is handwritten on an interface device that converts it into time-stamped coordinates (x, y) of the stylus trajectory.

To collect the training and testing databases [12], a 1.5 GHz Core i3 Acer tablet was used. This computer has a touch screen on which the Quranic words can be acquired simply by writing normally with a special stylus, as illustrated in Fig. 5.3. Writing directly on the tablet in this way can minimize the noise and errors recorded on the tablet’s surface.

To collect the Quranic handwritten words, a platform with a graphical user interface (GUI) was designed in the Matlab environment. Collecting data from the tablet through this natural way of writing provides data that closely resemble normal handwriting and are comparatively smooth and clean. Figure 5.4 shows the data collection platform.

The next stage involves testing the system, where the same procedure as in training is performed. The Global Alignment Algorithm (GAA) is used to match every handwritten word in the testing dataset against the whole training dataset. The accuracy rate and processing time are recorded, and the three best-matching words are presented. All these steps are explained in detail in the following subsections.

2.2 Preprocessing Phase

The preprocessing phase is performed to minimize the noise that may occur in the handwritten text [13]. This phase includes several steps, each of which performs a specific function to filter the data. It can also improve the overall recognition rate. Preprocessing is considered one of the essential phases of online handwriting recognition, and many researchers have discussed its challenges for various scripts [4, 14, 15].

Generally, data for an online handwriting recognition system are collected by storing the stylus movements on the writing surface. These movements are distributed over various positions in the writing area of the acquisition platform and are joined from the first position (x₁, y₁) to the last (xₙ, yₙ) to present the appearance of the drawn text. Specifically, the stylus movements consist of three actions: pen-down, pen-move, and pen-up. The series of points is collected as the writer presses, moves, and lifts the stylus, in that order. The pen-move action records the movements of the stylus on the writing tablet from the starting point (x₁, y₁) to the last point (xₙ, yₙ), where n is the total number of points in the list of writing movements [16].
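To make the three stylus actions concrete, the following minimal Python sketch groups a stream of pen events into strokes and then into one word-level trajectory. The event format (action, x, y) and the function names events_to_strokes and word_trajectory are assumptions made for illustration; the actual data format of the acquisition platform is not described in the text.

```python
from typing import List, Tuple

Point = Tuple[float, float]
# Hypothetical event record: (action, x, y) with action in {"down", "move", "up"}.
Event = Tuple[str, float, float]


def events_to_strokes(events: List[Event]) -> List[List[Point]]:
    """Group pen events into strokes, one list of points per pen-down..pen-up run."""
    strokes: List[List[Point]] = []
    current: List[Point] = []
    for action, x, y in events:
        if action == "down":
            if current:  # previous stroke was never closed by a pen-up
                strokes.append(current)
            current = [(x, y)]
        elif action == "move":
            if current:  # ignore moves that arrive before any pen-down
                current.append((x, y))
        elif action == "up":
            if current:
                current.append((x, y))
                strokes.append(current)
                current = []
        else:
            raise ValueError(f"unknown pen action: {action!r}")
    if current:  # the stream ended without a final pen-up
        strokes.append(current)
    return strokes


def word_trajectory(strokes: List[List[Point]]) -> List[Point]:
    """Concatenate all strokes into the point list (x_1, y_1) ... (x_n, y_n)."""
    return [point for stroke in strokes for point in stroke]
```

Keeping the strokes separate until the last step is only one possible design; it leaves room to treat pen lifts differently later, while the flattened list matches the whole-word view used in the text.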

After the series of stylus movements is recorded, four essential steps are performed in the preprocessing phase of this online handwritten Arabic text recognition system. These preprocessing steps are discussed below:

Word Smoothing: In the proposed system, the Loess filter was used to smooth the handwritten curves. This filter performs a local regression on the points of the curve using weighted linear least squares and a second-degree polynomial model.

In this technique, each smoothed value is determined locally by the neighboring data points within the writing curve. The process is weighted: a regression weight function is defined for each data point. The local regression smoothing algorithm consists of the three steps below, applied to each data point [17].

Firstly, the regression weights for each data point in the writing curve are calculated with the tricube function given in the equation below.

$$ w_i = \left( 1 - \left| \frac{x - x_i}{d(x)} \right|^{3} \right)^{3} \tag{5.1} $$

where x is the predictor value associated with the response value to be smoothed, x_i are the nearest neighbors of x within the local neighborhood on the curve, and d(x) is the distance along the x-axis from x to the most distant predictor value within that neighborhood. The weights have two characteristics. First, the data point to be smoothed has the largest weight and the most influence on the fit. Second, data points outside the neighborhood have zero weight and no influence on the fit.

Secondly, a weighted linear least squares regression is performed. Here, for the Loess method, the regression is based on a second degree polynomial.

Finally, the smoothed value is given by the weighted regression at the predictor value of interest.
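The three steps can be sketched in plain Python as follows. This is an illustrative sketch and not the authors' Matlab implementation: the function names (tricube, loess), the default neighborhood size of 7 points, and the fallback to the raw value when a local fit is degenerate are all assumptions. The fit is written in terms of the offset dt from the point being smoothed, so the fitted value at that point is simply the constant coefficient.

```python
from typing import List, Sequence


def tricube(u: float) -> float:
    """Tricube weight (1 - |u|^3)^3 for |u| < 1, and 0 otherwise."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def _solve3(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular local regression system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = (m[i][3] - rest) / m[i][i]
    return x


def loess(t: Sequence[float], y: Sequence[float], span: int = 7) -> List[float]:
    """Smooth y(t) with a local quadratic fit over the `span` nearest points."""
    if len(t) != len(y):
        raise ValueError("t and y must have the same length")
    n = len(t)
    k = min(span, n)
    if k < 4:
        return [float(v) for v in y]  # too few points for a weighted quadratic fit
    smoothed = []
    for i in range(n):
        nearest = sorted(range(n), key=lambda j: abs(t[j] - t[i]))[:k]
        d = max(abs(t[j] - t[i]) for j in nearest)
        if d == 0:
            smoothed.append(float(y[i]))
            continue
        # Step 1: tricube regression weights, Eq. (5.1).
        # Step 2: weighted least squares for c0 + c1*dt + c2*dt^2 (normal equations).
        a = [[0.0] * 3 for _ in range(3)]
        b = [0.0] * 3
        for j in nearest:
            dt = t[j] - t[i]
            w = tricube(dt / d)
            basis = (1.0, dt, dt * dt)
            for r in range(3):
                b[r] += w * basis[r] * y[j]
                for c in range(3):
                    a[r][c] += w * basis[r] * basis[c]
        try:
            # Step 3: the fit evaluated at dt = 0 is the constant coefficient c0.
            smoothed.append(_solve3(a, b)[0])
        except ValueError:
            smoothed.append(float(y[i]))  # degenerate neighborhood: keep raw value
    return smoothed
```

For a pen trajectory, one possible use is to smooth the x and y coordinate lists separately against the sample index, for example loess(range(n), xs) and loess(range(n), ys). Whether the authors smoothed the coordinates in this way is not stated in the text.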

Word Simplification: The Douglas–Peucker algorithm [18] was adopted in this system to simplify the point sequence of the acquired handwritten word. The algorithm considers an imaginary line between the first and the last point of a set of curve points and finds the point farthest from this line segment. If that point, and therefore every in-between point, is closer to the line than a given distance known as the “tolerance”, all the in-between points are removed. If the farthest point lies farther from the line than the tolerance, it is kept and the curve is split at it into two parts, each of which is processed in the same way. Here, the algorithm was applied with a tolerance of 0.01, which was determined empirically.
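A compact recursive version of this procedure might look like the sketch below. The function names are illustrative, and the tolerance of 0.01 is taken from the text; the coordinate scale to which that tolerance applies is not stated, so it is assumed here that the points are expressed in the same units as the tolerance.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def _distance_to_line(p: Point, a: Point, b: Point) -> float:
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:  # a and b coincide: fall back to point distance
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / length


def douglas_peucker(points: Sequence[Point], tolerance: float = 0.01) -> List[Point]:
    """Simplify a polyline, keeping only points farther than `tolerance` from the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _distance_to_line(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        return [first, last]  # every in-between point is close enough: drop them
    # Otherwise keep the farthest point and simplify both halves recursively.
    left = douglas_peucker(points[: index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right
```

For example, with the default tolerance the curve [(0, 0), (1, 0.001), (2, 0), (3, 1), (4, 0)] loses only the nearly collinear point (1, 0.001).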

Word Size Normalization: The size of the acquired handwritten word depends on the way in which the writer moves the stylus over the designated writing area. Handwritten words are generally written in different sizes, particularly when the stylus moves near the border of the writing area, and this may cause ambiguity in the following phases. Size normalization is therefore a necessary step for recognizing any type of text. It is achieved by converting the acquired handwritten word to an assumed fixed-size format.
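The text does not give the target size or state whether the aspect ratio is preserved, so the following sketch makes both choices explicit as assumptions: it scales the bounding box of the word so that its longer side equals a hypothetical target length of 1.0, using one scale factor for both axes.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def normalize_size(points: Sequence[Point], target: float = 1.0) -> List[Point]:
    """Scale a word so the longer side of its bounding box equals `target`."""
    if not points:
        return []
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    extent = max(max(xs) - min_x, max(ys) - min_y)
    if extent == 0:  # a single dot: nothing to scale
        return [(0.0, 0.0) for _ in points]
    # One common factor for both axes keeps the shape of the word intact.
    return [((x - min_x) / extent * target, (y - min_y) / extent * target)
            for x, y in points]
```

Using a single factor avoids distorting long, flat words; scaling each axis independently to a fixed box would be an equally plausible reading of "fixed-size format".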

Centering of the Word: After the acquired handwritten word is resized, its coordinates are shifted to the centering axis (X0, Y0) so that all handwritten words share the same position relative to the origin. This step is undertaken using the following algorithm.
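The authors' own listing for this step is not available in this text. As a stand-in, the sketch below shows one common way to center a word: translate all points so that their centroid lands on (X0, Y0). The use of the centroid (rather than, say, the center of the bounding box) and the function name center_word are assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def center_word(points: Sequence[Point], x0: float = 0.0, y0: float = 0.0) -> List[Point]:
    """Translate the points so that their centroid coincides with (x0, y0)."""
    n = len(points)
    if n == 0:
        return []
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx + x0, y - cy + y0) for x, y in points]
```

Because the Freeman direction codes used in the next phase depend only on differences between consecutive points, a pure translation such as this one does not change them.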

After the four preprocessing steps, the points of the handwritten words are in a nearly standard format. In the proposed system, only a small number of simple steps were used, in order to limit complexity and minimize processing delay.

2.3 Features Extraction Phase

The proposed system takes the directions of the stylus trajectory as the main feature representing handwriting movements. Freeman’s code [19] is used to create the direction matrix for each handwritten word. It represents each directional movement of the stylus by one of 8 digits, numbered from 1 to 8, which stand for the eight main writing directions, as illustrated in Fig. 5.5.
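A direction sequence of this kind can be derived from consecutive points as in the sketch below. Since Fig. 5.5 is not reproduced here, the numbering convention is an assumption: code 1 is taken to be the positive x direction, and the codes increase counter-clockwise in 45-degree steps with the y-axis pointing up. The actual assignment in the paper may differ.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def freeman_codes(points: Sequence[Point]) -> List[int]:
    """Return one direction code in 1..8 per non-zero move between consecutive points.

    Assumed convention: 1 = +x, then counter-clockwise in 45-degree steps
    (2 = up-right, 3 = up, 4 = up-left, 5 = -x, 6 = down-left, 7 = down, 8 = down-right).
    """
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx == 0 and dy == 0:
            continue  # repeated point: no direction to encode
        angle = math.atan2(dy, dx)
        sector = round(angle / (math.pi / 4)) % 8  # nearest of the 8 directions
        codes.append(sector + 1)
    return codes
```

Each move is snapped to the nearest of the eight directions, so the length of the code sequence depends on how many points survive the simplification step.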

2.4 Recognition Phase

In this study, a matching algorithm called GAA was used as the recognition engine to recognize the Arabic handwritten words. After this phase, the system can select the matching word from its dataset. The GAA approach is explained in more detail below.

The most well-known and widely used methods for sequence alignment are the Local Alignment Algorithms and the Global Alignment Algorithms (GAAs). The GAA was developed by Saul B. Needleman and Christian D. Wunsch in 1970 [21]. Here, the alignment is carried out from the beginning to the end of the matched sequences to find the best possible alignment [22].

GAA is a dynamic programming algorithm for sequence alignment. Dynamic programming solves the original problem by dividing it into smaller subproblems and combining their solutions. The algorithm was originally described for the global alignment of nucleotide or protein sequences.

In general, dynamic programming is used to find the optimal alignment of two sequences. It finds the alignment in a quantitative way by assigning score values to matches and mismatches. The alignment is obtained by searching for the highest scores in the matrix [23]. The procedure of GAA is explained in detail as follows:

For matching two amino acid sequences, the algorithm is designed to find the highest score value of the sequences by building a two-dimensional matrix. The procedure consists of the following three steps:

Initializing a score matrix with the possible scores;
Filling the matrix with the maximum scores; and
Tracing back through the previous maximum scores to obtain the appropriate alignment.

In the proposed system, the GAA uses the default values of 0, 1, and 1 for gap penalty, mismatching penalty, and matching score, respectively.
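The three steps of the procedure, with the parameter values quoted above, can be sketched as follows. This is a generic Needleman–Wunsch implementation written for illustration, not the authors' code. It assumes that the gap and mismatch penalties are subtracted from the score and that the inputs are sequences of direction codes; the function name global_align is hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def global_align(
    a: Sequence, b: Sequence,
    gap_penalty: int = 0, mismatch_penalty: int = 1, match_score: int = 1,
) -> Tuple[int, List[Optional[object]], List[Optional[object]]]:
    """Return (score, aligned_a, aligned_b); None marks a gap in an aligned sequence."""
    rows, cols = len(a) + 1, len(b) + 1

    def sub(i: int, j: int) -> int:
        return match_score if a[i - 1] == b[j - 1] else -mismatch_penalty

    # Step 1: initialise the score matrix (first row and column are all gaps).
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] - gap_penalty
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] - gap_penalty

    # Step 2: fill each cell with the maximum of the three possible moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] - gap_penalty,
                              score[i][j - 1] - gap_penalty)

    # Step 3: trace back from the bottom-right corner to recover one alignment.
    i, j = len(a), len(b)
    out_a: List[Optional[object]] = []
    out_b: List[Optional[object]] = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] - gap_penalty:
            out_a.append(a[i - 1]); out_b.append(None); i -= 1
        else:
            out_a.append(None); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], out_a[::-1], out_b[::-1]
```

Under this reading of the parameters, a mismatch (score -1) is never better than opening two free gaps (score 0), so the optimal score equals the length of the longest common subsequence of the two code sequences. For instance, aligning [1, 2, 3, 4] with [1, 3, 4, 5] gives a score of 3.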

3 Presentation of the Results

For testing the system, 2400 handwritten Arabic words were fed into the system for recognition. These words were written by 40 writers who had no prior experience of writing with a stylus on a digital surface. Each writer was asked to write 60 of the words in the dataset, so that each of the 120 words was written 20 times in total. The phases of the system were applied to the testing dataset, and the results were then matched against the system’s database.

When the GAA matches a testing word against the system database, the system returns the word whose sequence gives the highest matching score. Furthermore, the matching algorithm was modified to return the three words with the highest matching scores.
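Ranking the database by alignment score and returning the three best candidates could be done roughly as in the sketch below. The database layout (a mapping from a word label to a single stored direction-code sequence), the labels, and the tie-breaking by label are all assumptions made for illustration; the real database presumably holds several samples per word, and the text does not say whether scores are normalized by sequence length.

```python
from typing import Dict, List, Sequence, Tuple


def alignment_score(a: Sequence[int], b: Sequence[int], gap_penalty: int = 0,
                    mismatch_penalty: int = 1, match_score: int = 1) -> int:
    """Global alignment score, computed with two rolling rows of the score matrix."""
    prev = [-gap_penalty * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap_penalty * i]
        for j in range(1, len(b) + 1):
            sub = match_score if a[i - 1] == b[j - 1] else -mismatch_penalty
            curr.append(max(prev[j - 1] + sub,
                            prev[j] - gap_penalty,
                            curr[j - 1] - gap_penalty))
        prev = curr
    return prev[-1]


def top_matches(query: Sequence[int], database: Dict[str, Sequence[int]],
                k: int = 3) -> List[Tuple[str, int]]:
    """Return the k (label, score) pairs with the highest raw alignment scores."""
    scored = [(label, alignment_score(query, codes)) for label, codes in database.items()]
    scored.sort(key=lambda item: (-item[1], item[0]))  # best score first, then label
    return scored[:k]


# Hypothetical toy database of direction-code sequences.
database = {"w1": [1, 2, 3, 4], "w2": [5, 5, 6, 6], "w3": [1, 2, 7, 8], "w4": [3, 3, 3, 3]}
best_three = top_matches([1, 2, 3, 4], database)
```

Only the score is needed for ranking, which is why this sketch keeps two rows of the matrix instead of the full matrix required for a traceback.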

4 Summary and Conclusion

The main goal of this research was to investigate how to build an online Arabic handwriting recognition system using a combination of techniques for each phase of the proposed system. The research also aimed to determine how well the proposed system resolves the complexities of Arabic handwriting recognition.

In this study, the database contained 12,000 handwritten words. These handwritten words included more than 42,800 characters and 23,300 sub-words written in different styles. A matching algorithm was used as the recognizer, with a global feature describing each word, and no segmentation step was included in the system.

The results of the experiment compare favorably with the accuracy rates reported in past works on online Arabic handwriting recognition. The system reached an accuracy of approximately 97 % in experiment I, with an average processing time of about 3.034 s.

References

1. World 100 Largest Language in 2010, http://www.ne.se/spr%C3%A5k/v%C3%A4rldens-100-st%C3%B6rsta-spr%C3%A5k
2. Abuzaraida, M.A., Zeki, A.M., Zeki, A.M.: Difficulties and challenges of recognizing Arabic text. In: Computer Applications: Theories and Applications. IIUM Press Malaysia, Kuala Lumpur (2011)
3. Versteegh, K., Eid, M., Elgibali, A., Woidich, M., Zaborski, A.: Encyclopedia of Arabic Language and Linguistics, vol. 1. Brill, Leiden/Boston (2006)
4. Al-A’ali, M., Ahmad, J.: Optical character recognition system for Arabic text using cursive multi-directional approach. J. Comput. Sci. 3, 549 (2007)
5. Zeki, A.M., Zakaria, M.S.: Challenges in recognizing Arabic characters. In: The National Conference for Computer. King Abdulaziz University, Saudi Arabia (2004)
6. Almohri, H., Gray, J.S., Alnajjar, H.: A real-time DSP-based optical character recognition system for isolated Arabic characters using the TI TMS320C6416T. In: Proceeding of the 2008 IAJC-IJME International Conference, pp. 25–35 (2008)
7. Abuzaraida, M.A., Zeki, A.M.: Segmentation techniques for online Arabic handwriting recognition: a survey. In: Proceeding of the International Conference on Information and Communication Technology for the Muslim World (ICT4M), pp. D37–D40. Jakarta, Indonesia (2010)
8. Hosny, I., Abdou, S., Fahmy, A.: Using advanced hidden Markov models for online Arabic handwriting recognition. In: Proceeding of the First Asian Conference on Pattern Recognition (ACPR), pp. 565–569 (2011)
9. Potrus, M.Y., Ngah, U.K., Sakim, H.A.M.: An effective segmentation method for single stroke online cursive Arabic words. In: Proceeding of the International Conference on Computer Applications and Industrial Electronics (ICCAIE 2010), pp. 217–221 (2010)
10. Abuzaraida, M.A., Zeki, A.M., Zeki, A.M.: Recognition techniques for online Arabic handwriting recognition systems. In: Proceeding of the International Conference on Advanced Computer Science Applications and Technologies (ACSAT2012). Kuala Lumpur, Malaysia (2012)
11. Plamondon, R., Srihari, S.N.: Online and off-line handwriting recognition: a comprehensive survey. IEEE Trans. Pattern Anal. Mach. Intell. 22, 63–84 (2000)
12. Abuzaraida, M.A., Zeki, A.M., Zeki, A.M.: Online database of Quranic handwritten words. J. Theor. Appl. Inf. Technol. 62 (2014)
13. Abuzaraida, M.A., Zeki, A.M., Zeki, A.M.: Online recognition system for handwritten Arabic mathematical symbols. In: Proceeding of the Second International Conference on Advanced Computer Science Applications and Technologies (ACSAT2013). Kuching, Malaysia (2013)
14. Razzak, M.I., Anwar, F., Husain, S.A., Belaid, A., Sher, M.: HMM and fuzzy logic: a hybrid approach for online Urdu script-based languages’ character recognition. Knowledge-Based Syst. 23, 914–923 (2010)
15. Harifi, A., Aghagolzadeh, A.: A new pattern for handwritten Persian/Arabic digit recognition. World Acad. Sci. Eng. Technol. 3, 249–252 (2005)
16. Abuzaraida, M.A., Zeki, A.M., Zeki, A.M.: Problems of writing on digital surfaces in online handwriting recognition systems. In: Proceeding of the 5th International Conference on Information and Communication Technology for the Muslim World (ICT4M), pp. 1–5. Rabat, Morocco (2013)
17. Loader, C.: Local Regression and Likelihood, vol. 47. Springer, New York (1999)
18. Douglas, D.H., Peucker, T.K.: Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica Int. J. Geogr. Inf. Geovisualization 10, 112–122 (1973)
19. Freeman, H.: Computer processing of line-drawing images. ACM Comput. Surv. 6, 57–97 (1974)
20. Abuzaraida, M.A., Zeki, A.M., Zeki, A.M.: Feature extraction techniques of online handwriting Arabic text recognition. In: Proceeding of the 5th International Conference on Information and Communication Technology for the Muslim World (ICT4M), pp. 1–7. Rabat, Morocco (2013)
21. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970)
22. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge (1998)
23. Jones, N.C., Pevzner, P.A.: An Introduction to Bioinformatics Algorithms, illustrated ed. Massachusetts Institute of Technology Press, Cambridge, MA/London (2004)

Author information

Authors and Affiliations

Computer Science Department, Faculty of Information Technology, Misurata University, Misrata, Libya
Mustafa Ali Abuzaraida
Kulliyyah of Information and Communication Technology, International Islamic University Malaysia, Kuala Lumpur, Malaysia
Akram M. Zeki
Department of Information Systems, College of Information Technology, University of Bahrain, Sakhir, Kingdom of Bahrain
Ahmed M. Zeki

Corresponding author

Correspondence to Mustafa Ali Abuzaraida.

Editor information

Editors and Affiliations

Universiti Teknologi MARA, Kedah, Merbok, Malaysia
Abd-Razak Ahmad
Universiti Teknologi MARA, Kedah, Merbok, Malaysia
Liew Kee Kor
Universiti Teknologi MARA, Kedah, Merbok, Malaysia
Illiasaak Ahmad
Universiti Teknologi MARA, Kedah, Merbok, Malaysia
Zanariah Idrus

About this paper

Cite this paper

Abuzaraida, M.A., Zeki, A.M., Zeki, A.M. (2017). Online Recognition of Arabic Handwritten Words System Based on Alignments Matching Algorithm. In: Ahmad, AR., Kor, L., Ahmad, I., Idrus, Z. (eds) Proceedings of the International Conference on Computing, Mathematics and Statistics (iCMS 2015). Springer, Singapore. https://doi.org/10.1007/978-981-10-2772-7_5

DOI: https://doi.org/10.1007/978-981-10-2772-7_5
Published: 24 November 2016
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2770-3
Online ISBN: 978-981-10-2772-7
eBook Packages: Education, Education (R0)
