
1 Introduction

As of July 2013, the United Nations website was available in six languages, while the official website of the European Union could be read in 24 languages. Furthermore, Google supported 90 languages and Wikipedia 295 [25]. However, only a few of these languages are widely diffused. As a consequence, the Web mainly contains texts written in the so-called world languages, such as English, French, German and Spanish (over 70% of all web content). In contrast, the vast majority of languages, i.e., over 95%, have already lost the capacity to ascend digitally [16]. A number of minority languages still exist in Europe despite the strong pressure from majority languages such as English, French, German and Spanish. The latter are clearly the widespread and dominant languages in Europe and on the Web. As a consequence, most speakers of minority languages are bilingual or multilingual. Minority languages are also changing because websites written in them typically have to include content in some of the widespread world languages, giving rise to multilingual websites. Hence, the extraction of text fragments in minority languages and their classification is a real challenge that is worth investigating. Taking Serbian as an example of a minority language, it is used in around 0.1% of all websites [25].

Language identification is the process of recognizing the language of a given text. Many methods have been proposed for language identification. They can be classified into the following groups [13]: (i) letter-based approaches, (ii) word-based approaches, (iii) n-gram approaches, and (iv) language identification using Markov models. Previous research has relied on the statistical analysis of the text content. In the letter-based approach, the frequency distribution of certain letters or common letter combinations is analyzed to facilitate language identification [23]. The problem is that the obtained results are reliable only if the number of words in the sentences is high (above 21). The word-based approach uses only words up to a specific length, or the most frequent words, to establish the language model [12, 22]. Its main limitation is the training phase, which requires a large number of documents to create the language model. Another technique generates an n-gram language model for each language by extracting substrings of length n and computing their frequencies in the text [4, 8]. A problem occurs when the input text is composed of several languages; unfortunately, this approach cannot handle such a case. The methods in the last group use Markov models in combination with Bayesian decision rules to produce a model for each language [9]. It is worth noting that the training process is computationally intensive, requiring at least 50-100 K words to successfully classify small pieces of text longer than 100 characters [18].

In this paper, an image-based method for language identification is proposed to overcome the limitations of the previous approaches. It has the following advantages: (i) it does not require a large amount of text for training, (ii) it is not computationally intensive, (iii) it is able to identify a minority language among the most widespread languages on the Internet, (iv) it accepts as input either Unicode text or scanned text prior to Optical Character Recognition (OCR), and (v) it does not depend on a particular alphabet extension with language-specific letters. The first stage of the method is based on coding: it maps the initial text into a coded text established according to the energy characteristic of each letter, i.e. its vertical position in the text line. A similar approach with six established elements, which are not converted into image elements, was proposed in [21]; still, that method uses a typical n-gram technique for language discrimination. In contrast, in our method the coded text, which comprises four different script types, corresponds to an image with four grey levels. This image is then subjected to co-occurrence statistical analysis in order to extract texture features. In this way, feature extraction and subsequent text classification are performed in the image processing domain. To test the proposed approach, an experiment is conducted on a custom-oriented dataset containing texts mainly from Web documents (in Unicode format) given in different world languages, namely German, Spanish, English and French, and in a minority language, Serbian. The classification is performed by the K-Nearest Neighbors and Naive Bayes algorithms. The differences in the feature values establish an important basis for classification that can identify a minority language such as Serbian among the world languages (in Unicode, i.e. web or PDF documents, or in scanned text documents). This is a new application of the method, which has not been investigated in the previous literature [1, 2]. The classification results are compared with those obtained by an n-gram language model for identification of the minority language. This comparison confirms the superiority of the proposed method in language identification, which was also confirmed in a complex problem such as the discrimination of evolving languages [3].

The remainder of the paper is organized as follows. Section 2 addresses all aspects of the proposed algorithm, including script coding, the definition of the four-grey-level image, texture feature extraction and classification. Section 3 describes the experiment. Section 4 presents and discusses the classification results. Section 5 draws the conclusions and outlines future work directions.

2 Proposed Algorithm

The proposed algorithm is a multi-stage method that consists of the following steps: (i) the Unicode text is mapped into a coded text according to the energy characteristics of its letters, (ii) a four-grey-level image that fully corresponds to the coded text is created, (iii) features are extracted by co-occurrence analysis, (iv) the features are classified, and (v) the language is identified. Figure 1 shows the flow of the proposed algorithm.

Fig. 1.

The flow of the algorithm.

2.1 Unicode Text Mapping

Typically, the text in a document is divided into text lines. Each character can be distinguished according to its energy characteristic, i.e. its horizontal projection profile. Figure 2 shows different characters and their corresponding energy characteristics. We can see that the height of a character corresponds to its energy, and that different characters clearly have different energy characteristics. The most diffused energy is given by the characters that spread over the full text line height, such as the character lj in the Serbian or Croatian language. Taking all the aforementioned into account, we can draw virtual lines in each text line: (i) the top-line, (ii) the upper-line, (iii) the base-line, and (iv) the bottom-line. Together, they establish three vertical zones [26]: (i) the upper zone, (ii) the middle zone, and (iii) the lower zone. All letters can be classified according to these vertical zones. The short letters (S) occupy only the middle zone. The capital letters and letters with ascenders (A) occupy the middle and upper zones. The descender letters (D) occupy the middle and lower zones. The full letters (F) spread over the upper, middle and lower zones. Figure 3 shows the script characteristics according to the letter baseline position.

Fig. 2.

Energy characteristic of different characters: (a) different characters in different text lines, (b) their corresponding energy.

Fig. 3.

Definition of the script characteristics according to baseline status of the text.

According to the vertical zone classification, all letters of the alphabet are substituted with script types. Each letter occupies given zone(s) in the text line and is consequently mapped to a unique element of the set {S, A, D, F}.
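As an illustration, the following Python sketch implements this mapping for Latin letters. The letter sets are illustrative assumptions only; a complete implementation would cover the alphabets of all five considered languages, including digraphs such as lj.

```python
# Sketch of the script-type mapping (letter sets are illustrative assumptions).
# S: short (middle zone), A: ascenders/capitals (middle + upper),
# D: descenders (middle + lower), F: full (all three zones).

ASCENDERS = set("bdfhklt") | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
DESCENDERS = set("gpqy")
FULL = set("j")  # assumed to span all three zones, as 'lj' does in Serbian

def script_type(ch):
    """Map a single letter to one element of {S, A, D, F}."""
    if ch in FULL:
        return "F"
    if ch in ASCENDERS:
        return "A"
    if ch in DESCENDERS:
        return "D"
    return "S"  # short letters occupy only the middle zone

def map_text(text):
    """Map a text line to its script-type sequence, skipping non-letters."""
    return [script_type(c) for c in text if c.isalpha()]

print("".join(map_text("Typing")))  # -> ADDSSD
```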

2.2 Image Creation

In order to easily apply a statistical analysis, the script type should be injectively coded in the following way:

$$\begin{aligned} S \rightarrow 0, A \rightarrow 1, D \rightarrow 2, F \rightarrow 3. \end{aligned}$$
(1)

These codes can be transformed into the grey-level pixels of an image. Hence, the coded text corresponds to an image with four grey levels. Figure 4 shows an example of the coding procedure for a text sample given in the German language.
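A minimal sketch of this image creation step is given below. The row width and the zero-padding of the last row are assumptions of the sketch, since the image geometry is not fixed above.

```python
import numpy as np

CODE = {"S": 0, "A": 1, "D": 2, "F": 3}  # the injective coding of Eq. (1)

def text_to_image(script_types, width=64):
    """Wrap a script-type sequence into a four-grey-level image.

    The row width and the padding of the tail of the last row with 0
    are assumptions of this sketch.
    """
    codes = [CODE[s] for s in script_types]
    rows = -(-len(codes) // width)  # ceiling division
    img = np.zeros((rows, width), dtype=np.uint8)
    for idx, c in enumerate(codes):
        img[idx // width, idx % width] = c
    return img
```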

Fig. 4.
figure 4

An example of coding procedure for a German text: (a) initial text, (b) extraction of characters, (c) coding according to script types, (d) image creation.

2.3 Feature Extraction

At this point, the initial text has been transformed into an image through a variable reduction process. The obtained image is then subjected to texture analysis, which includes the extraction of co-occurrence probabilities that provide second-order texture features [14]. These features are extracted from the image in two steps. In the first step, the pairwise spatial co-occurrences of pixels separated by a particular angle \(\theta \) and distance d are tabulated in a Grey Level Co-occurrence Matrix (GLCM). In the second step, a set of texture measures is calculated from the GLCM. The GLCM shows how often different combinations of grey levels co-occur in a part of or in the whole image [14].

Let us suppose that our grey-scale image is given as I, with M rows, N columns and T grey levels. The GLCM represents the spatial relationship of grey levels in the image I. It is a \(T \times T\) square matrix. To compute the GLCM C, a central pixel I(x, y) with a neighborhood defined by the Window Of Interest (WOI) is considered. The WOI is defined by the inter-pixel distance d and the orientation \(\theta \). Hence, for the given image I, the GLCM C is defined as [10]:

$$\begin{aligned} C(i,j)=\sum _{x=1}^{M}\sum _{y=1}^{N} {\left\{ \begin{array}{ll} 1, &{} \text {if } I(x,y)=i \text { and } I(x+\varDelta x,y+\varDelta y)=j,\\ 0, &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(2)

where i and j are the intensity values of the image I, x and y are the spatial positions in the image I, and the offset \((\varDelta x, \varDelta y)\) is the distance between the pixel-of-interest and its neighbor. It should be noted that the offset depends on the direction \(\theta \) and on the distance d at which the matrix is computed.

In our case, the neighborhood is given as 2-connected only, due to the nature of the text. Accordingly, \(\theta \) is \({0^{\circ }}\), while d is set to the first neighborhood, i.e. \(d = 1\). Then, the normalized matrix P of the GLCM C is calculated as [5]:

$$\begin{aligned} P(i,j)=C(i,j)/ \sum _{i=1}^{T}\sum _{j=1}^{T}C(i,j). \end{aligned}$$
(3)
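For concreteness, a direct Python sketch of Eqs. (2) and (3) for a four-grey-level image with \(\theta = 0^{\circ }\) and \(d = 1\) could look as follows.

```python
import numpy as np

def glcm(img, levels=4):
    """GLCM C of Eq. (2) for theta = 0 degrees and d = 1, i.e. each pixel
    paired with its right-hand horizontal neighbor."""
    C = np.zeros((levels, levels))
    for row in img:
        for a, b in zip(row[:-1], row[1:]):
            C[a, b] += 1
    return C

def normalize(C):
    """Normalized GLCM P of Eq. (3): entries sum to 1."""
    return C / C.sum()
```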

Still, the GLCM provides only a quantitative description of spatial patterns; hence, it is not used directly for practical image analysis. Ref. [14] proposed a set of texture measures that summarize the information contained in the GLCM. Although a total of 14 quantities, i.e. features, were originally proposed, typically only subsets of them are used [17]. Here, the following twelve GLCM texture measures are used: (i) mean value \(\mu _x\), (ii) mean value \(\mu _y\), (iii) standard deviation \(\sigma _x\), (iv) standard deviation \(\sigma _y\), (v) correlation, (vi) energy, (vii) entropy, (viii) maximum, (ix) dissimilarity, (x) contrast, (xi) inverse difference moment and (xii) homogeneity. The twelve co-occurrence elements are shown in Table 1. Hence, after this phase, the four-grey-level image is represented by a 12-dimensional feature vector.

Table 1. Twelve co-occurrence elements.
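As a sketch, a few of the twelve measures can be computed from the normalized matrix P as follows. The definitions follow the common Haralick-style formulas, which may differ in detail from the exact versions used in [14].

```python
import numpy as np

def texture_measures(P):
    """A few of the twelve GLCM measures of Table 1 (Haralick-style forms)."""
    T = P.shape[0]
    i, j = np.meshgrid(np.arange(T), np.arange(T), indexing="ij")
    nz = P[P > 0]  # avoid log(0) in the entropy term
    return {
        "energy": float(np.sum(P ** 2)),
        "entropy": float(-np.sum(nz * np.log(nz))),
        "contrast": float(np.sum(P * (i - j) ** 2)),
        "dissimilarity": float(np.sum(P * np.abs(i - j))),
        "homogeneity": float(np.sum(P / (1.0 + (i - j) ** 2))),
        "maximum": float(P.max()),
    }
```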

2.4 Feature Classification

In order to classify the obtained feature vector, the K-Nearest Neighbors and Naive Bayes methods have been used, which are well-known algorithms for data classification.

K-Nearest Neighbors. K-Nearest Neighbors (K-NN) is a very simple approach to classifying feature vectors [7, 11]. Let Tr be the training set, composed of n feature vectors with associated class labels, and \(x_t\) be a test feature vector to classify. Classification of \(x_t\) is performed by computing the distance between \(x_t\) and each of the n training vectors in Tr. The K training vectors \(Tr'\) nearest to \(x_t\) are then considered, where K is a fixed parameter of the algorithm determining the size of the neighborhood. The predicted class label for \(x_t\) is the one occurring most frequently in \(Tr'\). Because even values of K can produce ties between class-label frequencies in \(Tr'\) [19], the value of K is usually fixed to a small odd integer. When the instances are fixed-length vectors of real-valued features, the distance function adopted for K-NN is often the Euclidean one. However, different distance functions can determine variations in the similarity evaluation [19]: depending on the chosen distance function, two vectors \(x_i\) and \(x_j\) can be considered more or less similar to each other. Consequently, other functions, such as the Manhattan distance or the Chebyshev distance, can be selected for K-NN.
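A minimal sketch of the K-NN prediction step with the Euclidean distance is shown below.

```python
import numpy as np
from collections import Counter

def knn_predict(Tr, labels, x_t, K=3):
    """Predict the label of x_t by majority vote among its K nearest
    training vectors in Tr, using the Euclidean distance."""
    dists = np.linalg.norm(Tr - x_t, axis=1)
    nearest = np.argsort(dists)[:K]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```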

Naive Bayes. The Naive Bayes (NB) classifier is a probabilistic learning method based on the assumption that all variables are mutually independent given the class variable [20]. The classifier is defined as: \(f_{nb}(x_i)=\frac{p(Y=1)}{p(Y=0)}\prod _{k=1}^h \frac{p(x_i^k|Y=1)}{p(x_i^k|Y=0)}\), where \(x_i=\{x_i^1,...,x_i^h\}\) is a vector of h features and Y is the class variable. In order to classify the test feature vector \(x_i\) into class 1 or class 0, the probability of each of its features conditioned on class 1 or 0 and the probabilities of occurrence of classes 1 and 0 in the training set are computed. \(x_i\) is predicted to be in class 1 if and only if \(f_{nb}(x_i) \ge 1\); otherwise, it is predicted to be in class 0.

For numerical features, the normal distribution is considered for computing the probability values:

$$\begin{aligned} f(w,\mu , \sigma )=\frac{1}{\sqrt{2\pi }\sigma }{e^{-\frac{(w-\mu )^2}{2\sigma ^2}}}. \end{aligned}$$
(4)

Accordingly, \(p(x_i^k|Y=1)=f(x_i^k,\mu _{Y=1}, \sigma _{Y=1})\), where \(\mu _{Y=1}\) and \(\sigma _{Y=1}\) are respectively the mean and the standard deviation of the values of the k-th feature over the training instances of class 1.
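The following sketch combines Eq. (4) with the ratio \(f_{nb}\) defined above; the layout of the per-class statistics is an assumption of the sketch.

```python
import numpy as np

def gaussian(w, mu, sigma):
    """Normal density of Eq. (4), used for p(x^k | Y)."""
    return np.exp(-(w - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def nb_ratio(x, stats, prior1, prior0):
    """Naive Bayes ratio f_nb(x); predict class 1 iff the ratio is >= 1.

    stats[k] holds ((mu1, s1), (mu0, s0)), the per-class mean and standard
    deviation of the k-th feature (a layout assumed for this sketch).
    """
    ratio = prior1 / prior0
    for k, ((mu1, s1), (mu0, s0)) in enumerate(stats):
        ratio *= gaussian(x[k], mu1, s1) / gaussian(x[k], mu0, s0)
    return ratio
```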

3 Experiment

A test is conducted to evaluate how well the proposed method identifies a minority language among a set of world languages. Accordingly, a custom-oriented dataset of web texts in Unicode format and in different languages is employed. It is a set of text excerpts in the German, Spanish, English, French and Serbian languages. It consists of 150 texts in total, divided into two classes of 25 Serbian and 125 world-language texts (German, Spanish, English and French). The class label of each text in the dataset is accordingly Serbian or world language. All texts differ in content and size; the size of the excerpts ranges between 378 and 2822 characters. The length of the texts is chosen according to the sample-size standard in factor analysis, which requires the total number of analyzed elements to be higher than 300 [24]. Accordingly, each of our text samples contains more than approximately 300 characters. Figure 5 illustrates a web page in the Serbian language from which Unicode text is extracted.

Fig. 5.

Web page sample in Serbian language.

The proposed features are extracted from each text in the dataset, in order to create 150 feature vectors. Then, K-NN and NB algorithms are adopted on the feature representation of the dataset for identification of the minority language. Finally, classification results are compared and discussed.

Fig. 6.

The twelve GLCM features obtained from the dataset for the world and Serbian languages in a min-max manner.

4 Results and Discussion

Figure 6 and Table 2 show the GLCM features obtained from the dataset for the world and Serbian languages in a min-max manner. \(\mu _x\) and \(\mu _y\) are the same, as are \(\sigma _x\) and \(\sigma _y\), up to two decimal places; hence, a single graph representing both is sufficient. It is worth noting that the Serbian language can mostly be discriminated from the world languages by the GLCM dissimilarity and contrast: overall, Serbian texts have much smaller values of dissimilarity and contrast.

Table 2. GLCM feature values from Fig. 6 in the min-max manner.

Because the classification accuracy depends on the choice of the training and test sets, the dataset is processed by a k-fold cross-validation strategy [15] to obtain results on multiple training and test sets. The dataset is randomly divided into k folds. Then, each fold is in turn considered as the test set, with the remaining \(k-1\) folds as the training set. Each fold has roughly equal size and roughly the same language class proportions as the whole dataset. Consequently, each test set is composed of a small number of Serbian texts and a larger number of world-language texts. Model learning by K-NN and NB is performed each time on the current training set, and the classification is then evaluated on the current test set. The K value of the K-Nearest Neighbors classifier has been fixed to small odd values (see Sect. 2.4), i.e. 1, 3 and 5. Furthermore, the K-NN algorithm has been executed with the traditional Euclidean distance, which proved to be particularly reliable in this context with respect to other distance measures. Because the feature vectors have numerical values, the probabilities of the NB classifier have been computed using (4).

Our task lies in the domain of binary classification, because we need to learn a model that correctly predicts the classes representing texts written in the world languages (German, Spanish, English and French) and in the minority Serbian language. The problem of language classification and identification is similar to the information retrieval one. Hence, precision, recall and f-measure are the preferred metrics for the evaluation of the proposed algorithm [6]. They are calculated from the confusion matrix between the classification results obtained on the test set and the ground-truth partitioning of the test set into Serbian and world languages. Performance measures have been computed for each selection of the test and training folds, and the average values together with the standard deviation have been reported for each measure. Furthermore, k-fold cross-validation has been repeated separately for three different values of k, equal to 2, 5 and 10 [15]. Finally, to avoid the dependence of the classification results on the particular division into folds, k-fold cross-validation has been executed 50 times for each value of k.
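As an illustration of this protocol, the following sketch uses scikit-learn to run the repeated stratified k-fold evaluation and collect the per-class measures. The use of scikit-learn and the label names 'serbian'/'world' are assumptions of the sketch; the original experiments were run in MATLAB.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_recall_fscore_support

def evaluate(X, y, k=10, K=3, runs=50):
    """Repeated stratified k-fold evaluation; returns the mean and standard
    deviation of per-class precision, recall and f-measure.

    X: the 150 x 12 feature matrix; y: array of class labels
    ('serbian'/'world' are hypothetical names for this sketch).
    """
    scores = []
    for r in range(runs):
        skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=r)
        for train, test in skf.split(X, y):
            clf = KNeighborsClassifier(n_neighbors=K).fit(X[train], y[train])
            p, rec, f, _ = precision_recall_fscore_support(
                y[test], clf.predict(X[test]),
                labels=["serbian", "world"], zero_division=0)
            scores.append(np.stack([p, rec, f]))
    return np.mean(scores, axis=0), np.std(scores, axis=0)
```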

The classification results obtained by the proposed method with the K-NN or NB classifier are very positive. The evaluation reveals a perfect identification of the Serbian texts in the test set: precision, recall and f-measure reach a value of 1 in all cases, for k values of the cross-validation equal to 2, 5 and 10, for all 50 runs, and for K values of the Nearest Neighbors classifier fixed to 1, 3 and 5. The classification results of the proposed method are compared with those obtained by the n-gram language model. In particular, each text in the dataset is represented by the normalized frequency values of its extracted bi-grams. The same classification experiment is performed with the bi-gram feature vectors by adopting the K-NN and NB algorithms and k-fold cross-validation. The same parameter values for the classifiers are also selected in the experiment with bi-grams.
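For reference, a bi-gram feature vector of the kind used in this comparison can be computed as follows; how the bi-gram vocabulary is built is an assumption of the sketch.

```python
from collections import Counter

def bigram_features(text, vocab):
    """Normalized bi-gram frequency vector over a fixed bi-gram vocabulary
    (e.g. all bi-grams seen in the training texts, assumed here)."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values()) or 1
    return [counts[b] / total for b in vocab]
```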

Table 3 reports the classification results of bi-grams with the K-NN classifier and Euclidean distance at the different values of \(k=2, 5,10\) folds and with \(K=1, 3, 5\).

Table 3. Average results in terms of precision, recall and f-measure, together with the standard deviation (in parenthesis), obtained by bi-grams and K-NN classifier using k-fold cross-validation.

Although precision, recall and f-measure are in the range 0.91-1.00, it is worth noting that bi-grams do not achieve a perfect identification of the minority language among the world languages in all cases. In fact, with 10 folds the f-measure for the Serbian class is in the range 0.98-1.00, while with 5 folds it varies in the range 0.98-0.99. Finally, with 2 folds the f-measure is in the range 0.95-0.99, depending on the K value.

Table 4 shows the classification results of bi-grams with the NB classifier for the different values of \(k=2, 5, 10\) folds. We can observe a poor classification result with respect to that obtained by our method. In particular, the bi-gram representation mostly fails in the recognition of the minority Serbian language, with a highest f-measure value of 0.44 in the 2-fold case. Also, the classification of the world languages is not perfect, reaching a peak of 0.95 in the 2-fold case.

Table 4. Average results in terms of precision, recall and f-measure, together with the standard deviation (in parenthesis), obtained by bi-grams and NB classifier using k-fold cross-validation.

Figure 7 illustrates the f-measure values obtained by our method and by bi-grams using K-NN and NB classifiers for the different k-folds. Bars represent the f-measure values obtained by bi-grams. The black dashed lines represent the value of f-measure equal to 1 obtained by our method in all cases. This graphical comparison confirms that our method outperforms the competitor in most cases.

Fig. 7.

F-measure values obtained by our method and by the bi-gram feature representation. Bars represent the f-measure values obtained by bi-grams. The black dashed lines represent the value of f-measure equal to 1 obtained by our method in all cases.

These results indicate that bi-grams are not totally robust and reliable for the identification of a minority language such as Serbian among the world languages. Tri-grams did not obtain a meaningful improvement with respect to bi-grams, similarly to [3]; consequently, their results are omitted. On the contrary, the proposed feature representation proved to be robust and effective in solving the same task on the same dataset and under the same operating conditions when two different machine learning algorithms were employed. Hence, it is a very promising approach to language identification.

The experiment has been performed in MATLAB R2015a, on a desktop computer with a quad-core 2.3 GHz CPU, 8 GB of RAM and the Windows 7 operating system. With this configuration, the CPU time of the feature extraction procedure is below 0.1 s, while the CPU time of the feature classification by K-NN and NB is around 0.003 s. Hence, the method has proven to be computationally non-intensive.

5 Conclusion

This paper proposed an image-based method for minority language identification that relies on a statistical analysis of the text based on the text-line status of each script element. The statistical analysis was performed by means of the grey level co-occurrence matrix. Due to the differences in the language characteristics, the results of the statistical analysis in terms of features showed significant dissimilarity. Classification of the introduced features was performed by the well-known supervised learning algorithms K-Nearest Neighbors and Naive Bayes. The proposed method was tested on text excerpts from a custom-oriented dataset, which incorporated texts given in German, Spanish, English and French (world languages) and in Serbian (minority language). The experiments showed encouraging results: the minority language was perfectly classified by the proposed method. Furthermore, a comparison with the n-gram language model demonstrated the superiority of the proposed feature representation in minority language identification. The research presented in this manuscript can be used for language identification on the Web, in the preprocessing steps of OCR, and for video text identification.

Future work will enlarge the dataset with more complex samples and will employ other well-known classification algorithms for the experiment, such as deep learning based approaches.