1 Introduction

Reading is an integral part of language teaching, and it presents certain technical difficulties for teachers and students alike. These difficulties are mainly connected with finding and understanding texts of a specific level of difficulty (corresponding to a student's knowledge). Several current research projects focus on obtaining texts with the readability level needed for educational purposes. The first approach is to classify texts by level and retrieve the texts needed; the second is to take an arbitrary text and simplify it to the target readability.

This paper describes part of a project whose aim is to develop a system that adapts texts to a target level of Russian as a foreign language (RFL). Within this project on automatic simplification of texts in accordance with the learner's language level, we addressed the problem of identifying the source and target difficulty levels of sentences and texts; the next step will be their lexical and syntactic simplification. In this study we report the results of applying a number of models that identify the difficulty level of a text or a single sentence using different statistical parameters.

In Sect. 2 (Related work) we give an overview of research on classifying texts by reading complexity in Russian, English and French. In Sect. 3 (Text readability prediction) classical models and a model developed specifically for Russian are adapted to our resources and tested on texts of several difficulty levels. Section 4 (Sentence classification) describes the results of applying the Flesch-Kincaid and Dale-Chall formulas to identify the lexical and structural complexity of Russian sentences. In Sect. 5 (Sentence classification with syntactic features) we present a variant of the model for effective identification of the readability of Russian sentences using syntactic features. Section 6 (Conclusion) gives general conclusions on the experiments performed and plans for further work and improvement of the models considered.

2 Related Work

The first studies of text complexity date back to the 1920s. This field of research was mainly developed in work on the English language, but over the last decade a number of studies on other languages have been carried out, which shows that research on automatic identification of text complexity is still highly relevant.

Reading complexity can be represented as a function that maps a set of variables extracted from a text to one of several predefined complexity levels. The variables traditionally used to characterize texts are divided into two groups: lexical parameters and syntactic parameters. In one of the most common formulas, the Flesch-Kincaid formula [1, 2], the complexity of a text is represented as a linear function of the average number of syllables per word and the average sentence length.

The formula of Dale and Chall [3] also measures the syntactic difficulty of a text as the average sentence length, but as a lexical metric it uses the percentage of words not on a list of 3000 easy words, which is based on word familiarity: all words on the list are familiar to US children in the 4th grade.
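As a rough illustration, the Dale-Chall score can be sketched as follows, assuming the standard published coefficients (0.1579, 0.0496, and the +3.6365 adjustment when more than 5 % of words are unfamiliar); the `EASY_WORDS` set here is a tiny stand-in for the real 3000-word familiarity list:

```python
# Sketch of the original Dale-Chall readability score.
# EASY_WORDS is a toy stand-in for the 3000 easy-word list.
EASY_WORDS = {"the", "cat", "sat", "on", "mat", "a", "dog", "ran"}

def dale_chall_score(sentences):
    """sentences: list of lists of lowercase word tokens."""
    words = [w for sent in sentences for w in sent]
    pct_difficult = 100.0 * sum(1 for w in words if w not in EASY_WORDS) / len(words)
    avg_sentence_len = len(words) / len(sentences)
    score = 0.1579 * pct_difficult + 0.0496 * avg_sentence_len
    if pct_difficult > 5.0:       # adjustment for texts with many unfamiliar words
        score += 3.6365
    return score

print(round(dale_chall_score([["the", "cat", "sat", "on", "the", "mat"]]), 4))  # 0.2976
```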

With the growth of computing power it became possible to build more complex models. The model of Collins-Thompson and Callan (2005) [4] uses word unigram frequencies (a dictionary is specified for each language level) and the fact that some words are most predictive of a certain text complexity level. Schwarm and Ostendorf [5] use more complex syntactic parameters: the average height of parse trees, the number of noun and verb phrases, the average number of non-terminal nodes, and so on.

Automatic identification of reading difficulty in Russian has also been investigated in a number of works. Oborneva (2006) [6] adapts the Flesch and Flesch-Kincaid formulas to Russian by means of adjustment coefficients: she compares the average syllable length of English and Russian words and the percentage of multi-syllable words in dictionaries of the two languages. Also worth noting is the study by Krioni, Nickin and Philippova, who assess the complexity of educational texts in Russian using a number of more complex parameters: connectivity, structure, integrity, functional-semantic type, information content, abstractness of presentation and complexity of linguistic constructions [7].

Given the large amount of research on readability assessment, we have highlighted only the most prominent works. Nevertheless, all of them identify the reading difficulty of a whole text. Our goal is to determine the efficiency of the developed techniques for Russian both on texts in general and on sentences in particular, as well as to test our own model for determining the reading difficulty of a sentence.

3 Text Readability Prediction

The first task was to build a prototype for retrieving Russian texts of a required readability level. The main goal was to find which variables and which classification algorithm yield the highest precision and recall of readability prediction. We conducted a series of experiments with the following classification algorithms:

  • naive Bayes;

  • k-nearest neighbors;

  • classification tree;

  • random forests;

  • SVM.
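As a minimal sketch of one of the algorithms listed above, here is a pure-Python k-nearest-neighbors classifier with leave-one-out evaluation. The two-feature data points (average sentence length, percentage of words outside the A1 vocabulary) are invented for illustration only and do not come from our corpus:

```python
# Toy k-NN classifier with leave-one-out cross-validation.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(train, query, k=3):
    """train: list of (features, label); returns majority label of k nearest."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# invented (avg sentence length, % words outside A1 vocabulary) vectors
data = [
    ((5.0, 10.0), "A1"), ((6.0, 12.0), "A1"), ((7.0, 15.0), "A1"),
    ((14.0, 45.0), "C2"), ((16.0, 50.0), "C2"), ((15.0, 55.0), "C2"),
]

# leave-one-out accuracy: each point is classified by all the others
correct = sum(
    knn_predict(data[:i] + data[i + 1:], feats) == label
    for i, (feats, label) in enumerate(data)
)
print(correct / len(data))  # well-separated toy data: 1.0
```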

Evaluation was performed using cross-validation on the test part of our collection. We extracted features from a collection of 219 texts divided into four groups according to the levels described in the Common European Framework of Reference for Languages (CEFR) [8]: A1 (elementary) – 52 texts, A2 (basic) – 57, B1 (first) – 60, C2 (difficult) – 50. The first three groups include texts created specially for second-language learners of Russian, with respect to their level of language knowledge, on the basis of news articles. The fourth group (difficult) consists of original news texts for native readers. We extracted 25 variables proposed in previous work:

  1. Average number of words per sentence of the text.

  2. Average length of one word in a sentence.

  3. Text length in letters.

  4. Text length in words.

  5. Average sentence length in syllables.

  6. Average word length in syllables.

  7. Percentage of words with a number of syllables greater than or equal to N, for each N from 3 to 6.

  8. Average sentence length in letters.

  9. Average word length in letters.

  10. Percentage of words with a number of letters greater than or equal to N, for each N from 5 to 13.

  11. The percentage of words in a sentence not included in the active vocabulary of level A1.

  12. The percentage of words in a sentence not included in the active vocabulary of level A2.

  13. The percentage of words in a sentence not included in the active vocabulary of level B1.

  14. The occurrence of specific parts of speech in the sentence.

We mark seventeen parts of speech in the texts according to the list of grammemes in OpenCorpora [9]: noun (NOUN), full adjective (ADJF), short adjective (ADJS), comparative (COMP), finite verb form (VERB), infinitive (INFN), full participle (PRTF), short participle (PRTS), gerund (GRND), numeral (NUMR), adverb (ADVB), pronoun (NPRO), predicative (PRED), preposition (PREP), conjunction (CONJ), particle (PRCL), interjection (INTJ). We were interested in the occurrence of parts of speech as proposed by Francois (2009) [10].

We did not use some of the variables described in [11] because of the nature of our texts: paragraph-related variables were omitted because our texts are very short, and since the texts have no syntactic markup, the notion of a phrase was not used either.
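A minimal sketch of extracting a few of the 25 variables is given below. Syllables in Russian words are approximated by counting vowel letters, a common heuristic; the `A1_VOCAB` set is a tiny stand-in for the real A1 active-vocabulary list:

```python
# Sketch of extracting a handful of the text-level variables.
RU_VOWELS = set("аеёиоуыэюя")
A1_VOCAB = {"мама", "дом", "я", "в", "живу"}   # toy stand-in list

def syllables(word):
    """Approximate syllable count: one per Cyrillic vowel letter."""
    return sum(1 for ch in word.lower() if ch in RU_VOWELS)

def text_features(sentences):
    """sentences: list of lists of lowercase word tokens."""
    words = [w for s in sentences for w in s]
    n = len(words)
    return {
        "avg_words_per_sentence": n / len(sentences),                         # variable 1
        "text_length_words": n,                                               # variable 4
        "avg_word_syllables": sum(syllables(w) for w in words) / n,           # variable 6
        "pct_ge_3_syllables": 100.0 * sum(syllables(w) >= 3 for w in words) / n,  # variable 7, N=3
        "pct_not_in_A1": 100.0 * sum(w not in A1_VOCAB for w in words) / n,   # variable 11
    }

feats = text_features([["я", "живу", "в", "москве"]])
print(feats["pct_not_in_A1"])  # "москве" is outside the toy list: 25.0
```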

The first experiment was binary readability classification: A1 versus C2, A2 versus C2, B1 versus C2. With the Classification Tree, SVM and Logistic Regression algorithms the accuracy was very high, almost equal to 1.

The second experiment, classification of texts into four levels, yielded lower accuracy. An example of the accuracy of retrieving texts with readability level B1 is shown in Table 1.

Table 1. Results of texts retrieval with B1 readability level.

Here kNN is the k-nearest neighbors method. The results of the second experiment are worse than those of the first experiment with only two levels. Since the accuracy of the Classification Tree method reached 99 %, we consider the obtained results sufficient for our needs.

To analyze the effect of each variable on discriminating texts into the 4 levels, we ranked the variables by information gain ratio [12] (Table 2).

Table 2. Texts variables ranked by information gain ratio (top 10).

The three variables with the highest information gain ratio are lexical ones; we can say that they are the most important variables for discrimination.
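The ranking above relies on the information gain ratio. A minimal sketch of how it is computed for a discretized feature, with toy labels and feature values:

```python
from math import log2
from collections import Counter

# Gain ratio = (H(class) - H(class | feature)) / H(feature).
def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    n = len(labels)
    cond = 0.0                                  # conditional entropy H(class | feature)
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    split_info = entropy(feature_values)        # normalizer penalizing many-valued features
    return (entropy(labels) - cond) / split_info if split_info else 0.0

labels = ["A1", "A1", "C2", "C2"]
print(gain_ratio(["low", "low", "high", "high"], labels))  # perfectly discriminating: 1.0
print(gain_ratio(["low", "high", "low", "high"], labels))  # uninformative: 0.0
```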

4 Sentence Classification

The next task was to prototype an algorithm for retrieving difficult sentences for further simplification. The algorithm is based on classifying a sentence with respect to its readability. For evaluation we use a subcorpus of the Russian National Corpus (RNC), the SynTagRus corpus [13], which has morphological and syntactic metadata. We manually tagged 3500 sentences from this subcorpus to mark their structural level of perception complexity. Since level B1 suits the majority of our students, we created a binary sentence markup: (1) B1 or lower and (2) higher than B1.

Lexical difficulty markup was made on the basis of the active vocabulary of three levels: A1, A2 and B1. The most complete vocabulary list (B1) includes 2500 words. We defined sentences in which more than 33 % of the words fall outside the active vocabulary as lexically difficult.

Thus, we have two kinds of markup: structural complexity and lexical difficulty. By intersecting the lexical and structural difficulty levels we obtained the markup of the total difficulty level.

The Dale-Chall model defines the difficulty of a text as a linear function of the following variables: average sentence length (number of words divided by number of sentences) and the proportion of rare words in the text.

To use these variables for single-sentence readability prediction, we adapt them as follows: sentence length instead of average sentence length, and the percentage of words not in the active vocabulary with respect to sentence length (the number of words in the sentence not in the vocabulary divided by the total number of words in the sentence) instead of the rare-word percentage. In our case we do not need a frequency dictionary of Russian, because we have a definite list of words contained in the active vocabulary.
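The two adapted variables can be sketched as follows; the `B1_VOCAB` set here is a tiny stand-in for the real 2500-word B1 active-vocabulary list:

```python
# Sketch of the two sentence-level variables: length in words and the
# percentage of words outside the active vocabulary.
B1_VOCAB = {"я", "читаю", "книгу", "дома"}   # toy stand-in list

def sentence_variables(tokens):
    """tokens: list of lowercase word tokens of one sentence."""
    n = len(tokens)
    pct_unknown = 100.0 * sum(1 for w in tokens if w not in B1_VOCAB) / n
    return {"length": n, "pct_not_in_vocab": pct_unknown}

print(sentence_variables(["я", "читаю", "интересную", "книгу"]))
# {'length': 4, 'pct_not_in_vocab': 25.0}
```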

These two variables were automatically extracted for each sentence in our corpus. We predicted the readability of single sentences using the different machine learning methods listed in Sect. 3. Evaluation was performed using cross-validation on the test part of the corpus.

To evaluate the influence of each variable, we first predicted difficulty using each variable separately and then using both. Even prediction based on sentence length alone gives good results, but if we classify into more than two levels, accuracy decreases. The precision of difficult-sentence retrieval is lower than that of simple-sentence retrieval.

The accuracy of readability prediction based on both variables is much higher. The second variable, the percentage of words not in the active vocabulary, cuts off many difficult short sentences, which improves the precision of difficult-sentence retrieval. The results are presented in Table 3.

Table 3. Results of readability prediction using variables: sentence length and percentage of words not in the active vocabulary.

Here the situation is the opposite: the precision of difficult-sentence retrieval is higher than that of simple-sentence retrieval. We can conclude that even with only these two variables we can effectively predict sentence readability.

The Flesch-Kincaid grade-level formula was also used to determine readability: [(0.39 × ASL) + (11.8 × ASW) − 15.59], where ASL is the average sentence length (number of words divided by number of sentences) and ASW is the average number of syllables per word (number of syllables divided by number of words). To apply this formula to estimating the difficulty of a single sentence, we keep ASW in its original form and replace ASL with the sentence length (number of words).
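The adapted formula can be sketched as follows, again approximating Russian syllable counts by counting vowel letters (a heuristic, not part of the original formula):

```python
# Flesch-Kincaid grade formula adapted to a single sentence:
# ASL -> sentence length in words, ASW unchanged.
RU_VOWELS = set("аеёиоуыэюя")

def syllables(word):
    """Approximate syllable count: one per Cyrillic vowel letter."""
    return sum(1 for ch in word.lower() if ch in RU_VOWELS)

def fk_grade_sentence(tokens):
    asl = len(tokens)                                      # sentence length in words
    asw = sum(syllables(w) for w in tokens) / len(tokens)  # avg syllables per word
    return 0.39 * asl + 11.8 * asw - 15.59

print(round(fk_grade_sentence(["я", "живу", "в", "москве"]), 2))  # 0.72
```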

Analyzing how lexical difficulty is predicted with the help of the average syllables per word (ASW), it is easy to notice that ASW affects the classification accuracy of difficult sentences but not of simple ones. The reason is that Russian has many long words (with many syllables) that are nevertheless simple, because they are formed by combining short words. Using the two variables (ASL and ASW) we obtained the results shown in Table 4.

Table 4. Results of difficult/simple sentence retrieval from text using ASL and ASW.

The total accuracy with only two variables is relatively high, but the recall of simple-sentence retrieval is quite low. Membership in the active vocabulary of the first certificate level of Russian cannot be determined exactly from the average number of syllables per word.

5 Sentence Classification Using Syntactic Structure

We use deeper sentence features that can potentially improve the accuracy of readability prediction: syntactic relations between words. The experiment was carried out on the SynTagRus corpus, which has morphological and syntactic metadata. We decided to base the classification algorithm on syntactic features of a sentence because at the preliminary stage this approach showed better results than morphological features or n-grams. The classification task then looks as follows: sentences are tagged with morphological metadata using OpenCorpora [9]; on the basis of these morphological tags we generate syntactic links; and these syntactic links help us predict single-sentence readability.
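Turning a dependency-parsed sentence into classifier input can be sketched as a vector of link-type counts. The link labels below are purely illustrative (SynTagRus defines its own inventory of about 60 relation types):

```python
from collections import Counter

# Sketch: count occurrences of each syntactic link type in a sentence.
# The labels here are hypothetical placeholders, not the SynTagRus inventory.
LINK_TYPES = ["predic", "attrib", "compl", "circ"]

def link_features(links):
    """links: list of link-type labels for one parsed sentence."""
    counts = Counter(links)
    return [counts.get(t, 0) for t in LINK_TYPES]

print(link_features(["predic", "compl", "compl"]))  # [1, 0, 2, 0]
```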

SynTagRus includes about 60 types of syntactic links, grouped as proposed in the RNC. We tried to predict sentence readability with two data representations. First we used all 60 types of syntactic links; the experimental results are shown in Table 5.

Table 5. Classification using 60 types of links.

Then we used links aggregated into 4 groups, as proposed in the RNC. Classification accuracy with the aggregated variables was lower, so we concluded that all link types should be used without aggregation. The best precision and recall were shown by the SVM algorithm.

It is natural to assume that syntactic variables predict structural difficulty better. We therefore used the same approach as with the previous models and performed experiments with structural and lexical difficulty separately. The results are presented in Table 6.

Table 6. Results of structural difficulties prediction using only syntactic variables.

We can conclude that syntactic variables predict structural difficulty more efficiently than the simple variables do. Next we used all kinds of variables (syntactic and lexical) to predict the total difficulty of a sentence; as a lexical variable we used the percentage of words outside the active vocabulary of the corresponding level (Table 7).

Table 7. Results of total readability prediction using all kinds of variables and syntactic links.

The latter approach gives more stable results and may be used to increase the number of sentence complexity classes (Table 8).

Table 8. Results of total readability prediction using all kinds of variables and syntactic links.

6 Conclusion

Classical models and models developed specially for Russian were adapted to news text retrieval and gave good results: we managed to develop a precise system for classifying Russian news texts with respect to their readability.

The accuracy of four-level classification was lower. Since the Classification Tree and Random Forest methods reached 98–99 %, we can say that they met our needs.

We adapted traditional classification techniques with statistical features, such as Flesch-Kincaid and Dale-Chall, to identify the lexical and structural complexity of Russian sentences. These techniques were tested on a set of sentences whose readability was manually marked for binary classification.

Finally, we arrived at a variant of the model that effectively identifies the readability of Russian sentences using syntactic links. We found that syntactic features can predict structural complexity. The full set of statistical, lexical and syntactic features predicts sentence readability with a recall of 0.9661 using the Random Forest algorithm. The most important features for this classification are lexical ones.