Abstract
Readability is a linguistic feature that indicates how difficult a text is to read. Traditional readability formulas were developed for the English language. This study evaluates their adequacy for the Portuguese language. We applied the traditional formulas to 10 parallel corpora and verified that the Portuguese language obtained higher grade scores (lower readability) in the formulas that use the number of syllables per word or the number of complex words per sentence. Formulas that use letters per word instead of syllables per word output similar grade scores. Considering this, we evaluated the correlation of complex words with the grade of 65 Portuguese school books spanning 12 schooling years. We found that defining a complex word as a word with 4 or more syllables, instead of 3 or more syllables as originally used in traditional formulas applied to English texts, is more correlated with the grade of Portuguese school books. Finally, we adapted each traditional readability formula to the Portuguese language by performing a multiple linear regression on the same dataset of school books.
1 Introduction
Readability refers to the difficulty of reading a particular text. Its automatic assessment is an important research topic nowadays: it is essential for efficient learning (a student should read texts appropriate to their level of education), for evaluating automatic text simplification methods, and more. One way to evaluate readability is through formulas that consider lexical difficulty (word difficulty, assessed, for example, by the average number of syllables per word) and syntactic difficulty (sentence length, assessed, for example, by the average number of words per sentence). The classic readability formulas were developed for English, and there are no equivalent formulas for the Portuguese language. Any adaptation of these formulas to Portuguese must take into account the main differences between the two languages. For example, the number of syllables or characters per word is, on average, higher in Portuguese.
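The two quantities above (words per sentence, syllables per word) can be sketched as a small text-statistics routine. This is an illustrative stdlib-only sketch, not the authors' Java implementation; the vowel-group syllable counter is a naive heuristic that ignores Portuguese diphthong and hiatus rules.

```python
import re

def vowel_groups(word):
    """Crude syllable estimate: count runs of consecutive vowels.
    Naive heuristic for illustration only; proper syllabification
    (especially for Portuguese) is considerably more involved."""
    return max(1, len(re.findall(r"[aeiouáéíóúâêôãõàü]+", word.lower())))

def text_stats(text):
    """Average words per sentence and (estimated) syllables per word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)
    return {
        "words_per_sentence": len(words) / len(sentences),
        "syllables_per_word": sum(vowel_groups(w) for w in words) / len(words),
    }
```

Running `text_stats` on a short Portuguese passage yields the two averages that feed the formulas discussed next.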
This research is divided into two phases. In the first phase, we use English-Portuguese parallel corpora to compare the application of the traditional formulas in the two languages. In this phase, we evaluate the main linguistic differences between the two languages. In the second phase, we discuss how the differences found can be applied to Portuguese readability assessment, using a set of 65 Portuguese school books. In the end, we propose new readability formulas for the Portuguese language, adapted from each English readability formula. In both phases, we consider the five traditional readability formulas presented in Table 1. All the formulas output the grade level required to understand a text.
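Since Table 1 is not reproduced here, the five formulas can be written out with their standard published coefficients (ARI [12], Coleman-Liau [2], Flesch-Kincaid [8], Gunning Fog [6], SMOG [11]); a "complex word" is, in the original English formulations, a word with 3 or more syllables:

```python
import math

def ari(chars, words, sentences):
    """Automated Readability Index: characters per word, words per sentence."""
    return 4.71 * chars / words + 0.5 * words / sentences - 21.43

def coleman_liau(letters, words, sentences):
    L = 100.0 * letters / words      # letters per 100 words
    S = 100.0 * sentences / words    # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

def flesch_kincaid(words, sentences, syllables):
    return 0.39 * words / sentences + 11.8 * syllables / words - 15.59

def gunning_fog(words, sentences, complex_words):
    return 0.4 * (words / sentences + 100.0 * complex_words / words)

def smog(complex_words, sentences):
    """SMOG, normalized to a 30-sentence sample."""
    return 1.0430 * math.sqrt(complex_words * 30.0 / sentences) + 3.1291
```

Note that only `ari` and `coleman_liau` rely on character counts; the other three depend on syllable or complex-word counts, which is exactly the split that matters in the cross-language comparison below.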
2 Background and Related Work
Applying traditional readability formulas to other languages is not new. They have already been applied to school texts in Brazilian Portuguese [10]. Martins et al. introduced a shift of 42 points in the Flesch reading ease test, due to the higher number of syllables in Portuguese words when compared to English. The authors found that this adaptation of the Flesch formula score (a 42-point decrease in readability) was more pronounced in texts from the elementary school years.
A study carried out in 2012 [9] compared the readability of five books from translation courses in English and their translations into the Persian language. The readability formulas used were the Gunning Fog Index (GFI) and the Flesch New Reading Ease formula. Samples of text were randomly chosen from each original book. The results showed that texts translated into Persian were less readable than the original English texts.
In addition to the Persian language, in 2014, a similar study was carried out comparing the readability between the Swedish and English languages [15]. Three algorithms were used: Coleman-Liau Index (CLI), Läsbarhetsindex (LIX) and Automated Readability Index (ARI). The texts used were a collection of Wikipedia articles, “On the Origin of Species” by Charles Darwin and the Bible and their respective translations. The tests showed that both ARI and LIX work for both Swedish and English on less readable texts. CLI, however, seems to perform less well on these more demanding texts but works better on the Bible. The conclusion was that ARI and LIX work on difficult and average to read texts in both English and Swedish and that CLI only works on accessible texts in both languages.
This work focuses solely on traditional measures of readability. These measures are the most widely used, are easy to compute, and there is a lack of adapted formulas for non-English languages. Other approaches, such as classification models using new features provided by natural language processing [3, 4], or the more recent use of word embeddings [1, 7], are out of scope.
3 Readability Comparison in EN-PT Parallel Corpora
We use multiple parallel corpora in English and Portuguese obtained from the OPUS website [13, 14], a collection of translated texts from the web. To cover different topics and different levels of readability, we analyze different linguistic corpora within the OPUS collection. Overall, we analysed 10 parallel corpora: PHP (PHP programming language documentation), Wikipedia (parallel sentences extracted from Wikipedia), ECB (documentation from the European Central Bank), Europarl (translated texts obtained from the European Parliament website), OpenSubtitles (movie and TV series subtitles in multiple languages), TED2013 (TED talks subtitles), EUconst (a parallel corpus collected from the European Constitution), ParaCrawl (parallel corpora from web crawls), News-Commentary11 (news commentaries), and GlobalVoices (news from the Global Voices website). For each parallel corpus, we analyze a TMX file (Translation Memory eXchange, an XML specification for the exchange of translation data). For each TMX file, we calculate the readability of 10 randomly selected excerpts, where each excerpt is composed of 100 translation units. We used an open-source Java library to calculate the readability of the excerpts.
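The extraction step can be sketched as follows. This is a minimal stdlib sketch, not the pipeline the authors used: it pulls aligned `en`/`pt` segments out of a TMX string and draws random excerpts of consecutive translation units.

```python
import random
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def parse_tmx(tmx_text):
    """Extract (en, pt) segment pairs from a TMX document string.
    TMX marks the language on <tuv> via xml:lang (older files use lang)."""
    pairs = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segs = {(tuv.get(XML_LANG) or tuv.get("lang") or "").lower()[:2]:
                tuv.findtext("seg", "") for tuv in tu.iter("tuv")}
        if "en" in segs and "pt" in segs:
            pairs.append((segs["en"], segs["pt"]))
    return pairs

def sample_excerpts(pairs, n_excerpts=10, units_per_excerpt=100, seed=0):
    """Randomly pick excerpts of consecutive translation units."""
    rng = random.Random(seed)
    starts = [rng.randrange(0, max(1, len(pairs) - units_per_excerpt + 1))
              for _ in range(n_excerpts)]
    return [pairs[s:s + units_per_excerpt] for s in starts]
```

Each sampled excerpt is then scored in both languages, giving the paired observations analysed next.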
To analyze the differences between the scores obtained for the two languages, we performed a paired-samples Wilcoxon test for each readability formula. We used the non-parametric Wilcoxon test because the Shapiro-Wilk test showed that the distribution of the data differs significantly from the normal distribution. The results of this test can be found in Table 2. It can be verified that the ARI and Coleman-Liau metrics show smaller differences than the other readability metrics. The Coleman-Liau metric does not show significant differences between the two languages (p-value > 0.05). The reason for this discrepancy between the metrics seems to lie in the inclusion/exclusion of the number of syllables per word and of complex words (words with 3 or more syllables) in the respective formulas. In Table 1, we see that only the ARI and Coleman-Liau metrics use the number of characters per word instead of the number of syllables per word or complex words. Figure 1 shows the readability distribution for all metrics in both languages. Only the ARI and Coleman-Liau metrics maintain similar scores across languages, unlike the other metrics.
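For reference, the paired Wilcoxon signed-rank statistic used above can be computed as follows. This is a bare-bones sketch (statistic only, no p-value; zero differences dropped, tied absolute differences get average ranks); in practice one would use a statistics package such as `scipy.stats.wilcoxon`.

```python
def wilcoxon_signed_rank(x, y):
    """Paired Wilcoxon signed-rank statistic W = min(W+, W-)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        avg = (i + j) / 2 + 1          # average rank for tied |differences|
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)
```

Here `x` and `y` would hold, for one formula, the grade scores of the same excerpts in English and Portuguese.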
This analysis shows that the existing readability metrics, initially formulated for the English language, need changes to be used on Portuguese texts, especially those that use the number of syllables or the number of complex words. A simple method would be to add a constant to the original formulas: the mean difference between the two languages' formula scores found in the parallel corpora. However, in the next section, we present another approach using Portuguese school books.
4 Readability Assessment of Portuguese School Books
We analyse linguistic features in a set of Portuguese school books from elementary through high school (grades 1–12). The books cover Portuguese language learning, study of the environment, history, biology, geology, physics, and chemistry courses from a well-known Portuguese publisher of school books. A total of 65 books were analyzed. Each page of a book is in XHTML format, so we parsed it to clean the text. Finally, we used the previously mentioned Java library to parse the texts and extract the related readability parameters.
The differences found in the parallel corpora point to a difference in the average number of syllables per word between the two languages. The traditional readability formulas define a complex word as a word with 3 or more syllables. We performed the Kendall correlation test between grade level and different definitions of complex word and found that the number of complex words per text with 4 or more syllables is more correlated with grade level (r = 0.347 for words with 4 or more syllables versus r = 0.310 for words with 3 or more syllables). For the Portuguese language, given the higher number of syllables per word in comparison with English, it seems more appropriate to consider a word difficult if it has 4 or more syllables.
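The correlation above can be illustrated with a minimal Kendall tau. This sketch computes tau-a (no tie correction); with heavily tied data such as grade levels, the tie-corrected tau-b of `scipy.stats.kendalltau` is what one would use in practice.

```python
def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Here `x` would be the grade level of each book and `y` its count of complex words under a given syllable threshold; the threshold with the higher tau is the better-aligned definition.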
We performed a multiple linear regression using the parameters of the original English readability formulas. For each original formula, we adjust it to the Portuguese language using the corresponding parameters. Based on the earlier finding about complex words, the SMOG and Gunning Fog measures for the Portuguese language consider a complex word to be a word with 4 or more syllables. We averaged the parameters used in the traditional formulas for each grade. We did this because we found a large variance in the texts of a school year, and a linear regression using the raw features of the traditional formulas leads to poor results. Only the use of more complex features provided by natural language processing and machine learning could lead to better performance [5], and, as already mentioned, these approaches are out of the scope of this study. The final formulas for the Portuguese language are presented in Table 3. We apply these formulas to each year of schooling; the results are shown in Fig. 2.
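The refitting step can be sketched as ordinary least squares on per-grade averages: each formula's predictors (e.g. words per sentence and syllables per word for Flesch-Kincaid) are regressed against the grade, yielding new coefficients. This is an illustrative stdlib implementation via the normal equations, not the authors' actual tooling; `numpy.linalg.lstsq` or a statistics package would be the usual choice.

```python
def fit_linear(X, y):
    """Ordinary least squares: solve (A^T A) b = A^T y with an intercept
    column, by Gaussian elimination with partial pivoting."""
    A = [[1.0] + list(row) for row in X]            # prepend intercept
    m = len(A[0])
    M = [[sum(A[r][i] * A[r][j] for r in range(len(A))) for j in range(m)]
         for i in range(m)]                          # A^T A
    v = [sum(A[r][i] * y[r] for r in range(len(A))) for i in range(m)]  # A^T y
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m):
                M[r][c] -= f * M[col][c]
            v[r] -= f * v[col]
    b = [0.0] * m
    for i in reversed(range(m)):
        b[i] = (v[i] - sum(M[i][j] * b[j] for j in range(i + 1, m))) / M[i][i]
    return b  # [intercept, coef_1, coef_2, ...]
```

With `X` holding one row of averaged formula parameters per grade and `y` the grades 1-12, the returned coefficients play the role of the adapted constants in Table 3.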
5 Conclusions
In this work, we adjusted the traditional readability metrics, formulated for English, to Portuguese. First, we analyzed the grade score differences between the two languages using ten parallel corpora. The Portuguese language has, on average, a greater number of syllables per word. However, the differences are not as significant in the number of letters per word, since the ARI and Coleman-Liau metrics do not differ as much between the two languages.
Using 65 Portuguese school books, we found that, in the Portuguese language, defining a complex word as a word with 4 or more syllables, instead of 3 or more, correlates better with readability. For each traditional English formula, we performed a multiple linear regression with the corresponding parameters, leading to a new formula adjusted to the Portuguese language.
References
Cha, M., Gwon, Y., Kung, H.T.: Language modeling by clustering with word embeddings for text readability assessment. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, pp. 2003–2006. ACM, New York (2017)
Coleman, M., Liau, T.L.: A computer readability formula designed for machine scoring. J. Appl. Psychol. 60, 283–284 (1975)
Collins-Thompson, K.: Computational assessment of text readability: a survey of current and future research. ITL - Int. J. Appl. Linguist 165(2), 97–135 (2015)
Feng, L., Jansche, M., Huenerfauth, M., Elhadad, N.: A comparison of features for automatic readability assessment. In: COLING 2010 Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 276–284. Association for Computational Linguistics, Stroudsburg (2010)
François, T., Miltsakaki, E.: Do NLP and machine learning improve traditional readability formulas? In: Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, PITR 2012, pp. 49–57. Association for Computational Linguistics, Stroudsburg (2012)
Gunning, R.: The Technique of Clear Writing. McGraw-Hill, New York (1952)
Jiang, Z., Gu, Q., Yin, Y., Chen, D.: Enriching word embeddings with domain knowledge for readability assessment. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 366–378. Association for Computational Linguistics, Santa Fe (2018)
Kincaid, J.: Derivation of new readability formulas: (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel. Research Branch report, Chief of Naval Technical Training, Naval Air Station Memphis (1975)
Kolahi, S., Shirvani, E.: A comparative study of the readability of English textbooks of translation and their Persian translations. Int. J. Linguist. 4, 344–366 (2012)
Martins, T.B.F., Ghiraldelo, C.M., Nunes, M.D.G.V., Oliveira Junior, O.N.D.: Readability Formulas Applied to Textbooks in Brazilian Portuguese (1996)
McLaughlin, H.G.: SMOG grading - a new readability formula. J. Read. 12(8), 639–646 (1969)
Smith, E.A., Senter, R.: Automated readability index. In: AMRL-TR. Aerospace Medical Research Laboratories, pp. 1–14 (1967)
Tiedemann, J.: News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In: Nicolov, N., Bontcheva, K., Angelova, G., Mitkov, R. (eds.) Recent Advances in Natural Language Processing, vol. V, pp. 237–248. John Benjamins, Amsterdam (2009)
Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Calzolari, N., et al. (eds.) Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012). European Language Resources Association (ELRA), Istanbul, Turkey, May 2012
Tillman, R., Hagberg, L.: Readability algorithms compatibility on multiple languages (2014)
Acknowledgment
This work is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia within the project: UID/EEA/50014/2019. We would also like to thank the Master in Informatics and Computing Engineering of the Faculty of Engineering of the University of Porto for supporting the registration and travel costs.
© 2019 Springer Nature Switzerland AG
Antunes, H., Lopes, C.T. (2019). Analyzing the Adequacy of Readability Indicators to a Non-English Language. In: Crestani, F., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2019. Lecture Notes in Computer Science, vol. 11696. Springer, Cham. https://doi.org/10.1007/978-3-030-28577-7_10