Analyzing the Adequacy of Readability Indicators to a Non-English Language

Antunes, Hélder; Lopes, Carla Teixeira

doi:10.1007/978-3-030-28577-7_10

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11696)

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages


Abstract

Readability is a linguistic feature that indicates how difficult a text is to read. Traditional readability formulas were designed for the English language. This study evaluates their adequacy for the Portuguese language. We applied the traditional formulas to 10 parallel corpora. We verified that Portuguese texts obtained higher grade scores (lower readability) in the formulas that use the number of syllables per word or the number of complex words per sentence. Formulas that use letters per word instead of syllables per word output similar grade scores. Considering this, we evaluated the correlation of complex words with grade level in 65 Portuguese school books covering 12 schooling years. We found that defining a complex word as a word with 4 or more syllables, instead of 3 or more syllables as originally used in traditional formulas applied to English texts, correlates better with the grade of Portuguese school books. Finally, we adapted each traditional readability formula to the Portuguese language by performing a multiple linear regression on the same dataset of school books.



1 Introduction

Readability refers to how difficult a particular text is to read. Its automatic assessment is an active research topic. It is essential for efficient learning (students should read texts appropriate to their level of education) and for evaluating automatic text simplification methods, among other uses. One way to evaluate readability is through formulas that consider lexical difficulty (word difficulty, assessed, for example, by the average number of syllables per word) and syntactic difficulty (sentence length, assessed, for example, by the average number of words per sentence). The classic readability formulas were developed for English, and there are no equivalent formulas for the Portuguese language. Any adaptation of these formulas to Portuguese has to take into account the main differences between the two languages. For example, the number of syllables or characters per word is, on average, higher in Portuguese.

This research is divided into two phases. In the first phase, we use English-Portuguese parallel corpora to compare the application of traditional formulas in the two languages and evaluate the main linguistic differences between them. In the second phase, we discuss how the differences found can be applied to Portuguese readability assessment, using a set of 65 Portuguese school books. Finally, we propose new readability formulas for the Portuguese language, each adapted from an English readability formula. In both phases, we consider the five traditional readability formulas presented in Table 1. All the formulas give the grade level required to understand a text.

Table 1. Traditional readability formulas used.
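
As a rough illustration of what such grade-level formulas compute, the sketch below implements the commonly published English versions of five formulas: ARI, Coleman-Liau, Flesch-Kincaid, Gunning Fog and SMOG. The choice of these five is an assumption inferred from the reference list [2, 6, 8, 11, 12], and the coefficients are the widely cited ones, not values copied from Table 1. The function name `traditional_scores` and its parameter names are hypothetical.

```python
import math


def traditional_scores(letters, words, sentences, syllables, complex_words):
    """Grade-level estimates from raw counts of one text.

    Assumption: coefficients are the commonly published English versions.
    `complex_words` counts words with 3 or more syllables (English usage).
    """
    words_per_sentence = words / sentences
    return {
        # Automated Readability Index: characters per word, words per sentence
        "ARI": 4.71 * (letters / words) + 0.5 * words_per_sentence - 21.43,
        # Coleman-Liau: letters and sentences per 100 words
        "Coleman-Liau": 0.0588 * (100 * letters / words)
        - 0.296 * (100 * sentences / words)
        - 15.8,
        # Flesch-Kincaid grade level: syllables per word
        "Flesch-Kincaid": 0.39 * words_per_sentence
        + 11.8 * (syllables / words)
        - 15.59,
        # Gunning Fog: percentage of complex words
        "Gunning Fog": 0.4 * (words_per_sentence + 100 * complex_words / words),
        # SMOG: complex words normalised to 30 sentences
        "SMOG": 1.0430 * math.sqrt(complex_words * 30 / sentences) + 3.1291,
    }


if __name__ == "__main__":
    # Made-up counts for a 100-word, 5-sentence text
    for name, score in traditional_scores(450, 100, 5, 150, 10).items():
        print(f"{name}: {score:.2f}")
```

Only ARI and Coleman-Liau depend on character counts; the other three depend on syllable counts or complex words, which is the distinction the comparison between the two languages relies on.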


2 Background and Related Work

The use of traditional readability formulas in other languages is not new. They have already been applied to school texts in Brazilian Portuguese [10]. Martins et al. introduced a change of 42 points in the Flesch reading ease test to account for the higher number of syllables in Portuguese words compared with English words. The authors found that the adaptation of the Flesch formula score (a 42-point decrease in readability) was more pronounced in the texts of the elementary school years.

A study carried out in 2012 [9] compared the readability of five English textbooks on translation with that of their Persian translations. The readability formulas used were the Gunning Fog Index (GFI) and the Flesch New Reading Ease formula. Text samples were randomly chosen from each original book. The results showed that the texts translated into Persian were less readable than the original English texts.

In 2014, a similar study compared readability between Swedish and English [15]. Three algorithms were used: Coleman-Liau Index (CLI), Läsbarhetsindex (LIX) and Automated Readability Index (ARI). The texts used were a collection of Wikipedia articles, “On the Origin of Species” by Charles Darwin, and the Bible, together with their respective translations. The tests showed that both ARI and LIX work for both Swedish and English on less readable texts. CLI, however, seems to perform less well on these more demanding texts but works better on the Bible. The conclusion was that ARI and LIX work on texts of difficult and average readability in both English and Swedish, and that CLI only works on accessible texts in both languages.

This work focuses solely on traditional measures of readability. These measures are the most widely used and are easy to compute, yet formulas adapted to non-English languages are lacking. Other approaches, such as classification models using new features provided by natural language processing [3, 4] or the more recent use of word embeddings [1, 7], are outside the scope of this study.

3 Readability Comparison in EN-PT Parallel Corpora

We use multiple parallel corpora in English and Portuguese obtained from the OPUS website [13, 14], a collection of translated texts from the web. To cover different topics and different levels of readability, we analyze different linguistic corpora within the OPUS collection. Overall, we analyzed 10 parallel corpora: PHP (PHP programming language documentation), Wikipedia (parallel sentences extracted from Wikipedia), ECB (documentation from the European Central Bank), Europarl (translated texts obtained from the European Parliament website), OpenSubtitles (movie and TV series subtitles in multiple languages), TED2013 (TED talks subtitles), EUconst (a parallel corpus collected from the European Constitution), ParaCrawl (parallel corpora from web crawls), News-Commentary11 (news commentaries), and GlobalVoices (news from the Global Voices website). For each parallel corpus, we analyze a TMX file (Translation Memory eXchange, an XML specification for the exchange of translation data). For each TMX file, we calculate the readability of 10 randomly selected excerpts, where each excerpt is composed of 100 translation units. We used an open source Java library to calculate the readability of the excerpts.
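
The sampling step can be sketched in Python as follows. This is not the Java pipeline used in the study; it only illustrates reading English-Portuguese translation units from a TMX document and drawing random excerpts. The function names are hypothetical, and treating an excerpt as a contiguous run of translation units (with excerpts allowed to overlap) is an assumption, since the text does not say how the 100 units of each excerpt were chosen.

```python
import random
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def units_from_root(root, src="en", tgt="pt"):
    """Return (src_text, tgt_text) pairs for every <tu> holding both languages."""
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Keep only the primary language subtag, e.g. "pt-PT" -> "pt"
                segments[lang[:2]] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


def read_tmx(path, src="en", tgt="pt"):
    """Parse a TMX file from disk and return its translation-unit pairs."""
    return units_from_root(ET.parse(path).getroot(), src, tgt)


def sample_excerpts(pairs, n_excerpts=10, units_per_excerpt=100, seed=0):
    """Draw excerpts, each a contiguous run of translation units (assumption)."""
    max_start = len(pairs) - units_per_excerpt
    if max_start < 0:
        raise ValueError("corpus has fewer units than one excerpt")
    rng = random.Random(seed)
    starts = [rng.randint(0, max_start) for _ in range(n_excerpts)]
    return [pairs[s:s + units_per_excerpt] for s in starts]
```

Each excerpt keeps the English and Portuguese sides aligned, so the same formula can be scored on both sides and the two scores treated as a paired observation.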

To analyze the differences between the scores obtained for the two languages, we performed a paired-samples Wilcoxon test for each readability formula. We used the non-parametric Wilcoxon test because the Shapiro-Wilk test showed that the distribution of the data differs significantly from the normal distribution. The results of this test can be found in Table 2. The ARI and Coleman-Liau metrics show smaller differences than the other readability metrics, and the Coleman-Liau metric does not show significant differences between the two languages (p-value > 0.05). The reason for this discrepancy between the metrics seems to lie in whether the respective formulas include the number of syllables per word or the number of complex words (words with 3 or more syllables). In Table 1, we see that only the ARI and Coleman-Liau metrics use the number of characters per word instead of the number of syllables per word or the number of complex words. Figure 1 shows the readability distribution for all metrics in both languages. Only the ARI and Coleman-Liau metrics maintain similar scores across languages.
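
For readers unfamiliar with the paired test, the following is a minimal pure-Python sketch of a two-sided Wilcoxon signed-rank test using the large-sample normal approximation. It is an illustration only, not the procedure or software used in the study: it applies no tie or continuity correction, and the function name is hypothetical. In practice a library routine such as `scipy.stats.wilcoxon` would be the safer choice.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Returns (w_plus, w_minus, p_value). Zero differences are dropped and
    tied absolute differences share their average rank. No tie or
    continuity correction is applied to the variance (simplification).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within tied groups
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Under the null hypothesis W is approximately normal for large n
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w_plus, w_minus, p_value
```

Here `x` and `y` would hold the scores of one formula on the English and Portuguese sides of the same excerpts; a p-value above 0.05 corresponds to the "no significant difference" outcome reported for Coleman-Liau.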

Table 2. Paired samples Wilcoxon test between English and Portuguese texts.


This analysis shows that the existing readability metrics, initially formulated for the English language, need changes before they can be used on Portuguese texts, especially those that use the number of syllables or the number of complex words. A simple method would be to add a constant to the original formulas, equal to the mean difference between the formula scores of the two languages in the parallel corpora. However, in the next section, we present another approach using Portuguese school books.
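
The constant-offset idea can be sketched in a few lines. The sign convention here (offset = mean of English score minus Portuguese score, then added to the Portuguese score) is an assumption chosen so that the correction brings Portuguese scores back towards the English scale; the function names and the example scores are made up for illustration.

```python
from statistics import mean


def offset_from_parallel_scores(en_scores, pt_scores):
    """Mean paired difference (English minus Portuguese) for one formula."""
    return mean(en - pt for en, pt in zip(en_scores, pt_scores))


def adjusted_score(original_score, offset):
    """Apply the constant correction to a score computed on Portuguese text."""
    return original_score + offset


if __name__ == "__main__":
    # Hypothetical scores of the same excerpts in both languages
    english = [8, 10, 12]
    portuguese = [11, 13, 15]
    offset = offset_from_parallel_scores(english, portuguese)
    print(offset, adjusted_score(13, offset))
```

A single constant shifts every score by the same amount, so it cannot change how the formula weighs word length against sentence length, which is one reason to prefer refitting the coefficients.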

4 Readability Assessment of Portuguese School Books

We analyze linguistic features in a set of Portuguese school books from elementary through high school (grades 1–12). The books, from a well-known Portuguese publisher of school books, cover the following courses: Portuguese as a native language, study of the environment, history, biology, geology, physics and chemistry. A total of 65 books were analyzed. Each page of a book is in XHTML format, so we parsed it to obtain clean text. Finally, we used the previously mentioned Java library to parse the texts and extract the related readability parameters.
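
The cleaning step might look like the sketch below, which strips markup from an XHTML page with Python's standard `html.parser` module. The study's actual cleaning rules are not described, so which elements are skipped (`head`, `script`, `style`) and which are treated as block boundaries are assumptions; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping non-content elements (assumed set)."""

    SKIP = {"head", "script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "td", "h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # keep words of adjacent blocks apart

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def xhtml_to_text(markup):
    """Return the page text with tags removed and whitespace normalised."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Inline elements such as `<b>` add no separator, so punctuation that follows them stays attached to the preceding word, which matters when sentences and words are counted afterwards.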

The differences found in the parallel corpora point to a difference in the average number of syllables per word between the two languages. The traditional readability formulas define complex words as words with 3 or more syllables. We performed the Kendall correlation test between grade level and the number of complex words per text under different definitions of complex word, and found that the count of words with 4 or more syllables is more strongly correlated with grade level (r = 0.347 for words with 4 or more syllables versus r = 0.310 for words with 3 or more syllables). For the Portuguese language, given its higher number of syllables per word in comparison with English, it seems more appropriate to consider a word difficult if it has 4 or more syllables.
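
The comparison of thresholds can be illustrated with a small sketch: count complex words under a given syllable threshold, then correlate the counts with grade level. The Kendall variant used in the study is not stated; the tau-b variant below is an assumption, chosen because grade levels contain many ties. Syllable counts are taken as given, since Portuguese syllabification is outside the scope of this sketch, and both function names are hypothetical.

```python
import math
from itertools import combinations


def complex_word_count(syllable_counts, min_syllables=4):
    """Number of words with at least `min_syllables` syllables."""
    return sum(1 for s in syllable_counts if s >= min_syllables)


def kendall_tau_b(x, y):
    """Kendall rank correlation with the tau-b correction for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denominator = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denominator == 0:
        raise ValueError("one of the variables is constant")
    return (concordant - discordant) / denominator
```

Running `kendall_tau_b(grades, counts)` once with counts computed at `min_syllables=3` and once at `min_syllables=4` reproduces the kind of comparison described above, with the higher coefficient indicating the better-suited definition.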

We performed a multiple linear regression using the parameters of the original English readability formulas. We adjust each original formula to the Portuguese language using its corresponding parameters. Based on the earlier finding about complex words, the SMOG and Gunning Fog measures for the Portuguese language treat a word as complex if it has 4 or more syllables. We averaged the parameters used in the traditional formulas for each grade. We did this because we found a large variance among the texts of a single school year, and a linear regression on the simple features of the traditional formulas leads to poor results. Only the use of more complex features provided by natural language processing and machine learning could lead to better performance [5], and, as already mentioned, these approaches are outside the scope of this study. The final formulas for the Portuguese language are presented in Table 3. We apply these formulas to each year of schooling; the results are shown in Fig. 2.
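
The two steps described above, averaging the formula parameters per grade and then refitting the coefficients, can be sketched as follows. This is an ordinary least-squares fit solved through the normal equations in pure Python; the study's actual tooling is not stated, and the function names and data layout are hypothetical.

```python
from collections import defaultdict


def average_by_grade(samples):
    """Average feature vectors per grade; samples are (grade, features)."""
    grouped = defaultdict(list)
    for grade, features in samples:
        grouped[grade].append(features)
    grades = sorted(grouped)
    means = [[sum(col) / len(col) for col in zip(*grouped[g])] for g in grades]
    return grades, means


def fit_ols(features, targets):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    rows = [[1.0] + [float(v) for v in row] for row in features]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; no unique solution")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for c in range(col, k):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(a[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (b[i] - tail) / a[i][i]
    return beta
```

For a Flesch-Kincaid style formula, for example, each feature vector would hold words per sentence and syllables per word, and `fit_ols(means, grades)` would return an intercept and two coefficients playing the role of the constants in the English formula.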

Table 3. Adjusted Portuguese formulas.


5 Conclusions

In this work, we adjust the traditional readability metrics, formulated for English, to Portuguese. First, we analyze the grade score differences between the two languages using ten parallel corpora. The Portuguese language has, on average, a greater number of syllables per word. However, the difference is less pronounced in the number of letters per word, since the ARI and Coleman-Liau metrics do not differ as much between the two languages.

Using 65 Portuguese school books, we found that, in the Portuguese language, defining a complex word as a word with 4 or more syllables, instead of 3 or more, correlates better with readability. For each traditional English formula, we performed a multiple linear regression with the same corresponding parameters, leading to a new formula adjusted to the Portuguese language.


References

1. Cha, M., Gwon, Y., Kung, H.T.: Language modeling by clustering with word embeddings for text readability assessment. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, pp. 2003–2006. ACM, New York (2017)
2. Coleman, M., Liau, T.L.: A computer readability formula designed for machine scoring. J. Appl. Psychol. 60, 283–284 (1975)
3. Collins-Thompson, K.: Computational assessment of text readability: a survey of current and future research. ITL - Int. J. Appl. Linguist. 165(2), 97–135 (2015)
4. Feng, L., Jansche, M., Huenerfauth, M., Elhadad, N.: A comparison of features for automatic readability assessment. In: COLING 2010 Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 276–284. Association for Computational Linguistics, Stroudsburg (2010)
5. François, T., Miltsakaki, E.: Do NLP and machine learning improve traditional readability formulas? In: Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations, PITR 2012, pp. 49–57. Association for Computational Linguistics, Stroudsburg (2012)
6. Gunning, R.: The Technique of Clear Writing. McGraw-Hill, New York (1952)
7. Jiang, Z., Gu, Q., Yin, Y., Chen, D.: Enriching word embeddings with domain knowledge for readability assessment. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 366–378. Association for Computational Linguistics, Santa Fe (2018)
8. Kincaid, J.: Derivation of new readability formulas: (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel. Research Branch report, Chief of Naval Technical Training, Naval Air Station Memphis (1975)
9. Kolahi, S., Shirvani, E.: A comparative study of the readability of English textbooks of translation and their Persian translations. Int. J. Linguist. 4, 344–366 (2012)
10. Martins, T.B.F., Ghiraldelo, C.M., Nunes, M.D.G.V., Oliveira Junior, O.N.D.: Readability Formulas Applied to Textbooks in Brazilian Portuguese (1996)
11. McLaughlin, H.G.: SMOG grading - a new readability formula. J. Read. 12(8), 639–646 (1969)
12. Smith, E.A., Senter, R.: Automated readability index. In: AMRL-TR. Aerospace Medical Research Laboratories, pp. 1–14 (1967)
13. Tiedemann, J.: News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In: Nicolov, N., Bontcheva, K., Angelova, G., Mitkov, R. (eds.) Recent Advances in Natural Language Processing, vol. V, pp. 237–248. John Benjamins, Amsterdam (2009)
14. Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Calzolari, N., et al. (eds.) Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012). European Language Resources Association (ELRA), Istanbul, Turkey, May 2012
15. Tillman, R., Hagberg, L.: Readability algorithms compability on multiple languages (2014)


Acknowledgment

This work is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia within the project: UID/EEA/50014/2019. We would also like to thank the Master in Informatics and Computing Engineering of the Faculty of Engineering of the University of Porto for supporting the registration and travel costs.

Author information

Authors and Affiliations

Faculdade de Engenharia da Universidade do Porto, Porto, Portugal
Hélder Antunes & Carla Teixeira Lopes
INESC TEC, Porto, Portugal
Carla Teixeira Lopes


Corresponding authors

Correspondence to Hélder Antunes or Carla Teixeira Lopes.

Editor information

Editors and Affiliations

Universita della Svizzera Italiana, Lugano, Switzerland
Fabio Crestani
Zurich University of Applied Sciences, Winterthur, Switzerland
Martin Braschler
University of Neuchâtel, Neuchâtel, Switzerland
Jacques Savoy
Technische Universität Wien, Vienna, Austria
Andreas Rauber
HES-SO Valais-Wallis, Sierre, Switzerland
Henning Müller
University of Santiago de Compostela, Santiago de Compostela, Spain
David E. Losada
Swiss Alliance for Data-Intensive Services, Thun, Switzerland
Gundula Heinatz Bürki
University of Padua, Padua, Italy
Linda Cappellato
University of Padua, Padua, Italy
Nicola Ferro


About this paper

Cite this paper

Antunes, H., Lopes, C.T. (2019). Analyzing the Adequacy of Readability Indicators to a Non-English Language. In: Crestani, F., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2019. Lecture Notes in Computer Science, vol 11696. Springer, Cham. https://doi.org/10.1007/978-3-030-28577-7_10


DOI: https://doi.org/10.1007/978-3-030-28577-7_10
Published: 03 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-28576-0
Online ISBN: 978-3-030-28577-7
eBook Packages: Computer Science, Computer Science (R0)
