Using Morphological and Semantic Features for the Quality Assessment of Russian Wikipedia

Lewoniewski, Włodzimierz; Khairova, Nina; Węcel, Krzysztof; Stratiienko, Nataliia; Abramowicz, Witold

doi:10.1007/978-3-319-67642-5_46


Part of the book series: Communications in Computer and Information Science (CCIS, volume 756)

Included in the following conference series:

International Conference on Information and Software Technologies

Abstract

The assessment of the quality and credibility of Wikipedia articles is becoming increasingly important. We propose using morphological and semantic features to estimate the quality of Wikipedia articles in the Russian language. We distinguished over 150 linguistic features and divided them into four groups. In these groups, we considered features of the encyclopedic style, readability and subjectivism of the article’s text. Using Random Forest as the classification algorithm, we show the most important linguistic features that affect the quality of Russian Wikipedia articles. We also compare the classification results of the four feature groups separately. We achieved an F-measure of 89.75%.


1 Introduction

Wikipedia is currently the largest public general-purpose encyclopedia with free content; it includes over 44 million articles. Most Wikipedia articles are comparable in quality to those in the Encyclopedia Britannica [1]. Usually, a Wikipedia article must be revised many times by the Wikipedia community before it reaches good quality. This is the main reason for the growing interest in research on assessing the quality of Wikipedia articles.

In 2006, during the opening plenary at Wikimania, Jimmy Wales suggested concentrating on the quality of articles rather than on their number [9]. The best Wikipedia articles must follow specific style guidelines. Such guidelines can be quantified in many ways. One approach is to use morphological, syntactic and semantic features of words, which make it possible to evaluate the quality of Wikipedia articles. These features strongly depend on the specific language.

As of April 2017, the Russian-language edition of Wikipedia had more than 1.3 million articles^{Footnote 1} and more than 1 billion page views per month^{Footnote 2}. The Russian Wikipedia subdomain (ru.wikipedia.org) receives approximately 8% of Wikipedia’s cumulative traffic and takes second place after the English subdomain (59%, en.wikipedia.org).^{Footnote 3}

Many studies examine the correlation between linguistic characteristics of English and the quality of articles in English Wikipedia. However, very few studies examine the use of Russian linguistic characteristics to evaluate the quality of texts.

In this paper we focus on using morphological and semantic features of the Russian language to estimate the quality of Russian Wikipedia articles. We suggest applying the Random Forests algorithm to these features in order to automatically identify the quality classes of Wikipedia articles.

2 Related Work

Experts agree that determining the quality of Wikipedia articles is difficult. Furthermore, Wikipedia is not a static resource; the number of articles keeps growing every day. The fact that the articles cover different topics also complicates the task [11]: experts from different disciplines are needed to judge the quality, but such experts are not always available.

Measuring an article’s quality in Wikipedia is not an easy task for human users, and the task becomes considerably more complex when the evaluation must be automatic. A substantial number of studies now address the automatic estimation of the quality of Wikipedia articles. We divide this literature into three groups. The first group relies on characteristics related to contributors’ reputations and the edit network, article status, external factual support and other features [4, 5, 17]. However, such methods often require complex calculations and do not analyze the content of the article itself.

The second group of studies focuses on measuring the volume of different article components. These studies showed that better-quality articles are usually longer, have more images and sections, and use a larger number of references [8, 14, 15]. These quantitative features are used in the online service WikiRank^{Footnote 4} for the automatic relative assessment of articles in various language versions of Wikipedia. Some Wikipedia articles also carry special quality flaw templates, which can help in assessing articles [3].

The third group of studies addresses the automatic estimation of quality based on linguistic characteristics of the text of Wikipedia articles [2, 6]. Other studies used linguistic features to examine how the density of factual information affects the quality of Wikipedia articles [13, 18]. Approaches that explore the linguistic characteristics of articles might be useful for improving article quality. Such characteristics include the writing style of an article, the number of verbs, the number of facts, the number of distinct nouns and similar features. However, the linguistic characteristics of a text depend on the article’s language. Wikipedia currently contains articles in approximately 300 languages, and Russian is one of its main language versions. Many studies use linguistic characteristics to estimate the quality of Wikipedia articles in English or Spanish, but very few use the specific properties of the Russian language [10].

This is the first study that uses more than 150 features related to the Russian language to predict article quality in Wikipedia. To tokenize the texts of Russian Wikipedia articles and extract various linguistic features, we use our own approach, which relies on different open morphological libraries and dictionaries available on the Web. We also add rules to this algorithm at the text preparation stage, as well as during the extraction of some features.

3 Description of the Experiment

The best Wikipedia articles must be well-written, comprehensive, well-researched and neutral, and must follow specific style guidelines.^{Footnote 5} The main idea of our approach is that the linguistic features of the words and sentences of an article make it possible to evaluate the writing style, brevity, correctness, readability and some other characteristics of Wikipedia articles. In some cases, semantic and syntactic features of the words even make it possible to evaluate the subjectivity of the article authors.

3.1 Linguistic Features

We distinguish several groups of linguistic features that can affect the quality of Russian Wikipedia articles. The first group includes morphological features such as parts of speech and the specific morphological characteristics of a particular part of speech. For instance, we determine the number of verbs and then the number of occurrences of verb categories such as tense and person. In total, we use more than 50 such characteristics. To analyze the morphological features, we apply pymorphy2^{Footnote 6}, a library for morphological analysis of the Russian language based on the OpenCorpora dictionary^{Footnote 7}, whose notation we also use to denote grammatical tags (some of them are presented in Table 1).
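As an illustration of how such morphological counts could be gathered, the following minimal sketch tallies OpenCorpora-style grammemes over a token list. The tiny hand-written lexicon and the helper names (`TOY_LEXICON`, `toy_analyze`, `count_grammemes`) are assumptions made for this example; they stand in for a real analyzer such as pymorphy2 and are not the authors' implementation.

```python
from collections import Counter

# Toy stand-in for a morphological analyzer: word -> OpenCorpora-style grammemes.
# The entries are hand-written for illustration only (hypothetical lexicon).
TOY_LEXICON = {
    "город": {"NOUN", "masc", "sing", "nomn"},
    "был": {"VERB", "past", "masc", "sing"},
    "основан": {"PRTS", "past", "masc", "sing"},
    "растёт": {"VERB", "pres", "3per", "sing"},
    "быстро": {"ADVB"},
}


def toy_analyze(word):
    """Return the grammeme set for a word, or an empty set if it is unknown."""
    return TOY_LEXICON.get(word.lower(), set())


def count_grammemes(tokens, analyze=toy_analyze):
    """Count how often each grammeme occurs across all tokens."""
    counts = Counter()
    for token in tokens:
        counts.update(analyze(token))
    return counts


tokens = ["Город", "был", "основан", "быстро", "растёт"]
counts = count_grammemes(tokens)
print(counts["VERB"], counts["past"], counts["3per"])
```

Passing a different `analyze` callable (for example, one that wraps a full morphological analyzer and returns the grammemes of its first parse) would leave the counting logic unchanged.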

Table 1. Description of some grammatical tags used in the study. Source: http://opencorpora.org/dict.php?act=gram


The second group of linguistic features includes some semantic features, integral morphological features of words and even parameters of word formation. We suppose that the features of the second group can explicitly express the presence of a subjective assessment or opinion of the Wikipedia article authors. Therefore, the presence of these characteristics in the text can affect the quality of the article.

Value judgments are typically expressed in the text by various linguistic means and characteristics. For example, morphological features such as first- and second-person personal and possessive pronouns can add evaluative and expressive shades to a statement. In addition, one of the main grammatical means of conveying the author’s subjectivity and expressiveness in Russian is the affectionate diminutive suffix.

Moreover, each natural language has a specific vocabulary that expresses emotions and attitudes and adds a tinge of the author’s opinion to a statement. We have created two special vocabularies that express such shades in Russian. The first vocabulary includes more than 300 words and word combinations (avt_ocenka). The second one includes only verbs that have a certain semantic component of subjectivity (menverb). It includes 120 speech verbs (such as tell, recall, dictate and others), 154 feelings verbs and 103 emotions verbs (such as wish, rejoice, worry and others) [13]. Additionally, in this group of features, we use the glossary of introductory phrases from the Russian National Corpus.^{Footnote 8}
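A vocabulary-based feature of this kind can be sketched as a simple lookup. In the sketch below, the two miniature vocabularies and the function `count_lexicon_hits` are hypothetical; the real avt_ocenka and menverb lists are much larger, and the matching strategy (longest phrase first, on already normalized tokens) is an assumption rather than the method described in the paper.

```python
# Hypothetical miniature vocabularies; the real lists are far longer.
AVT_OCENKA = {"к сожалению", "прекрасный", "ужасный"}       # evaluative words and phrases
MENVERB = {"сказать", "вспомнить", "желать", "радоваться"}  # speech, feeling and emotion verbs


def count_lexicon_hits(tokens, vocabulary, max_len=3):
    """Count non-overlapping vocabulary matches, preferring longer phrases."""
    hits, i = 0, 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]) in vocabulary:
                hits += 1
                i += n
                break
        else:
            # No phrase of any length matched at position i.
            i += 1
    return hits


# Tokens are assumed to be already normalized to the form used in the vocabularies.
tokens = ["к", "сожалению", "автор", "вспомнить", "прекрасный", "город"]
features = {
    "avt_ocenka": count_lexicon_hits(tokens, AVT_OCENKA),
    "menverb": count_lexicon_hits(tokens, MENVERB),
}
print(features)
```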

Table 2 shows our full list of the word features that can express some elements of subjective assessment of the Wikipedia article authors.

Table 2. Linguistic features of the words that can express some elements of subjective assessment of the Wikipedia article authors


The third group of linguistic features supports exploratory conclusions about the readability of the texts. This group includes both characteristics that are commonly used to assess text complexity and new characteristics based on dictionaries of the Russian National Corpus, the Russian Internet corpus I-RU [12] and OpenCorpora. Traditionally, readability estimation is based on features such as the average word length (in characters and in syllables), the sentence length, the maximum number of words in a sentence, the number of unique words (uslov) and some others [11].

In addition to the listed characteristics, we also use the following statistical indicators: the number of words with 3 or more syllables (slog3), with 4 or more syllables (slog4) and with 5 or more syllables (slog5), and the number of unique words of specific parts of speech (uverb, unoun, uadj).
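The syllable-based indicators can be approximated by counting vowel letters, since a Russian word has one syllable per vowel. The sketch below computes slog3, slog4, slog5 and uslov under two assumptions that the paper does not state: words are compared as lowercase surface forms, and the helper names `syllables` and `readability_counts` are introduced only for this example.

```python
RUSSIAN_VOWELS = set("аеёиоуыэюя")


def syllables(word):
    """Approximate the syllable count of a Russian word by counting its vowels."""
    return sum(ch in RUSSIAN_VOWELS for ch in word.lower())


def readability_counts(words):
    """Compute a few readability indicators for a list of word tokens."""
    return {
        "slog3": sum(syllables(w) >= 3 for w in words),
        "slog4": sum(syllables(w) >= 4 for w in words),
        "slog5": sum(syllables(w) >= 5 for w in words),
        # Assumption: uniqueness is decided on lowercase surface forms.
        "uslov": len({w.lower() for w in words}),
    }


words = ["Википедия", "это", "свободная", "энциклопедия", "и", "это", "хорошо"]
stats = readability_counts(words)
print(stats)
```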

Furthermore, we assume that the frequency with which words are used in texts correlates with their comprehensibility and readability. Therefore, the third group also includes features based on lists of the most frequent words in the Russian language. Table 3 shows the features that take these different lists into account.

Table 3. Features that take into account different lists of the most frequent words in the Russian language.

In total, the third group contains 40 linguistic features.

The fourth group of linguistic features characterizes the encyclopedic style of an article. An encyclopedia-style article should present a comprehensive view of the subject matter in a simple and understandable manner. In general, such a style means a condensed presentation of material that identifies the subject sufficiently, completely, naturally and authentically.

We argue that such a style can be represented explicitly by various linguistic means and characteristics in the text. This group includes proper names: the first name of a person (name), the last name (surn), the middle name (patr), the name of an organisation (orgn), a trademark (Trad) and toponyms (Geox). We also believe that the list of the most popular words of Russian Wikipedia can represent the encyclopedic style of an article (250wiki).

Additionally, the fourth group includes the numbers of simple and complex facts in the article. According to the logical-linguistic model of fact extraction from English [7] or Russian texts [13], a simple fact (fact) in a Russian sentence is the smallest grammatical clause that includes a verb and a noun; a complex fact (FactPlus1, FactPlus2) in Russian texts is a grammatical sentence that includes a verb and several nouns. Among these nouns, one has to play the semantic role of the Subject (FactPlus1) and another has to be the Object (FactPlus2)^{Footnote 9}.

3.2 Source Data

Our dataset includes all articles from Russian Wikipedia that have a manual evaluation of their quality, i.e. about 130,000 articles (April 2017). Following previous studies [14, 15], we distinguish two quality classes of Russian Wikipedia articles. The first class, GoodEnough, includes articles that the Wikipedia community has evaluated as Featured or Good. The second class, NeedsWork, includes level I, II, III and IV (stub) articles. A peculiarity of Russian Wikipedia is the additional quality grade Solid. In a binary classification, this grade can be assigned either to GoodEnough or to NeedsWork. To show the peculiarity of the articles evaluated as Solid, we consider three versions of the classification: the FG standard, the FGS standard and the FG-S standard.
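The mapping from community grades to the two classes can be sketched as follows. The text does not spell out which of the three standards corresponds to which treatment of Solid articles, so this sketch does not name the standards; instead it exposes a hypothetical `solid_policy` parameter with three possible treatments (count Solid as GoodEnough, count it as NeedsWork, or leave it out). The function and constant names are assumptions for illustration.

```python
GOOD_ENOUGH_GRADES = {"Featured", "Good"}
NEEDS_WORK_GRADES = {"I", "II", "III", "IV"}


def binary_label(grade, solid_policy):
    """Map a community grade to a binary class, or to None to drop the article.

    solid_policy is a hypothetical knob: "good", "needs_work" or "exclude".
    """
    if grade in GOOD_ENOUGH_GRADES:
        return "GoodEnough"
    if grade in NEEDS_WORK_GRADES:
        return "NeedsWork"
    if grade == "Solid":
        return {"good": "GoodEnough", "needs_work": "NeedsWork", "exclude": None}[solid_policy]
    raise ValueError(f"unknown grade: {grade!r}")


grades = ["Featured", "Solid", "III", "Good", "IV"]
for policy in ("good", "needs_work", "exclude"):
    labels = [binary_label(g, policy) for g in grades]
    # Articles mapped to None are left out of that version of the dataset.
    print(policy, [label for label in labels if label is not None])
```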

Table 4 shows the distributions of the analyzed articles according to the grade of assessment quality.

Table 4. The distributions of the analyzed articles according to the grade of assessment quality.


4 Implementation Aspects and Experimental Results

Our analysis has shown that articles with higher quality grades usually have higher values of a particular feature. Fig. 1 shows the distribution of some features among different quality grades in Russian Wikipedia. The Random Forests classifier determines the probability that an article belongs to one of the two classes. The classifier lets us use specific analytical methods to explore hidden patterns, rules and dependencies between different linguistic features. At the same time, it makes it possible to calculate the predictive power of individual features and of each group of linguistic features.

As mentioned above, better articles usually have more text (characters, words, sentences). We can therefore expect the values of most of the considered linguistic characteristics to be higher in articles of better quality. For this reason, we decided to normalize all features by word count, sentence count and character count (without spaces) separately. Fig. 2 shows the distribution of some features normalized by word count.
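The three normalizations amount to dividing each raw feature count by the corresponding length of the text. A minimal sketch, in which the function name `normalize` and the example numbers are made up for illustration:

```python
def normalize(raw_counts, text_stats):
    """Divide every raw feature count by each available text length measure.

    Returns a nested dict: {unit: {feature: count / unit_total}}.
    Units with a zero total are skipped to avoid division by zero.
    """
    return {
        unit: {name: value / total for name, value in raw_counts.items()}
        for unit, total in text_stats.items()
        if total > 0
    }


# Made-up counts for a single article.
raw = {"VERB": 30, "NOUN": 120}
lengths = {"words": 300, "sentences": 20, "chars": 1800}

normalized = normalize(raw, lengths)
print(normalized["words"]["VERB"])      # verbs per word
print(normalized["sentences"]["NOUN"])  # nouns per sentence
```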

Typically, the encyclopedic style of a Wikipedia article requires that the article begin with a brief definition or description of the subject, called the lead section, followed by a broad examination of the topic, called the 1st section, and then by a number of sub-sections. We evaluated precision, recall and F-measure for the three ways of normalization and for three analysed areas: the lead section, the 1st section and the whole article’s text.
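For reference, the three evaluation measures can be computed from true and predicted labels as in the sketch below. It scores a single positive class; whether the reported figures are per-class values or averages over both classes is not stated here, so that choice is an assumption, and the example labels are invented.

```python
def precision_recall_f1(y_true, y_pred, positive="GoodEnough"):
    """Compute precision, recall and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented labels: G = GoodEnough, N = NeedsWork.
G, N = "GoodEnough", "NeedsWork"
y_true = [G, G, G, N, N, N, N, N]
y_pred = [G, G, N, G, N, N, N, N]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```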

Table 5 shows that evaluating the linguistic parameters of the whole article is more informative than evaluating those of the lead section or the 1st section only. According to the table, there is little difference in F-measure between the various ways of normalization. We decided to normalize our features by the number of words, following research in corpus linguistics [16].

Table 5. Classification results using various types of normalisation and three versions of the classification standards.

The Random Forest classifier can show the importance of features in the model. It provides two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy. Table 6 shows the 30 most important features based on average impurity decrease. Table 7 shows the 30 most important features based on the number of nodes that use the attribute. Every feature is normalized by the number of words of the corpus class. As mentioned above, the linguistic parameters are computed over the whole article.
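To make the impurity-based measure concrete, the sketch below computes the Gini impurity decrease produced by a single threshold split on one feature. In a random forest this quantity is weighted by node size and averaged over all nodes and trees that split on the feature; the sketch covers only the single-split building block. The feature values, the threshold and the function names are invented for illustration and do not come from the paper.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def impurity_decrease(values, labels, threshold):
    """Impurity decrease obtained by splitting the samples at value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted_children


labels = ["GoodEnough", "GoodEnough", "NeedsWork", "NeedsWork"]
informative = [0.9, 0.8, 0.2, 0.1]    # invented feature that separates the classes
uninformative = [0.9, 0.1, 0.8, 0.2]  # invented feature that does not

print(impurity_decrease(informative, labels, 0.5))
print(impurity_decrease(uninformative, labels, 0.5))
```

A feature that separates the classes cleanly removes all impurity at the split, while an uninformative one removes none, which is why the former ranks higher under this measure.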

Table 6. 30 most important linguistic features based on average impurity decrease

Table 7. 30 most important linguistic features based on the number of nodes using that feature


We found that, apart from the morphological categories, the main features affecting the quality of Russian Wikipedia articles are semantic characteristics such as the simple fact and the complex fact [13], and characteristics of subjective assessment such as verbs that have a semantic component of subjectivity. Moreover, among the main features for classifying Russian Wikipedia articles are two correlated features: the number of verbs that do not have the semantic component of subjectivity and the number of facts that do not have the semantic component of subjectivity.

We also analyzed the classification efficiency using the features of each of our four groups separately. The results reported in Table 8 were obtained using the Random Forest classifier with the features of the encyclopedic, morphological, readability and subjectivism groups separately.

Table 8. Classification results using the encyclopedic, morphological, readability and subjectivism feature groups separately.

Additionally, we analyzed the classification results using two versions of the classification standards: the FGS standard and the FG standard.

The results differ significantly between the FGS and FG versions of the classification. Precision, recall and F-measure are significantly higher when Solid articles are assigned to the NeedsWork class.

5 Conclusions and Future Works

In this work, we proposed exploiting the linguistic features of an article to assess Wikipedia content quality. We distinguished and categorized over 150 linguistic features of Russian Wikipedia articles. We divided all the linguistic characteristics into four groups: morphological features, semantic features that can explicitly express a subjective assessment or opinion of the authors, features that support exploratory conclusions about the readability of the text, and features that characterize the encyclopedic style of the article.

We found that the most important groups of linguistic characteristics affecting the quality of Russian Wikipedia articles are the parts of speech and the semantic features of the simple fact and the complex fact. Moreover, correlated features such as the number of verbs and the number of facts that do not have the semantic component of subjectivity have great predictive power for classifying article quality. Our experiments on a subset of the Russian Wikipedia revealed that frequency dictionaries are not very effective for classifying the quality of articles.

Our experiments showed that evaluating the linguistic features of the whole article is more informative than evaluating them for individual sections of the text. We also investigated three versions of the article classification standards, which differ in the position of Solid articles. Using the FG schema, we achieved an F-measure of 89.75%.

While the initial results are very promising, more in-depth investigation of these linguistic features is needed. We expect that the most effective approach is to combine our linguistic features with other parameters that affect the quality of Wikipedia articles.

In future work, we plan to conduct similar experiments for other languages to analyze how the linguistic features of different languages affect the quality of Wikipedia articles. Additionally, we are going to expand the list of semantic variables and to consider article quality in a more fine-grained categorization.

Notes

1. https://meta.wikimedia.org/wiki/List_of_Wikipedias.
2. https://analytics.wikimedia.org.
3. http://www.alexa.com/siteinfo/wikipedia.org.
4. http://wikirank.net.
5. https://en.wikipedia.org/wiki/Wikipedia:Featured_article_criteria.
6. http://pymorphy2.readthedocs.io.
7. http://opencorpora.org.
8. http://www.ruscorpora.ru/en/.
9. Detailed definitions of the simple and complex facts are given in [13].

References

1. Michael, B.: Wikipedia Or Encyclopædia Britannica: Which Has More Bias? Forbes (2015). http://www.forbes.com/sites/hbsworkingknowledge/2015/01/20/wikipedia-or-encyclopaedia-britannica-which-has-more-bias. Accessed 15 June 2017
2. Xu, Y., Luo, T.: Measuring article quality in Wikipedia: lexical clue model. In: 2011 3rd Symposium on Web Society (SWS), pp. 141–146. IEEE (2011)
3. Anderka, M.: Analyzing and predicting quality flaws in user-generated content: the case of Wikipedia. Ph.D., Bauhaus-Universitaet Weimar Germany (2013)
4. Kittur, A., Kraut, R.E.: Harnessing the wisdom of crowds in Wikipedia: quality through coordination. In: Proceedings of the 2008 ACM conference on Computer Supported Cooperative Work, pp. 37–46. ACM (2008)
5. Velázquez, C.G., Cagnina, L.C., Errecalde, M.L.: On the feasibility of external factual support as Wikipedia’s quality metric. Procesamiento del Lenguaje Natural 58, 93–100 (2017)
6. Lipka, N., Stein, B.: Identifying featured articles in Wikipedia: writing style matters. In: Proceedings of the 19th International Conference on World Wide Web, pp. 1147–1148 (2010)
7. Khairova, N., Petrasova, S., Gautam, A.: The logical-linguistic model of fact extraction from English texts. In: International Conference on Information and Software Technologies, CCIS 2016, Communications in Computer and Information Science, pp. 625–635 (2016)
8. Warncke-Wang, M., Cosley, D., Riedl, J.: Tell me more: an actionable quality model for Wikipedia. In: Proceedings of the 9th International Symposium on Open Collaboration (2013)
9. Giles, G.: Internet encyclopaedias go head to head. Nature 438, 900–901 (2005)
10. Panicheva, P., Ledovaya, Y., Bogolyubova, O.: Lexical, morphological and semantic correlates of the dark triad personality traits in Russian Facebook texts. In: Artificial Intelligence and Natural Language Conference (AINL), pp. 1–8. IEEE (2016)
11. Lenzner, T.: Are readability formulas valid tools for assessing survey question difficulty? Sociol. Methods Res. 43(4), 677–698 (2014)
12. Sharoff, S., Umanskaya, E., Wilson, J.: A frequency dictionary of Russian: core vocabulary for learners. Routledge (2014)
13. Khairova, N., Lewoniewski, W., Wecel, K.: Estimating the quality of articles in Russian Wikipedia using the logical-linguistic model of fact extraction. In: International Conference on Business Information Systems, pp. 28–42 (2017)
14. Węcel, K., Lewoniewski, W.: Modelling the quality of attributes in Wikipedia infoboxes. In: Abramowicz, W. (ed.) BIS 2015. LNBIP, vol. 228, pp. 308–320. Springer, Cham (2015). doi:10.1007/978-3-319-26762-3_27
15. Lewoniewski, W., Węcel, K., Abramowicz, W.: Quality and importance of Wikipedia articles in different languages. In: Dregvaite, G., Damasevicius, R. (eds.) ICIST 2016. CCIS, vol. 639, pp. 613–624. Springer, Cham (2016). doi:10.1007/978-3-319-46254-7_50
16. Rebuschat, P.E., Detmar, M., McEnery, T.: Language learning research at the intersection of experimental, computational and corpus-based approaches. Language Learning (2017)
17. Wu, G., Harrigan, M., Cunningham, P.: Characterizing Wikipedia pages using edit network motif profiles. In: Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, pp. 45–52. ACM (2011)
18. Lex, E., Voelske, M., Errecalde, M., Ferretti, E., Cagnina, L., Horn, C., Granitzer, M.: Measuring the quality of web content using factual information. In: Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality, pp. 7–10. ACM (2012)


Author information

Authors and Affiliations

Poznań University of Economics and Business, Poznań, Poland
Włodzimierz Lewoniewski, Krzysztof Węcel & Witold Abramowicz
National Technical University “Kharkiv Polytechnic Institute”, Kharkiv, Ukraine
Nina Khairova & Nataliia Stratiienko


Corresponding author

Correspondence to Włodzimierz Lewoniewski.

Editor information

Editors and Affiliations

Kaunas University of Technology, Kaunas, Lithuania
Robertas Damaševičius
Kaunas University of Technology, Kaunas, Lithuania
Vilma Mikašytė


About this paper

Cite this paper

Lewoniewski, W., Khairova, N., Węcel, K., Stratiienko, N., Abramowicz, W. (2017). Using Morphological and Semantic Features for the Quality Assessment of Russian Wikipedia. In: Damaševičius, R., Mikašytė, V. (eds) Information and Software Technologies. ICIST 2017. Communications in Computer and Information Science, vol 756. Springer, Cham. https://doi.org/10.1007/978-3-319-67642-5_46


DOI: https://doi.org/10.1007/978-3-319-67642-5_46
Published: 23 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67641-8
Online ISBN: 978-3-319-67642-5
eBook Packages: Computer Science, Computer Science (R0)
