1 Introduction

Large volumes of textual data have become available since the emergence of the Web. It has become gradually more challenging to digest the vast amount of information in sources such as websites, news, blogs, books, scientific papers, and social media. Hence, text summarization has emerged as a popular field of study in the past few decades, aiming to simplify and speed up the process of obtaining relevant pieces of information.

Text summarization can be defined as automatically producing a brief, fluent, and salient piece of text from a much longer and more detailed input text. The two main approaches are extractive and abstractive summarization. Extractive summarization summarizes a given input by directly copying the most relevant sentences or phrases without any modification, selecting them according to some criteria and ordering them. Abstractive summarization, on the other hand, automatically generates new phrases and sentences based on the given input and incorporates them in the output summary.

Evaluation of summarization methods is critical to assess and benchmark their performance. The main objective of evaluation is to observe how well the output summary reflects the reference summaries. The commonly used evaluation methods in summarization, such as ROUGE [17] and METEOR [3], are based on an n-gram matching strategy. For instance, ROUGE computes the number of overlapping word n-grams between the reference and system summaries in their exact (surface) forms. While exact matching is not an issue for extractive summarization, where the words are directly copied, it poses a problem for abstractive summarization, where the generated summaries can contain words in different forms. In the abstractive case, this strategy is overly strict, especially for morphologically rich languages in which words undergo extensive affixation and thus carry syntactic features; it severely penalizes words that differ even slightly in form. Hence, taking the morphosyntactic structure of these languages into account is important for the evaluation of text summarization.

In this paper, we introduce several variants of the commonly used evaluation metrics that take into account the morphosyntactic properties of the language. As a case study for Turkish, we train the state-of-the-art text summarization models mT5 [31] and BERTurk-cased [27] on the TR-News dataset [4]. The summaries generated by the models are evaluated against the reference summaries with the proposed metrics. In order to compare the evaluation metrics, we perform a correlation analysis to see how well the score obtained with each metric correlates with the human score for each system summary-reference summary pair. Turkish is a low-resource language and it is challenging to find manually annotated data for text summarization. Hence, for the correlation analysis, we annotate human relevancy judgements for a randomly sampled subset of the TR-News dataset and make this data publicly available (Footnote 1). The correlation analysis is performed using the annotated human judgements to compare the performance of the proposed morphosyntactic evaluation methods as well as other popular evaluation methods.

2 Related Work

Text summarization studies in Turkish have been mostly limited to extractive approaches. A rule-based system tailored to the economics domain is introduced by Altan [2]. Çığır et al. [7] and Kartal and Kutlu [13] extract sentences using classical features such as position, term frequency, and title similarity, and feed these features to machine learning algorithms. Özsoy et al. [21] propose variations of the commonly applied latent semantic analysis (LSA) and Güran et al. [12] utilize a non-negative matrix factorization method. Nuzumlalı and Özgür [19] study fixed-length word truncation and lemmatization for Turkish multi-document summarization.

Recently, large-scale text summarization datasets such as MLSum [28] and TR-News [4] have been released, which has enabled research on abstractive summarization in Turkish. Abstractive studies are currently very limited and mostly utilize sequence-to-sequence (Seq2Seq) architectures. Scialom et al. [28] make use of the commonly used pointer-generator model [29] and the unified pretrained language model (UniLM) proposed by Dong et al. [10]. Baykara and Güngör [4] follow a morphological adaptation of the pointer-generator algorithm and also experiment with Turkish-specific BERT models following the strategy proposed by Liu and Lapata [18]. In a later study, Baykara and Güngör [5] use the multilingual pretrained Seq2Seq models mBART and mT5 as well as several monolingual Turkish BERT models in a BERT2BERT architecture. They obtain state-of-the-art results on both the TR-News and MLSum datasets.

Most of the evaluation methods used in text summarization and other NLP tasks are more suitable for well-studied languages such as English. ROUGE [17] is the most commonly applied evaluation method in text summarization; it basically calculates the number of overlapping word n-grams. Although initially proposed for machine translation, METEOR [3] is also used in text summarization evaluation. METEOR follows an n-gram based matching strategy and builds upon the BLEU metric [22], replacing its precision-based computation with a weighted F-score over unigram mappings together with a penalty function for incorrect word order. Recently, neural evaluation methods have been introduced which aim to capture semantic relatedness. These metrics usually utilize embeddings at the word level, such as Word Mover's Distance (WMD) [15], or at the sentence level, such as Sentence Mover's Distance (SMD) [8]. BERTScore [32] makes use of the BERT model [9] to compute a cosine similarity score between the given reference and system summaries.

There has been very limited research on summarization evaluation for Turkish, which differs considerably from English in morphology and syntax. Most studies make use of common metrics such as ROUGE and METEOR [21, 28]. Recently, Beken Fikri et al. [6] utilized various semantic similarity metrics, including BERTScore, to semantically evaluate Turkish summaries on the MLSum dataset. In another work [30], the BLEU+ metric was proposed as an extension of BLEU that incorporates morphology and WordNet into the evaluation process for machine translation.

3 Overview of Turkish Morphology

Turkish is an agglutinative language that makes extensive use of suffixation. A root word can take several suffixes in a predefined order dictated by the morphotactics of the language; it is common to find words affixed with 5–6 suffixes. During affixation, words are also subject to a number of morphophonemic rules such as vowel harmony, elisions, and insertions. There are two types of suffixes: inflectional and derivational. Inflectional suffixes do not alter the core meaning of a word, whereas derivational suffixes can change its meaning or part of speech.

Table 1. Morphological analysis of an example sentence.

Table 1 shows the disambiguated morphological analysis of the sentence tutsağı serbest bıraktılar (they released the prisoner) as an example. The square brackets show the root and its part of speech, followed by the suffixes attached to the root and the morphological features employed during the derivation (Footnote 2).

4 Methodology

In this section, we explain the proposed methods that are based on the morphosyntactic features of Turkish and the evaluation metrics used in the study.

4.1 Morphosyntactic Variations

While comparing a system summary and a reference summary, the evaluation metrics used in text summarization use either the surface forms or the lemma or stem forms of the words. As stated in Sect. 1, the former approach is too restrictive and misses matches between inflected forms of the same word, whereas the latter is too flexible and allows all derivations of the same root to match, causing semantically distant words to be counted as matches. In this work, we propose and analyze several alternatives between these two extremes based on the morphosyntactic properties of the language. The system and reference summaries are preprocessed according to each proposed method before being passed to the evaluation metrics (ROUGE, METEOR, etc.); the implementations of the evaluation metrics are not changed. The proposed methods can easily be adapted to other morphologically rich languages provided that morphological analyzer tools are available.

Table 2. Proposed methods based on morphosyntactic variations of words.

Table 2 lists the methods used to process the words before applying the evaluation metrics and shows the result of each one for the example sentence given in Table 1. The Surface method leaves the words in their written forms, while the Lemma (Stem) method strips off the suffixes and takes the lemma (stem) forms of the words. The lemma and stem forms are obtained using the Zemberek library [1], which applies morphological analysis and disambiguation. For the Lemma and Stem methods, in addition to their bare forms, six variations based on different usages of the suffixes are employed. The suffixes used in these variations are also obtained from the morphological parse produced by the Zemberek library. Only the variations of the Lemma method are shown in the table to save space; the same variations also apply to the Stem method. The methods are explained below, followed by a short code sketch of the tokenizations.

Surface: The text is only lower-cased and punctuation is removed. All the other methods also perform the same cleaning and lower-casing operations. For Turkish, this is the default evaluation strategy for all the metrics.

Lemma: The text is lemmatized and the lemma forms of the words are used.

Stem: The text is stemmed and the stem forms of the words are used.

Lemma and all Suffixes: The text is lemmatized and the suffixes are extracted. The lemma and each suffix of a word are considered as separate tokens.

Lemma and Combined Suffixes: The text is lemmatized and the suffixes are extracted. The suffixes are concatenated as a single item. The lemma and the concatenated suffixes of a word are considered as separate tokens.

Lemma and Last Suffix: The text is lemmatized and the suffixes are extracted. The lemma and the last suffix of a word are considered as separate tokens.

The last three methods above split the lemma and the suffixes and use them as individual tokens. This may cause identical tokens obtained from different words to match mistakenly. For instance, if the system summary contains the word tutsağı (the prisoner) (the accusative form of tutsak (prisoner)) and the reference summary contains the word gardiyanı (the guardian) (the accusative form of gardiyan (guardian)), the morphological parse will output the suffix 'ı' for both of them. The evaluation metric (e.g. ROUGE-1) will match these two suffixes (tokens) although they belong to different words. To prevent such cases, we devise a variation of each of these three methods in which the surface form of the word is prefixed to every token generated from the word, as explained below.

Lemma and all Suffixes with Surface: The text is lemmatized and the suffixes are extracted. The surface form of a word is added as a prefix to the lemma and each of the suffixes of the word. The lemma and each suffix of the word are then considered as separate tokens.

Lemma and Combined Suffixes with Surface: The text is lemmatized and the suffixes are extracted. The suffixes are concatenated as a single item. The surface form of a word is added as a prefix to the lemma and the concatenated suffixes of the word. The lemma and the concatenated suffixes of the word are then considered as separate tokens.

Lemma and Last Suffix with Surface: The text is lemmatized and the suffixes are extracted. The surface form of a word is added as a prefix to the lemma and the last suffix of the word. The lemma and the last suffix of the word are then considered as separate tokens.
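
As an illustration of these tokenizations, the snippet below builds the token list for a summary under several of the proposed methods. It is a minimal sketch, assuming a user-supplied analyze(word) helper that wraps a morphological analyzer such as Zemberek and returns the lemma and the list of surface suffixes of a word; the helper, the method names, and the use of an underscore to attach the surface prefix are illustrative assumptions rather than our exact implementation.

```python
import re
from typing import Callable, List, Tuple

def clean(text: str) -> List[str]:
    """Lower-case, strip punctuation, and split into words (applied by all methods).
    Turkish-specific casing (I -> ı) is glossed over here."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def tokenize(text: str,
             analyze: Callable[[str], Tuple[str, List[str]]],
             method: str = "lemma") -> List[str]:
    """Turn a summary into the token list required by one of the proposed methods."""
    tokens: List[str] = []
    for word in clean(text):
        lemma, suffixes = analyze(word)            # e.g. ("tutsak", ["ı"]) for "tutsağı"
        if method == "surface":
            tokens.append(word)
        elif method == "lemma":                    # the Stem variants are analogous
            tokens.append(lemma)
        elif method == "lemma_all_suffixes":       # lemma and each suffix as separate tokens
            tokens.extend([lemma] + suffixes)
        elif method == "lemma_combined_suffixes":  # suffixes concatenated into a single token
            tokens.extend([lemma, "".join(suffixes)] if suffixes else [lemma])
        elif method == "lemma_last_suffix":
            tokens.extend([lemma] + suffixes[-1:])
        elif method == "lemma_all_suffixes_surface":
            # prefix the surface form so that suffixes of different words cannot match
            tokens.extend([f"{word}_{lemma}"] + [f"{word}_{s}" for s in suffixes])
        else:
            raise ValueError(f"unknown method: {method}")
    return tokens
```

With the prefixed variant, the suffix 'ı' of tutsağı becomes the token tutsağı_ı and can no longer match the token gardiyanı_ı produced from gardiyanı, which is exactly the behavior motivated above.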

4.2 Evaluation Metrics

We use five different metrics for comparing system summaries and reference summaries. We apply the morphosyntactic variations to the summaries and then score the performance using these metrics. In this way, we make a detailed analysis of which combinations of evaluation metrics and morphosyntactic tokenizations correlate well with human judgments. Each metric is briefly explained below.

ROUGE [17] is a recall-oriented metric which is commonly used in text summarization evaluation. ROUGE-N computes the number of overlapping n-grams between the system and reference summaries while ROUGE-L considers the longest common sub-sequence matches.

METEOR [3] is another commonly used metric in text summarization [14, 28]. It is based on unigram matches and makes use of both unigram precision and unigram recall. Word order is also taken into account via the concept of chunk.

BLEU [22] is a precision-oriented metric originally proposed for machine translation evaluation. It uses a modified version of n-gram precision and takes into account both the common words in the summaries and the word order through higher-order n-grams. Although not as common as ROUGE, BLEU is also used in text summarization evaluation as an additional metric [11, 23].

BERTScore [32] is a recent metric proposed to measure the performance of text generation systems. It extracts contextual embeddings of the words in the system and reference summaries using the BERT model and then computes pairwise cosine similarity between the words of the summaries.

chrF [24] is an evaluation metric initially proposed for machine translation. It calculates an F-score over character n-gram matches between the system output and the references. Since it operates on character n-grams, it implicitly takes morphosyntax into account.

In this work, we make use of Huggingface's evaluate library (Footnote 3) for all the metrics explained above. We use the monolingual BERTurk-cased [27] model for computing the BERTScore values.
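
A minimal sketch of how a single system/reference pair can be scored with the evaluate library is given below. The tokenized summaries are re-joined with spaces so that the metric implementations themselves stay unchanged; the BERTurk-cased Hub identifier and the num_layers value are assumptions for illustration, not details reported above.

```python
import evaluate

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bleu = evaluate.load("bleu")
chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

def score_pair(system_tokens, reference_tokens):
    """Score one tokenized system/reference summary pair with all five metrics."""
    pred, ref = " ".join(system_tokens), " ".join(reference_tokens)
    return {
        "rouge": rouge.compute(predictions=[pred], references=[ref]),
        "meteor": meteor.compute(predictions=[pred], references=[ref]),
        "bleu": bleu.compute(predictions=[pred], references=[[ref]]),
        "chrf": chrf.compute(predictions=[pred], references=[[ref]]),
        "bertscore": bertscore.compute(
            predictions=[pred], references=[ref],
            model_type="dbmdz/bert-base-turkish-cased",  # assumed BERTurk-cased checkpoint
            num_layers=12),
    }
```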

5 Dataset, Models, and Annotations

In this section, we first explain the dataset and the models used for the text summarization experiments. We then give the details of the annotation process where the summaries output by the models are manually scored with respect to the reference summaries. The human judgment scores will be used in Sect. 6 to observe the goodness of the proposed morphosyntactic methods.

5.1 Dataset

We use the TR-News [4] dataset for the experiments. TR-News is a large-scale Turkish summarization dataset that consists of news articles. It contains 277,573, 14,610, and 15,379 articles in the train, validation, and test sets, respectively.

5.2 Models

In this work, we use two state-of-the-art abstractive Seq2Seq summarization models. The models are trained on the TR-News dataset and used to generate the system summaries of a sample set of documents to compare with the corresponding reference summaries.

mT5 [31] is the multilingual variant of the T5 model [25] and closely follows its model architecture with some minor modifications. The main idea behind the T5 model is to approach each text-related task as a text-to-text problem where the system receives a text sequence as input and outputs another text sequence.

BERTurk-cased [27] is a bidirectional transformer network pretrained on a large corpus. It is an encoder-only model used mostly for feature extraction. However, Rothe et al. [26] proposed constructing a Seq2Seq model from such checkpoints by initializing both the encoder and the decoder with them and making several modifications to the model structure. Following this approach, we construct a BERT2BERT model using BERTurk-cased and finetune it on abstractive text summarization.
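
A hedged sketch of this BERT2BERT construction with the transformers EncoderDecoderModel API is shown below; the Hub identifier used for BERTurk-cased is an assumption.

```python
from transformers import AutoTokenizer, EncoderDecoderModel

ckpt = "dbmdz/bert-base-turkish-cased"   # assumed BERTurk-cased checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(ckpt, ckpt)

# The encoder-only checkpoint has no generation-specific special tokens,
# so they are set explicitly for the Seq2Seq model.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```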

The maximum encoder lengths for mT5 and BERTurk-cased are set to 768 and 512, respectively, whereas the maximum decoder length is set to 128. The learning rate is 1e-3 for the mT5 model and 5e-5 for the BERTurk-cased model. An effective batch size of 32 is used for both models. The models are finetuned for a maximum of 10 epochs, with early stopping (patience 2) based on the validation loss.
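
The sketch below shows how these hyperparameters could be expressed with the transformers trainer API (BERTurk-cased values shown). The batch-size split, the output directory, and the argument names are assumptions, and dataset preparation is omitted; newer transformers releases rename evaluation_strategy to eval_strategy.

```python
from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

args = Seq2SeqTrainingArguments(
    output_dir="bert2bert-trnews",        # hypothetical output path
    learning_rate=5e-5,                   # 1e-3 for the mT5 model
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,        # effective batch size of 32
    num_train_epochs=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
# A Seq2SeqTrainer would then be built with these arguments, the model from the
# previous sketch, the tokenized TR-News splits, and the early-stopping callback.
```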

Table 3. Average scores and inter-annotator agreement scores for the models. In the first row, the averages of the two annotators are separated by the "/" sign.

5.3 Human Judgment Annotations

In order to observe which morphosyntactic tokenizations and automatic summarization metrics perform well in evaluating the performance of text summarization systems for morphologically rich languages, we need a sample dataset consisting of documents, system summaries, reference summaries, and relevancy scores between the system and reference summaries. For this purpose, we randomly sampled 50 articles from the test set of the TR-News dataset. For each article, the system summary output by a model is given a manual score indicating its relevancy to the corresponding reference summary. This is done separately for the mT5 and BERTurk-cased models. The relevancy scores are annotated by two native Turkish speakers with graduate degrees. An annotator is shown the system summary and the reference summary for an article, without the original document, and is asked to give a score. We decided to keep the annotation process simple by giving a single score to each system summary-reference summary pair covering the overall semantic relevancy of the summaries, instead of scoring different aspects (adequacy, fluency, style, etc.) separately. The scores range from 1 (completely irrelevant) to 10 (completely relevant).

Table 3 shows the average scores of the annotators and the inter-annotator agreement scores. The averages of the two annotators are close to each other for both models. The Pearson correlation and Krippendorff's alpha values of around 0.80–0.90 indicate strong agreement between the annotators' scores. We also report Cohen's Kappa coefficient as a measure of agreement between the annotators. The values of 0.44 and 0.25 signal moderate and fair agreement, respectively [16]. Since Cohen's Kappa is mostly suitable for measuring agreement on categorical values rather than quantitative values as in our case, these results should be approached with caution.
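
The agreement statistics in Table 3 can be computed from the two annotators' score lists along the lines of the sketch below; the score values shown are placeholders, and the interval level of measurement for Krippendorff's alpha is an assumption.

```python
import krippendorff
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

ann1 = [8, 6, 9, 3, 7]    # annotator 1 scores (placeholder values)
ann2 = [7, 6, 10, 4, 7]   # annotator 2 scores (placeholder values)

r, _ = pearsonr(ann1, ann2)
alpha = krippendorff.alpha(reliability_data=[ann1, ann2],
                           level_of_measurement="interval")
kappa = cohen_kappa_score(ann1, ann2)
print(f"Pearson r={r:.2f}  Krippendorff alpha={alpha:.2f}  Cohen kappa={kappa:.2f}")
```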

Table 4. Pearson correlation results of the morphosyntactic methods with prefix tokens for the BERTurk-cased summarization model. Bold and underline denote, respectively, the best score and the second-best score for a column.
Table 5. Pearson correlation results of the morphosyntactic methods with prefix tokens for the mT5 summarization model. Bold and underline denote, respectively, the best score and the second-best score for a column.

6 Correlation Analysis

In this work, we mainly aim to observe the correlation between human evaluations and automatic evaluations of the system-generated summaries. For each of the proposed morphosyntactic tokenization methods (Sect. 4.1), we first apply the method to the system and reference summaries of a document and obtain the tokenized forms of the words in the summaries. We then evaluate the similarity of the tokenized system and reference summaries with each of the standard metrics (Sect. 4.2). Finally, we compute the Pearson correlation between the human score (the average of the two annotators) given to the reference summary-system summary pair (Sect. 5.3) and the metric score calculated based on that morphosyntactic tokenization.
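
This procedure can be summarized by the sketch below, which reuses the tokenize helper and the meteor metric object from the earlier sketches; samples, a list of (system summary, reference summary, averaged human score) triples, and the choice of METEOR as the scored metric are assumptions for illustration.

```python
from scipy.stats import pearsonr

def metric_correlation(samples, analyze, method="lemma"):
    """Pearson correlation between METEOR scores and human judgments for one tokenization method."""
    metric_scores, human_scores = [], []
    for system, reference, human in samples:
        pred = " ".join(tokenize(system, analyze, method))
        ref = " ".join(tokenize(reference, analyze, method))
        metric_scores.append(meteor.compute(predictions=[pred],
                                            references=[ref])["meteor"])
        human_scores.append(human)
    r, _ = pearsonr(metric_scores, human_scores)
    return r
```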

In this way, we make a detailed analysis of the morphosyntactic tokenization method and text summarization metric combinations. The results are shown in Tables 4 and 5. For the ROUGE metric, we include the results for the ROUGE-1, ROUGE-2, and ROUGE-L variants that are commonly used in the literature. For the tokenization methods that include suffixes, we show only the results with the surface forms of the words prefixed to the tokens (with Surface). The results without the prefixed tokens are given in the Appendix. Interestingly, the methods that do not use the prefix forms correlate better with the human judgments, although they tend to produce incorrect matches as shown in Sect. 4.1.

We observe that the Lemma method mostly yields the best results for the summaries generated by the BERTurk-cased model, followed by the Stem method. These results indicate that simply taking the root of the words, in the form of a lemma or stem, before applying the evaluation metrics is sufficient, without the need for more complex tokenizations. One exception is the BERTScore metric, which works best with the surface forms of the words. This may be regarded as expected behavior since BERTScore is a semantically oriented evaluation approach while the others are mostly syntactically oriented metrics. Hence, when fed with the surface forms, BERTScore can capture the similarities between different orthographical forms of the words.

The summaries generated by the mT5 model follow a similar pattern in ROUGE evaluations. The Lemma method and the Stem method yield high correlations with human scores. On the other hand, the other three metrics correlate better with human judgments when suffixes are also incorporated as tokens into the evaluation process in addition to the lemma or stem form. The BERTScore metric again shows a good performance when used with the Surface method.

We observe a significant difference between the correlation scores of the BERTurk-cased model and the mT5 model. The higher correlation results of the BERTurk-cased model indicate that summaries with better quality are generated. This may be attributed to the fact that BERTurk-cased is a monolingual model unlike the multilingual mT5 model and this distinction might have enabled it to produce summaries with better and more relevant context.

The high correlation ratios obtained with the Lemma tokenization approach may partly be attributed to the success of the Zemberek morphological tool. Zemberek has a high performance in morphological analysis and morphological disambiguation for Turkish [1]. When the Lemma and Stem methods are compared, we see that the Lemma method outperforms the Stem method for both models and for all evaluation metrics. This is the case for both the bare forms of these two methods and their variations. The tokenization methods where the last suffixes are used follow the top-ranking Lemma and Stem methods in BERTurk-cased evaluations, whereas they fall behind the tokenization variations with all suffixes in mT5 evaluations. The motivation behind the last suffix strategy is that the last suffix is considered as one of the most informative morphemes in Turkish [20]. We see that this simple strategy is on par with those that use information of all the suffixes.

Finally, comparing the five text summarization evaluation metrics shows that METEOR yields the best correlation results for both models, followed by the chrF metric. Although the underlying tokenization method that yields the best performance differs between the two models (Lemma for BERTurk-cased and Lemma with all suffixes for mT5), we can conclude that the METEOR metric applied to lemmatized system and reference summaries appears to be the best choice for text summarization evaluation. This is an interesting result considering that ROUGE is the most commonly used evaluation metric in text summarization.

It should be noted that the Surface method corresponds to the approach used in the evaluation tools for these metrics. That is, the ROUGE, METEOR, BLEU, chrF, and BERTScore tools used in the literature mostly follow a simple strategy and work on the surface forms of the words. However, Tables 4 and 5 show that other strategies such as using the lemma form or using the lemma form combined with the suffixes nearly always outperform this default strategy. This indicates that employing morphosyntactic tokenization processes during evaluation increases correlation with human judgments and thus contributes to the evaluation process.

7 Conclusion

In this study, we introduced various morphosyntactic methods that can be used in text summarization evaluation. We trained state-of-the-art text summarization models on the TR-News dataset. The models were used to generate system summaries for a set of documents sampled from the test set of TR-News. The relevancy of the system summaries to the reference summaries was manually scored, and a correlation analysis was performed between the manual scores and the scores produced by the morphosyntactic methods. The correlation analysis revealed that using morphosyntactic methods in the evaluation metrics outperforms the default strategy of using the surface forms for Turkish. We make the manually annotated evaluation dataset publicly available to alleviate the resource scarcity problem in Turkish. We believe that this study will help draw attention to the importance of preprocessing in summarization evaluation.