
1 Introduction

Dialogism is a philosophical theory introduced by Mikhail Bakhtin [1, 2], centered on the idea that everything, even life itself, is dialogic, a continual exchange and interaction between voices: “Life by its very nature is dialogic... when dialogue ends, everything ends” [2]. Trausan-Matu et al. [3] extended the concept of voice to the analysis of discourse in general and of collaborative learning in particular. They consider voices to be generalized representations of different points of view or ideas that spread throughout the discourse and influence it. Voices were subsequently operationalized by Dascalu et al. [4] as semantic chains obtained by combining lexical chains, i.e., sequences of repeated or related words, including synonyms or hypernyms [5]. Semantic chains propagate across sentences and help create narrative threads throughout the text.

Recent studies on building lexical chains consider word repetitions, synonyms, and semantic relationships between nouns [6]. Mukherjee et al. [6] used lexical chains to distinguish easy from difficult medical texts; identifying the lexical chains that signal a difficult sentence supports the simplification process. Olena [7] proposed a graph-based method for identifying lexical chains, in which the nodes represent the terms in the document and the edges represent the semantic relations between them. More recently, Ruas et al. [8] combined lexical chains with word embeddings to classify documents.

We introduce and evaluate a novel operationalization of voices using BERT [9], a state-of-the-art language model. Our approach further enhances the Cohesion Network Analysis graph from the ReaderBench framework [10, 11] by integrating semantic links between related concepts, indicative of semantic flow [12].

2 Method

A specific dataset with examples of links was required to identify the attention heads from BERT capable of detecting semantic links between words that belong to the same chain. A set of simple heuristics was used to extract links from sample texts, for all pairs of words tagged as noun, verb, or pronoun that fulfil one of the following conditions: a) repetitions of words having the same lemma; b) synonyms, hypernyms, or siblings in the WordNet taxonomy [13]; and c) coreferences identified using spaCy. The TASA corpus was selected as reference due to its diversity and the range of text complexity levels it covers. The “correct” pairs were extracted from the entire dataset using the previous rules, while the “incorrect” ones were randomly sampled with 10% probability from all pairs of words that were not selected (otherwise, the number of negative samples would have been one order of magnitude larger than that of the “correct” semantic associations). In total, 49 million word pairs were extracted, out of which around 20 million were positive examples.
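A minimal sketch of these extraction heuristics is given below, assuming spaCy's en_core_web_sm model and NLTK's WordNet interface; the coreference condition is omitted and the function names are illustrative, not the exact implementation.

```python
# Sketch of the pair-extraction heuristics (spaCy lemmas/POS + WordNet relations).
# Requires: python -m spacy download en_core_web_sm, nltk.download("wordnet")
import random

import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_sm")
CONTENT_POS = {"NOUN", "VERB", "PRON"}

def wordnet_related(lemma_a: str, lemma_b: str) -> bool:
    """True if the two lemmas are synonyms, in a hypernymy relation, or siblings."""
    syns_a, syns_b = set(wn.synsets(lemma_a)), set(wn.synsets(lemma_b))
    if syns_a & syns_b:                                   # shared synset -> synonyms
        return True
    for sa in syns_a:
        hyper_a = set(sa.hypernyms())
        for sb in syns_b:
            hyper_b = set(sb.hypernyms())
            if sa in hyper_b or sb in hyper_a:            # hypernym / hyponym
                return True
            if hyper_a & hyper_b:                         # siblings (common hypernym)
                return True
    return False

def extract_pairs(text: str, neg_prob: float = 0.10):
    """Return (i, j, label) tuples for content-word pairs in a text."""
    doc = nlp(text)
    tokens = [t for t in doc if t.pos_ in CONTENT_POS]
    pairs = []
    for idx, t1 in enumerate(tokens):
        for t2 in tokens[idx + 1:]:
            if t1.lemma_ == t2.lemma_ or wordnet_related(t1.lemma_, t2.lemma_):
                pairs.append((t1.i, t2.i, 1))             # "correct" pair
            elif random.random() < neg_prob:              # downsampled "incorrect" pair
                pairs.append((t1.i, t2.i, 0))
    return pairs
```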

Transformer-based models, in particular BERT [9], build contextual representations of words by stacking multi-head attention layers. Besides achieving state-of-the-art results on a wide range of Natural Language Processing tasks, these models also provide insights into the importance of words and the relations between them through their attention values. Clark et al. [14] explored the interpretability of the attention heads from different layers of the BERT model and showed that individual heads can identify specific syntactic functions or perform coreference resolution.
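For illustration, the attention values that later serve as pairwise features can be obtained from a pretrained BERT model through the HuggingFace Transformers library; the following is a sketch under that assumption, not the exact code used in the study.

```python
# Extracting attention values from a pretrained BERT (bert-base-uncased:
# 12 layers x 12 heads) via HuggingFace Transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The colonists in Boston asked for help and supplies.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One attention tensor per layer, each of shape (batch, heads, seq_len, seq_len)
attentions = torch.stack(outputs.attentions)               # (12, 1, 12, n, n)

# Feature vector for a token pair (i, j): attention in both directions,
# across all layers and heads -> 2 * 12 * 12 = 288 values
i, j = 2, 4
features = torch.cat([attentions[:, 0, :, i, j].flatten(),
                      attentions[:, 0, :, j, i].flatten()])
print(features.shape)                                       # torch.Size([288])
```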

No single attention head is accurate enough on its own to predict these kinds of semantic relationships between words. Therefore, a prediction model that learns to combine the attention values from all attention heads between two words was trained on the dataset constructed from TASA. By considering both directions of attention, 288 scores were used in total (i.e., 2 directions × 12 layers × 12 heads), similar to the approach of Clark et al. [14]. One issue to be tackled was the limited sequence length accepted by the pretrained BERT model (i.e., 512 tokens). Texts in the TASA dataset, and texts in general, can be longer; thus, a sliding window was used to compute the attention weights for all pairs of words. The window length was set to 256 tokens for efficiency reasons, but also because semantic chains usually do not contain links between words that are very far apart. An overlap of 128 tokens was used so that words close to a window boundary could still be connected; if two attention values are computed for the same pair of words because of this overlap, the maximum value is used as the weight.
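A sketch of this sliding-window computation is shown below, assuming a HuggingFace-style BERT model and token IDs already produced by its tokenizer; the helper name is illustrative and special-token handling is omitted.

```python
# Sliding-window attention extraction for long texts (window 256, stride 128);
# overlapping estimates for the same pair are merged by element-wise maximum.
import torch

MAX_LEN, STRIDE = 256, 128

def pairwise_attention_features(token_ids, model):
    """Map each (i, j) token-index pair to its 288-dim attention feature vector."""
    features = {}
    start = 0
    while True:
        window = token_ids[start:start + MAX_LEN]
        with torch.no_grad():
            out = model(torch.tensor([window]), output_attentions=True)
        att = torch.stack(out.attentions)[:, 0]             # (layers, heads, w, w)
        for a in range(len(window)):
            for b in range(a + 1, len(window)):              # in practice: cap |b - a|
                i, j = start + a, start + b
                feat = torch.cat([att[:, :, a, b].flatten(),
                                  att[:, :, b, a].flatten()])
                prev = features.get((i, j))
                features[(i, j)] = feat if prev is None else torch.maximum(prev, feat)
        if start + MAX_LEN >= len(token_ids):
            break
        start += STRIDE
    return features
```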

The previously described prediction model was used to score all pairs of words within a given distance in the text. The next step consisted of grouping these pairs into sets of semantically related words, i.e., semantic chains. Links were filtered based on the predicted weight using a fixed threshold, experimentally set at 0.90. The semantic chains are then selected as the connected components of the resulting graph.
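This grouping step amounts to thresholding the predicted weights and taking connected components; the sketch below uses networkx for illustration, which is an assumption rather than the library actually employed.

```python
# Building semantic chains: keep links above the threshold and take the
# connected components of the resulting graph.
import networkx as nx

THRESHOLD = 0.90

def build_semantic_chains(scored_pairs):
    """scored_pairs: iterable of (word_i, word_j, predicted_weight)."""
    graph = nx.Graph()
    for w1, w2, weight in scored_pairs:
        if weight >= THRESHOLD:
            graph.add_edge(w1, w2, weight=weight)
    return [sorted(chain) for chain in nx.connected_components(graph)]

chains = build_semantic_chains([
    ("colonists", "Boston", 0.95),
    ("help", "supplies", 0.93),
    ("tea", "ship", 0.42),        # below the threshold, discarded
])
print(chains)                     # e.g. [['Boston', 'colonists'], ['help', 'supplies']]
```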

3 Results

Different architectures for identifying semantic links were trained and evaluated: a linear model that learns a single weight for each attention head, and Multi-Layer Perceptrons (MLPs) with one or two hidden layers. All models output a single value passed through a sigmoid activation (see Table 1).

Table 1. Link prediction results.
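The two families of classifiers compared in Table 1 can be sketched in PyTorch as follows; the hidden-layer size shown is an assumption, as the exact hyperparameters are not reported here.

```python
# Link classifiers: a linear model (one weight per attention feature) and an
# MLP with one hidden layer; both end in a sigmoid.
import torch.nn as nn

NUM_FEATURES = 288                # attention values from all heads, both directions

linear_model = nn.Sequential(
    nn.Linear(NUM_FEATURES, 1),
    nn.Sigmoid(),
)

mlp_model = nn.Sequential(        # hidden size of 64 is an illustrative choice
    nn.Linear(NUM_FEATURES, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)

# Both are trained as binary classifiers on the TASA-derived word pairs,
# e.g. with nn.BCELoss() on the sigmoid output.
```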

An interactive view developed using Angular 6 (https://angular.io) was introduced to display the semantic chains (see Fig. 1 for a text selected from the dataset described in McNamara et al. [15]). Each sentence is represented as a row, and rows are grouped by their corresponding paragraph. Words and links from the same semantic chain share the same color. The chains extracted with our method are noticeably denser than classical lexical chains. The generated chains also contain surprising relations that are not present in the constructed dataset: the linear model found connections between “colonists” and “Boston”, or between “help” and “supplies”, while the MLP model linked “British” and “Great Britain”, effectively treating the latter as a compound word. This example also shows that choosing between the linear and MLP models is not straightforward, despite the substantial performance improvement of the latter on the word pairs dataset. Even though the linear model cannot perfectly learn the simple heuristics used to build the initial dataset, it can retrieve new, insightful connections between words.

Fig. 1. Visualizations of a) lexical chains [5], b) semantic chains using the linear model, and c) semantic chains using the MLP model.

4 Conclusions

A novel method for identifying semantic links using only the attention scores computed by BERT was introduced, a core task for operationalizing dialogism as a discourse model. The relevant attention heads for this task, and how to combine them, were determined by building a dataset of word pairs with simple rules. The introduced visualization argues for a denser capturing of inner semantic links between words and even compound words, links that are quite sparse when considering only manually defined synsets from WordNet. We aim to further extend this model with sentiment analysis features derived from the local contexts captured by BERT, thus enriching the analysis with the identification of convergent and divergent points of view.