1 Introduction

The annotation and classification of scientific literature is a crucial task for making scientific knowledge easily discoverable, accessible, and reusable, accelerating scientific breakthroughs by helping scholars locate and understand the right research, make connections, and overcome information overload. Some examples of efforts to structure scientific literature include scientific search engines like Semantic Scholar [1] and Microsoft Academic [23]. Both rely on knowledge graphs to enable a structured representation of scientific knowledge that supports applications like topic-driven search and recommendation. Similarly, scientific publishers have released knowledge graphs such as SN SciGraph [7] in order to organize their publications more effectively and increase automation. Other efforts like ORKG [8] rely on knowledge graphs to structure the actual contributions described in the publications, making research results on a specific topic comparable across the literature.

Publications are therefore being annotated with information about their content, which includes topics [1], fields of study [23], concepts [7], and research fields [8]. Such metadata is generally based on controlled vocabularies and arranged according to a taxonomy [7, 8], thesaurus [1, 23] or ontology [21]. In some cases, the annotation process can be fully automatic [1, 23]. However, authors are often asked to manually classify their contributions into the right categories, which is tedious and error-prone. On other occasions, this task falls to a small number of senior expert editors, making the process expensive and slow [21].

In this paper, we focus on the task of classifying scientific publications against a taxonomy of scientific disciplines. A wide variety of approaches are suitable for this task, including machine learning classifiers that rely on high-dimensional sparse representations [10], deep learning classifiers using dense representations [11], and rule-based or heuristic methods [21]. Encouraged by the success of recent developments in natural language processing and understanding, where pre-trained transformer language models dominate the state of the art [27], herein we focus on BERT [5] and its different flavors specialized in the scientific domain: BioBERT [16] and SciBERT [2].

Our experiments confirm that using transformers to train scientific classifiers generally results in greater accuracy compared to the linear classifiers that were until now regarded as strong baselines [11]. We also observe that pre-training transformers on domain-specific corpora contributes to this goal. However, despite previous research focused on interpreting and understanding how transformers encode information [4, 9, 15, 20, 25], the actual mechanism by which fine-tuning impacts our classification task is still unclear. In an effort to shed light on this matter, we focus on analyzing the self-attention mechanism inherent in the transformer architecture [26]. Our findings show that the last layer of BERT attends to words that are semantically relevant to the scientific fields associated with each publication. This observation suggests that self-attention actually performs some type of feature selection for the fine-tuned model.

We investigate the possible relation between self-attention and feature selection methods from different perspectives, including vocabulary overlap, ranking similarity, domain relevance, feature stability, and classification performance. Our results open a future research path to determine whether injecting feature selection methods into the self-attention mechanism could yield even better results for single-sequence classification using transformer architectures.

Our main contributions in this paper are the following:

  • We leverage the vertical pattern present in the transformer self-attention mechanism of BERT, SciBERT and BioBERT, where some words receive more attention on average than the rest of the words, and compare it against conventional feature selection methods used in text classification.

  • We find that self-attention has interesting properties as a feature selection method. The most attended words are in general more relevant to the publication domain than those found using conventional approaches to feature selection, and the stability of the features resulting from self-attention is in line with that of conventional approaches. However, when used to learn classifiers from scratch, methods like chi-square and information gain yield better classifiers.

  • We analyze the self-attention mechanism from a semantic point of view and quantify the amount of domain knowledge it encodes in the hidden states of the last layer. To this end, we rely on ConceptNet [24], a commonsense knowledge graph in which attended words are mapped to concepts from which we derive their corresponding domains.

The remainder of the paper is structured as follows. Section 2 describes related work in the annotation of scientific publications, classification, transformer language models, and other work focused on the analysis of transformer self-attention. In Sect. 3, we present experimental results classifying research papers into a scientific taxonomy. In Sect. 4, we motivate the analysis of self-attention as feature selection with examples of attended words and scientific categories. In Sect. 5, we quantify the relation between self-attention and feature selection methods. Finally, Sect. 6 concludes the paper.

2 Related Work

Annotating research articles with entities such as research fields or topics is addressed in the literature using entity recognition and similarity measures between entity labels and their mentions [3]. In Microsoft Academic Graph [23], the candidate entities (fields of study) are identified using string matching between the entity keywords and their paper mentions; then rules are applied to gather more candidates and to filter out the less relevant entities. Similarly, the CSO classifier [21], which assigns articles to concepts in the Computer Science Ontology, first identifies concepts explicitly mentioned in the text and then, in an effort to find entities not explicitly mentioned, uses a similarity measure based on word embeddings. In the Semantic Scholar literature graph [1], an ensemble of tools is used to annotate entities: statistical models for entity span prediction and disambiguation, rules for string-based entity spotting, and off-the-shelf tools.

In addition, different models can be used for this task, including SVM [10] or softmax classifiers [14]. Mai et al. [17] proposed classifiers based on convolutional [13] and recurrent neural networks [30] to annotate research articles. However, such deep learning classifiers need to be trained from scratch and depend on the network architecture. In contrast, neural language models, and particularly transformers like GPT-2 [19] or BERT [5], are pre-trained on a large corpus and then fine-tuned for classification by just adding a linear classifier to the model output. This approach has proven to successfully tackle several NLP tasks [27], including text classification. In the scientific domain, SciBERT [2] and BioBERT [16] have also reported state-of-the-art results. Researchers are investigating the mechanics underlying BERT [20], analyzing its hidden states and outputs [9, 25], as well as the self-attention mechanism [4, 15]. Unlike previous approaches [4, 15], we semantically analyze the words that are attended above average in the last hidden state, leveraging the commonsense knowledge represented in ConceptNet, and quantify the relation between attention and feature selection methods often used in text classification.

Table 1. Language models pre-training information.

3 Fine-Tuning Language Models for Text Classification

We evaluate the use of language models on a text classification task where research articles are labeled with one or more knowledge fields. To this purpose, we choose: i) BERT and GPT-2, pre-trained on a general-purpose corpus, ii) SciBERT, pre-trained solely on scientific documents, and iii) BioBERT, pre-trained on a combination of general and scientific text. Table 1 provides relevant information about each language model, its pre-training, and its vocabulary. BioBERT uses the same tokenization method and vocabulary as BERT, while SciBERT builds its own WordPiece vocabulary using the SentencePiece library. The overlap between the vocabularies of BERT and SciBERT is 42%, which shows a substantial difference between the most frequently used words in the scientific domain and those in general-purpose documents. We choose the base version of the BERT models (12 layers, hidden size 768, 12 attention heads per layer) and a comparable model for GPT-2.
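
As a rough illustration of how such a vocabulary overlap can be measured, the sketch below intersects the WordPiece vocabularies of two publicly available checkpoints; the Hugging Face model identifiers are our assumption, not necessarily the exact checkpoints used in this work.

```python
# Hedged sketch: estimate the vocabulary overlap between BERT and SciBERT by
# intersecting the WordPiece vocabularies exposed by their tokenizers.
from transformers import AutoTokenizer

bert = AutoTokenizer.from_pretrained("bert-base-uncased")
scibert = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

bert_vocab = set(bert.get_vocab())
scibert_vocab = set(scibert.get_vocab())

shared = bert_vocab & scibert_vocab
print(f"BERT: {len(bert_vocab)} tokens, SciBERT: {len(scibert_vocab)} tokens")
print(f"Shared tokens: {len(shared) / len(bert_vocab):.0%} of the BERT vocabulary")
```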

To fine-tune BERT, BioBERT and SciBERT on our multilabel classification task, we follow the guidelines provided by Devlin et al. [5] for single-sentence classification. We take the last-layer encoding of the classification token [CLS] and add an N-dimensional linear layer, with N the number of classification labels. We use a binary cross-entropy loss function to allow the model to assign independent probabilities to each label. For GPT-2, we also add a linear layer on top of the last hidden state of the classification token. We train the models for 4 epochs, with batch size 8 and a learning rate of 2e-5.
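
A minimal sketch of this setup, assuming a PyTorch/Hugging Face implementation (not the authors' exact training code), could look as follows:

```python
# Sketch: BERT encoder + N-dimensional linear head over the [CLS] encoding,
# trained with binary cross-entropy so each label gets an independent probability.
import torch
import torch.nn as nn
from transformers import BertModel

class MultiLabelClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=22):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = out.last_hidden_state[:, 0]   # encoding of the [CLS] token
        return self.head(cls_repr)               # one raw logit per label

model = MultiLabelClassifier()
loss_fn = nn.BCEWithLogitsLoss()                 # sigmoid + binary cross-entropy
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# Training loop (omitted): 4 epochs, batch size 8, labels as multi-hot float vectors.
```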

As a baseline, we use an SVM with a linear kernel [6]. We follow a one-vs-all strategy to train a binary SVM classifier per category, with grid search for the regularization parameter. We use WordNet to lemmatize words, whenever they exist in the WordNet lexicon, and remove stop words. In addition, we use fastText [11] to learn a hierarchical softmax classifier using n-gram embeddings. We learn binary classifiers for each category, with automatic hyperparameter optimization to fix the learning rate, number of epochs, and n-gram length.
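
For reference, here is a sketch of the SVM baseline under our reading of the setup: a scikit-learn pipeline with TF-IDF features (an assumption on our part) and grid search over the regularization parameter C. The WordNet lemmatization step is omitted for brevity.

```python
# Sketch: one-vs-rest linear SVM per category with grid search over C.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),     # stop-word removal + TF-IDF
    ("svm", OneVsRestClassifier(LinearSVC())),             # one binary SVM per label
])
param_grid = {"svm__estimator__C": [0.01, 0.1, 1, 10]}      # regularization grid
search = GridSearchCV(pipeline, param_grid, scoring="f1_micro", cv=5)
# search.fit(texts, multi_hot_labels)   # texts: title + abstract of each paper
```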

We gather our dataset of scientific articles from a broad range of knowledge fields in SciGraph [7], where articles are labelled following the ANZSRC taxonomy. This taxonomy comprises 22 first level categories, such as Economics, Law, and Computer Science, each with its own subcategory tree. From SciGraph, we extract the titles and abstracts of articles published in 2011 and 2012, as well as their categories. In total, we gather 405K papers, 187K from 2011 and the rest from 2012. On average, each first level category has 20,164 articles, with a standard deviation of 31,791, which shows how unevenly the different categories are covered. Some of them are well represented, like Medical And Health Sciences, with 138,728 articles, while others, like Studies In Creative Arts And Writing, have just over a hundred articles.

We fine-tune the language models to learn to classify papers into any of the 22 first level categories. We train on papers from 2011 only and evaluate using 5-fold cross validation. Table 2 shows that the transformers pre-trained on a scientific corpus generally achieve a greater f-measure on this task. The exception is BioBERT-1.0, which scores below BERT. BioBERT-1.0 was pre-trained for fewer steps than the other transformers, which could be affecting its performance. GPT-2 produces the lowest f-measure, which points to a potential mismatch between the Web corpus on which it was pre-trained and the vocabulary and quality of the scientific corpus, which may be undermining its performance. Overall, transformers produce more accurate classifiers than the linear methods used as baselines.

Table 2. Evaluation results of the multilabel classifiers (f-measure) on first level categories (a), and on second level categories (b).

To further explore the relation between the pre-training and fine-tuning corpora, we learn classifiers to label articles with second level categories in ANZSRC for some of the first level categories. For this experiment, we enlarge our dataset with articles published in 2012 and evaluate only the best language models, discarding BioBERT-1.0 and GPT-2. The results in Table 2 show that, in general, scientific categories are dominated by SciBERT and BioBERT-1.1. However, for categories in the humanities, e.g., Language, and History and Archaeology, BERT produces better classifiers, providing evidence that the general-purpose knowledge encoded in BERT is more relevant in those cases. Interestingly, when there are few examples, e.g., in the categories Built Environment and Creative Arts, the general knowledge encoded in BERT is of little use for the classifiers, while the scientific knowledge in BioBERT-1.1 and SciBERT contributes to achieving a higher f-measure. Linear classifiers outperform the transformer-based models in such under-represented categories.

4 Exploring Self-attention Heads

Above we show that BERT-based models are able to produce high performance multilabel classifiers. However, we know little about what makes them good at this task. In this section, we inspect the self-attention mechanism underpinning such models as a key element to understanding this behavior.

According to Clark et al. [4], attention weights indicate how relevant a particular word is when computing the next representation of the current word. To illustrate this statement, Fig. 1 depicts the mean weights of the 12 self-attention heads in the last hidden state of the fine-tuned models for two papers, titled “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” and “A universal long-term flu vaccine may not prevent severe epidemics”. The plots clearly show the so-called vertical pattern [15], where a few tokens receive most of the attention, such as training, deep, transformer, language, and understanding in the first sentence, and flu, vaccine, prevent, severe and epidemic in the second. Note that, while the vocabulary captured by SciBERT includes the word bidirectional, BERT has to represent it with subwords.

We do not include the special tokens [SEP] and [CLS], since the amount of attention these tokens receive makes the attention received by the other tokens barely noticeable. Clark et al. [4] speculate that attention on [SEP] in a head could indicate that the attention head's function is not applicable, while Rogers et al. [20] interpret attention on [CLS] as attention on a pooled sentence-level representation.

From these two examples, we observe that the most attended words in the last hidden state are highly related to the research fields of the articles: Computer Science and Medical and Health Sciences. We therefore look into this relation and identify the words that receive the most vertical attention in the last hidden state for a subset of our dataset in which each first level category is represented with at most 500 papers. First, for each input sequence we calculate the mean weights of the 12 attention heads in the last hidden state. Next, we generate a new weight matrix, grouping subwords into words by averaging the subword weights. Finally, we gather the words with a vertical mean attention above the mean attention in the weight matrix. This results in 8,840 attended words for BERT, 17,773 for BioBERT, and 12,265 for SciBERT, corresponding to 16%, 32%, and 22% of the vocabulary managed by each language model. A code sketch of this procedure is shown after Fig. 1.

Fig. 1. Average weights in the self-attention heads of the last hidden state.
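
The following sketch illustrates the procedure just described, assuming a Hugging Face implementation; it is shown on the pre-trained bert-base-uncased checkpoint for brevity, whereas the paper analyzes the fine-tuned models, and the threshold here is the mean of the word-level received attention rather than of the full weight matrix.

```python
# Sketch: average the 12 heads of the last layer, merge sub-word weights into
# word weights, and keep words whose received (vertical) attention is above the mean.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

def attended_words(text):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**enc)
    att = out.attentions[-1][0].mean(dim=0)      # last layer, mean over heads: (seq, seq)
    received = att.mean(dim=0).numpy()           # mean attention each token receives
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    words, weights = [], []
    for tok, w in zip(tokens, received):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if tok.startswith("##") and words:       # glue sub-words back into words
            words[-1] += tok[2:]
            weights[-1].append(w)
        else:
            words.append(tok)
            weights.append([w])
    word_att = np.array([np.mean(ws) for ws in weights])
    return [w for w, a in zip(words, word_att) if a > word_att.mean()]

print(attended_words("A universal long-term flu vaccine may not prevent severe epidemics"))
```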

Table 3 shows the top 20 most frequently attended words in three research fields: Biology, Computer Science, and History and Archaeology. As can be noted, most of these words are highly related to the specific research field, appearing alongside a few punctuation marks and some stop words. While frequent attention to periods and commas was already reported in [4, 15], the reason why this happens is not yet clear. Rogers et al. [20] suggest that it may be related to model overparameterization, while Clark et al. [4] point to the high frequency of these tokens in the corpus. Stop words are also highly frequent, and the models could be learning to attend to them as in the case of punctuation marks.

Table 3. Most attended words above average attention in the fine-tuned models.

5 Feature Selection

In the previous section we show that fine-tuned BERT models concentrate their attention on a subset of the overall vocabulary that ranges between 16% and 32% of the words. Following this observation, we hypothesize that such attention on a selected fragment of the vocabulary is the transformer's version of feature selection. However, rather than picking the most interesting features for a classifier, self-attention selects words that heavily influence the representation of the rest of the words in the same sequence. We investigate whether there is a relation between feature selection algorithms commonly used for text classification and the most attended words in the fine-tuned language models.

We center our analysis on four feature selection methods used for text classification [14, 18, 22]: Chi-square (chi), Information Gain (ig), Document Frequency (df), and Categorical Proportional Difference (pd). Chi-square measures the lack of independence between a word and a class; its value is zero if the word and the class are independent. Information Gain measures the entropy reduction of the dataset when it is split by a feature value. Thus, words with larger information gain discriminate the data ensuring a lower entropy. Document Frequency counts the number of documents where a term appears. Categorical Proportional Difference measures the degree to which a word contributes to differentiating a particular category from others.
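
To make these baselines concrete, here is a hedged sketch using scikit-learn that scores terms against a single category's binary labels (the paper's multilabel setting would repeat this per category). Mutual information is used as a stand-in for information gain, document frequency is a simple count of the documents containing each term, and proportional difference is omitted.

```python
# Sketch: rank vocabulary terms with chi-square, (approximate) information gain,
# and document frequency for one binary category.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2, mutual_info_classif

def rank_features(texts, binary_labels, k=1000):
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(texts)
    terms = np.array(vec.get_feature_names_out())
    scores = {
        "chi": chi2(X, binary_labels)[0],                                     # word/class dependence
        "ig": mutual_info_classif(X, binary_labels, discrete_features=True),  # entropy reduction
        "df": np.asarray((X > 0).sum(axis=0)).ravel(),                        # document frequency
    }
    return {name: terms[np.argsort(s)[::-1][:k]] for name, s in scores.items()}
```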

We compare the most attended words with those selected by the above-mentioned feature selection methods, and we measure how similar the rankings of words sorted by their average attention are to the rankings produced by each feature selection method. In Table 4, we report the vocabulary overlap between the most attended words and the feature selection methods after filtering out stop words. The number of selected features was limited to the top k words, where k is the number of words attended above average by each language model. Indeed, the results indicate a large overlap. Fine-tuned language models for text classification attend to up to 64% of the common terms returned by df, the simplest of our feature selection baselines, which itself performs similarly to ig and chi [29]. For all three models, the most attended words have the largest overlap with document frequency, followed by information gain, chi-square and, finally, proportional difference.

Table 4. Word overlap: most attended vs. feature selection.
Fig. 2. Rank-biased overlap at different p values between most attended words and feature selection algorithms.

To measure the similarity between rankings we apply the Rank-Biased Overlap (RBO) [28] metric. RBO ranges from 0 to 1, from less to more similar, and was designed for non-conjoint rankings, i.e., the two lists may contain different items, may be incomplete, and may have different lengths. Through the parameter p, RBO models the probability of continuing to consider the overlap at the next rank, having examined the overlap at the previous rank. Figure 2 shows the RBO for the attention and feature selection rankings. We set p to 0.9, 0.99, 0.999, and 0.9999, so that the first 10, 100, 1,000, and 10,000 ranks, respectively, account for approximately 85% to 86% of the weight of the evaluation.
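
As a reference for how the metric behaves, below is a minimal, truncated RBO sketch; it omits the extrapolation that the full metric in [28] applies to uneven lists. The set agreement at each depth is discounted geometrically by p.

```python
# Sketch: truncated Rank-Biased Overlap between two ranked lists.
def rbo(ranking_a, ranking_b, p=0.9):
    depth = max(len(ranking_a), len(ranking_b))
    seen_a, seen_b, score = set(), set(), 0.0
    for d in range(1, depth + 1):
        if d <= len(ranking_a):
            seen_a.add(ranking_a[d - 1])
        if d <= len(ranking_b):
            seen_b.add(ranking_b[d - 1])
        agreement = len(seen_a & seen_b) / d      # overlap proportion at depth d
        score += (p ** (d - 1)) * agreement       # geometric, top-weighted discount
    return (1 - p) * score

# e.g. rbo(attention_ranking, ig_ranking, p=0.99)
```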

While the attended-word rankings of BERT and SciBERT are more similar to the ranking of discriminative words (ig) for p values from 0.9 to 0.999, they eventually converge with the ranking of common terms (df) as well. On the other hand, the BioBERT-1.1 ranking is clearly most similar to the common-term ranking (df). We think that the difference between the three models could be related to the subword vocabulary and pre-training corpus. Subword vocabularies are tightly related to the training corpus, since they are generated to represent the whole corpus with the minimum number of word pieces. BERT trains its own subword vocabulary on a general corpus and during fine-tuning learns to attend more to discriminative words in the scientific domain. SciBERT also uses its own vocabulary, trained on a corpus limited to scientific text, enabling the model to attend to discriminative words (like BERT) but also to common words due to the domain knowledge it encodes. BioBERT, on the other hand, reuses the BERT subword vocabulary, and therefore many scientific terms are split into a suboptimal number of pieces. This hampers the ability of the self-attention mechanism to focus on discriminative words and shifts its attention toward common terms.

5.1 Domain Knowledge

We investigate the domain relevance of the words that are most attended by the language models and compare it with that of the words produced by the feature selection methods. To this end, we search for the words in ConceptNet and leverage the HasContext relation to identify the domains in which they are commonly used. We manually map the 22 first level categories in ANZSRC to the corresponding concepts in ConceptNet. To deal with morphological variations like plurals and conjugations we use the FormOf relation, and to increase coverage we traverse the isA type hierarchy one level up, looking for the corresponding concept. For example, the word networking is a form (FormOf) of the root word network, which in turn has the contexts (HasContext) Computer Science and Electronics, and the concept Electronics is a type (isA) of Physics.
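
Such lookups can be reproduced, for instance, against the public ConceptNet REST API; the endpoint and response shape below follow the documentation at api.conceptnet.io and are an assumption on our part, not the authors' code.

```python
# Sketch: retrieve HasContext edges for a word from the ConceptNet REST API.
import requests

def has_context(word, lang="en"):
    url = f"http://api.conceptnet.io/query?start=/c/{lang}/{word}&rel=/r/HasContext"
    edges = requests.get(url).json().get("edges", [])
    return [edge["end"]["label"] for edge in edges]

# e.g. has_context("network") may include contexts such as "computer science" and "electronics"
```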

Table 5. Words per category matching the corresponding ConceptNet context.

For each first level category, we gather the top 100 most attended words, as well as those with the highest scores according to each feature selection method. Then, for each word, we look up the corresponding context in ConceptNet. Table 5 reports the domain relevance obtained for each category. In BERT and SciBERT, self-attention identifies more domain-relevant words than the feature selection methods. However, this is not the case for BioBERT. Recall that, in our sample dataset, the set of most attended words produced by BioBERT is the largest (32% of the vocabulary), a clear indication that the model spreads its attention more widely. When weighing the words by their term frequency (TF), attended words remain more domain-relevant than those obtained through feature selection. In fact, the domain relevance of the frequently attended words is greater than or on par with that of the words selected when TF/IDF is used to weigh the output of the feature selection methods: self-attention takes into account not only the importance of words in the document (TF) but also their importance in the document collection (IDF).

5.2 Feature Evaluation

To evaluate the quality of the resulting features, we measure their stability and their classification performance. Stability is the robustness of a feature subset generated from different training sets drawn from the same distribution [12]. To measure stability, we compute the mean Jaccard coefficient between the different subsets of words generated by each method. We apply 5-fold cross-validation and process each fold with the fine-tuned language models and the feature selection methods. Stability is reported in Table 6, where we can see that the language models attend to the same words across folds, with stability values in line with those of document frequency. Attended words are more stable than the features produced by the remaining selection methods, including chi-square and information gain, which seem to be more volatile across folds.
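
A small sketch of this stability measure, as we understand it: one word set per fold, and the mean Jaccard similarity over all pairs of folds.

```python
# Sketch: mean pairwise Jaccard similarity between per-fold feature sets.
from itertools import combinations

def stability(feature_sets):
    jaccard = lambda a, b: len(a & b) / len(a | b)
    pairs = list(combinations(map(set, feature_sets), 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# e.g. stability([fold1_words, fold2_words, fold3_words, fold4_words, fold5_words])
```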

Table 6. Stability of the features measured using the Jaccard similarity coefficient.

In addition, we use the feature sets to learn classifiers for the 22 first level categories using Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), Neural Networks (NN), and SVM. The neural network comprises an embedding matrix of 100 dimensions and a fully connected layer with a sigmoid activation function. For the SVM the regularization parameter is tuned, and for the remaining algorithms we use the recommended settings. We evaluate the classifiers using 5-fold cross validation on the subset of documents where each category is represented with up to 500 papers. The f-measure of the classifiers is shown in Fig. 3. In general, we observe that traditional feature selection methods like chi-square and information gain help to learn more accurate classifiers than the set of words most attended by the language models. This observation clearly indicates that the success of BERT models in this task is not only driven by the self-attention mechanism but also by the contextualized outputs of the transformer, which are the input to the added classification layer.
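
As an illustration of how a feature set can be evaluated in this way, here is a hedged sketch for one category and one of the listed learners (logistic regression), where the vectorizer vocabulary is restricted to the selected words; the other learners and categories would be evaluated analogously.

```python
# Sketch: train a per-category binary classifier on a restricted vocabulary
# and report its cross-validated f-measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_feature_set(texts, binary_labels, selected_words):
    vec = TfidfVectorizer(vocabulary=sorted(set(selected_words)))  # only the chosen features
    X = vec.fit_transform(texts)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, binary_labels, cv=5, scoring="f1").mean()

# e.g. evaluate_feature_set(abstracts, is_computer_science, attended_words_cs)
```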

Fig. 3. Classifier performance using distinct feature sets and numbers of features.

6 Conclusions

In this paper, we investigate the self-attention mechanism of BERT in a fine-tuning scenario for the classification of scientific articles over a taxonomy of research fields. We observe that attention in the fine-tuned model is focused on words that are highly relevant to the research field of each article. Furthermore, we notice that the most attended words represent just a fraction of the whole vocabulary: a hint that self-attention performs a sort of feature selection.

We systematically compare the most attended words against those resulting from feature selection methods normally used in text classification. We show that the language models and feature selection methods like information gain and chi-square share between 42% and 55% of the selected words. We also observe that the attention-based word rankings produced by the transformers are most similar to those obtained using document frequency and information gain.

From our experiments we conclude that self-attention focuses more on words that are relevant to each research domain than conventional feature selection does. However, self-attention is not as good for learning classifiers from scratch, especially compared to chi-square and information gain. While self-attention identifies domain-relevant terms, the discriminatory information in the fine-tuned model is encoded in the output representations and the additional classification layer. As future work, we plan to investigate the impact of integrating, perhaps as part of the loss function, optimal feature selection methods during the fine-tuning of transformers for single-sequence classification.