Introduction

Relation extraction (RE) is a crucial component of natural language processing (NLP) and serves as a vital link between upstream tasks, such as named entity recognition (NER) [2, 3] and entity linking (EL) [4], and downstream tasks, such as event extraction (EE) [1] and knowledge graph construction. Given a predefined set of relations, the objective of RE is to identify the relation that holds between two entities within a given text. Three instances are depicted in Fig. 1, where the subjects and objects are marked in the sentence, and the labels of the three instances belong to three different categories.

Fig. 1 An example including three types of relations in the RE task

Fig. 2 Text similarity between the relations from the TACRED dataset. a Including the "no relation" category. b Excluding the "no relation" category

Pre-trained language models (PLMs) based on the Transformer architecture [5,6,7], supported by large training corpora, have shown remarkable performance in representing long sentences across a diverse array of NLP tasks. Particularly in supervised sentence-level RE, researchers have proposed models that incorporate PLMs, whose performance is far superior to that of models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) [8,9,10,11]. Leveraging the information available in the dataset is the key step of the RE task. Most prior works have focused on developing efficient ways to utilize the textual information of a sentence [12]. To this end, entity masking [10] has been proposed as a technique to leverage the entity information present in the text, and its effectiveness has been remarkable. However, researchers have overlooked the wealth of semantic information conveyed by labels, which can differ significantly across label categories [13]. This label semantics plays a crucial role when the data are imbalanced. To investigate whether different labels carry sufficiently distinct textual semantics to strictly differentiate them from one another, the textual semantics of the dataset's label set are computed with SBERT [14]. The semantic similarity among the 11 relations in the TACRED dataset is shown in Fig. 2. From Fig. 2a, it can be observed that both "per" and "org" type labels have low semantic similarity to the "no relation" category. Furthermore, Fig. 2b shows that the semantic similarity between the "per" and "org" types is also relatively small. It can be inferred from this experiment that there exist significant dissimilarities in the textual semantics of the labels.
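As an illustration of the label-similarity check described above, the following sketch computes pairwise cosine similarities between relation-label texts with Sentence-BERT. The checkpoint name and the small label subset are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: measuring textual similarity between relation labels with SBERT.
# The checkpoint and the label subset below are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

labels = [
    "no relation",
    "per title",
    "per age",
    "org founded by",
    "org top members employees",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT checkpoint works
embeddings = model.encode(labels, convert_to_tensor=True, normalize_embeddings=True)

# Pairwise cosine similarity matrix between the label texts.
similarity = util.cos_sim(embeddings, embeddings)
for i, li in enumerate(labels):
    for j, lj in enumerate(labels):
        if j > i:
            print(f"{li!r} vs {lj!r}: {similarity[i, j].item():.3f}")
```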

Notably, both Nayak et al. [15] and Mondal et al. [16] observed that contextual information is essential for representing sentences. Peng et al. [17] noted that the model may acquire certain surface information of the dataset through entity mentions, thereby impeding the model's contextual comprehension. They evaluated the efficacy of combining contextual information with entity mentions and contrasted it with approaches that relied solely on either entity mentions or contextual information. Empirical evidence [18] from the classification task demonstrated that the combination of context and entity mentions outperformed the other two methodologies. Therefore, in this paper, a fusion of context and entity mentions is adopted, while weighted layer pooling is simultaneously utilized to augment the contextual representation, thereby maximizing the use of the information conveyed by the sequences themselves.

SimCSE [19] shows that contrastive learning can be highly effective for unsupervised text representation, as it can accurately capture differences between samples. Supervised contrastive learning [20] proposes a loss that brings contrastive learning from the unsupervised domain into the supervised field; the connection between the supervised contrastive loss and the triplet loss is also explored.

Peng et al. [17] redesign the pre-training task of PLMs with the aim of ensuring that sentences sharing similar relations exhibit analogous representations, while those with different relations manifest distinct disparities. This approach combines a contrastive pre-training task with the masked language modeling task in the overall training objective. Its clear advantage is that it partially bridges the gap between PLM pre-training tasks and relation extraction tasks. However, it comes at the cost of increased computational resources and suffers from low reusability, necessitating the re-pre-training of the entire model for each PLM type and rendering it unsuitable for modular application across multiple domains.

CLERE (Contrastive Learning and Enriched representation for Relation Extraction) opts against full re-pre-training, instead integrating contrastive learning directly into the training phase. Moreover, to address the gap between pre-training and RE tasks, CLERE considers refined strategies for selecting input sentences and labeling entity embeddings. Additionally, it delves into the structural intricacies of PLMs at each layer, analyzing their compositions and scrutinizing the combination of embeddings obtained at various layers to yield a more nuanced and semantically rich representation. During inference, the candidate label closest to the sequence representation is selected as the final result. CLERE enhances the performance of existing RE models and incorporates the valuable information provided by labels, making it particularly effective in scenarios with data imbalance.

To assess the efficacy of CLERE, experiments are conducted on three supervised RE datasets, utilizing BERT-base and Roberta-large as PLMs. The results demonstrate that CLERE surpasses the baseline models. Additionally, the reasons behind the model's success are investigated. To summarize, our contributions can be outlined as follows.

  • The combination of semantic information of labels with contextual information is explored, and considerable differences in the semantic information between the different labels are found. Additionally, the role of pooling strategies in generating contextual semantic embeddings is explored.

  • A relation extraction model is proposed that enhances the embedding ability of PLMs and applies the concept of contrastive learning to leverage label semantic embeddings. This brings the semantics of the text closer to the positive labels and pushes it away from the negative ones.

  • Experiments are conducted on three public datasets, on which CLERE achieves above-baseline results, with higher recall and F1 scores than baselines using the same PLMs.

Related Work

Supervised RE

Supervised RE is a well-researched area within NLP, and early methods used primarily feature-based and kernel-based approaches. Feature-based approaches [21] involve the design of features for entities and their corresponding contexts, including lexicon, syntax, and semantics, which are then fed into an entity-relation classifier. With the advent of SVM, kernel-based approaches have also received considerable attention, with kernel functions designed to obtain similarities between relation representations and text instances. However, feature-based methods heavily rely on manually crafted features, which require researchers to possess domain-specific background knowledge. Kernel-based methods require the use of natural language processing toolkits to transform input text into syntactic dependency trees, which can result in a relatively high probability of error propagation.

Deep learning-based methods have also been used for supervised RE. Liu et al. [22] were among the first to use CNNs for this task, but their method still requires the use of NLP toolkits. The idea of entity position embedding was introduced by Zeng et al. [23], which later served as the foundation for entity awareness. However, the use of fixed-size convolution kernels in this method resulted in the loss of global features. To address this issue, Nguyen et al. [24] utilized convolution kernels of multiple sizes, which focused on both local and global features. Zhang et al. [25] employed Bi-RNN for RE tasks. To mitigate the gradient problems of RNNs, Xu et al. [26] proposed a model with an LSTM structure, which proved effective in extracting sentence-level features in RE tasks. However, the features extracted by this model are still insufficient to achieve optimal performance.

With the advent of PLMs, the landscape of NLP has been revolutionized. This progress has been propelled by the introduction of the Transformer architecture, which features a self-attention mechanism. Among these models, BERT [6], trained on a large corpus, has demonstrated an unparalleled ability to capture textual features. The majority of RE models based on PLMs utilize BERT or one of its variants [27]. These models can be broadly classified into two main categories. The first is to revamp the pre-training task by enhancing the internal structure of BERT. Roberta [7] uses a larger dataset and a novel dynamic masking technique to provide a higher level of understanding of sentence context, leading to better performance on multiple NLP tasks including RE. KnowBERT [28] introduces an external knowledge base and improves the training objectives of BERT by constructing the entities in the knowledge base as triples, thus achieving even more advanced performance in text understanding. LUKE [29] has made a significant breakthrough in the field of PLMs by enhancing entity perception on top of BERT. By incorporating entity types and attributes into the representation process, LUKE has achieved performance that outperforms BERT on tasks such as entity perception and question answering. In general, the advantage of this approach lies in its ability to facilitate the learning of task-specific language representations in a more directed manner, thus circumventing overfitting during subsequent fine-tuning and enhancing the generalization performance of the model. Nonetheless, the downside of this method is apparent: the redesign of the pre-training task is computationally expensive and poses greater demands on the model structure and training process design. Moreover, pre-training tasks that are tailored to specific domains may only be suitable for a particular task and cannot be extrapolated to other tasks.

The second approach, fine-tuning, is widely employed today. In this approach, task-specific components are added to the PLMs, enabling the model to achieve advanced performance without requiring further pre-training. R-BERT [11] is an advanced model for relation extraction based on the BERT architecture. It enhances BERT's ability to model relationships by introducing token-level relational representations. MTB [10] suggests that using partial embeddings of entities can achieve an even better entity representation for RE. REDN [30] argues that the relationship is determined by the relevance of the subject and object entities, and that the representation of the relationship should be a matrix rather than a one-dimensional array; a corresponding loss function is also proposed in REDN. The advantage of this approach is that high performance can be achieved by only designing fine-tuning modules for PLMs while consuming fewer computational resources. The disadvantage is that a large task-specific training dataset is required, which must have significantly more domain-specific properties than the pre-training dataset. Additionally, the fine-tuned model is prone to overfitting when the fine-tuning dataset differs significantly from the pre-training dataset.

Contrastive Learning

Contrastive learning has become a mainstream unsupervised learning method in recent years. Assume that \(\alpha _i\) and \(\alpha _i^+\) are semantically related, and let \(R_i\) and \(R_i^+\) denote the representations of \(\alpha _i\) and \(\alpha _i^+\). With a mini-batch of N pairs (\(\alpha _i\), \(\alpha _i^+\)), the training goal is

$$\begin{aligned} ConLoss=-\log \frac{e^{{\text {sim}}\left( \textbf{R}_{i}, \textbf{R}_{i}^{+}\right) / \tau }}{\sum _{j=1}^{N} e^{{\text {sim}}\left( \textbf{R}_{i}, \textbf{R}_{j}^{+}\right) / \tau }}, \end{aligned}$$
(1)

where \(\tau \) is a temperature hyperparameter and \({\text {sim}}(\textbf{R}_{1}, \textbf{R}_{2})\) denotes the cosine similarity \(\frac{\textbf{R}_{1}^{\top } \textbf{R}_{2}}{\left\| \textbf{R}_{1}\right\| \cdot \left\| \textbf{R}_{2}\right\| }\).
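A minimal PyTorch sketch of the in-batch contrastive objective in Eq. (1), assuming the representations \(R_i\) and \(R_i^+\) are already available as row-aligned matrices; the function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(R: torch.Tensor, R_pos: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss of Eq. (1): row i of R_pos is the positive of row i of R,
    and all other rows in the batch act as negatives."""
    R = F.normalize(R, dim=-1)
    R_pos = F.normalize(R_pos, dim=-1)
    # sim[i, j] = cosine similarity between R_i and R_j^+ scaled by the temperature.
    sim = R @ R_pos.T / tau
    targets = torch.arange(R.size(0), device=R.device)
    # Cross-entropy over each row equals -log softmax at the diagonal (positive) entry.
    return F.cross_entropy(sim, targets)

# Usage with random embeddings standing in for encoder outputs:
loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```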

Hadsell et al. [31] proposed an algorithm for "learning comparable distances," which maps samples of the same category to a tight space and samples of different categories to a more distant space. Contrastive learning methods have since evolved around the contrastive loss, which learns discriminative feature representations by minimizing the distance between similar samples and maximizing the distance between dissimilar samples. In NLP, contrastive learning methods such as RankCSE [32], SimCSE [19], and BERT-CL [33] have been used to address text similarity matching and to enhance BERT's representations through contrastive learning. In the RE field, Peng et al. [17] propose a contrastive learning framework with entity mentions, in which examples defined to be adjacent are clustered together and those that are not are pushed apart; this model's training objective combines the contrastive learning objective with the masked language modeling objective. Finally, Chen et al. [34] apply the contrastive learning idea to distantly supervised relation extraction, further demonstrating the versatility and potential of contrastive learning in NLP.

Furthermore, Khosla et al. [20] have noted that triplet loss represents a particular instance of contrastive loss in which only one positive and one negative sample are utilized. Unlike the standard contrastive loss, triplet loss operates on triplets consisting of an anchor, a positive, and a negative sample. Its objective is to minimize the distance between the anchor and the positive sample while simultaneously maximizing the distance between the anchor and the negative sample. Consequently, triplet loss aims to ensure that the distance between the anchor and positive samples is smaller than that between the anchor and negative samples by at least a specified margin, failing which incurs a loss penalty. In contrast, N-Pair loss is more suitable for scenarios involving one positive and N negatives, or N positives and N negatives. As an extension of triplet loss, N-Pair loss leverages information from multiple negative samples in each update, aiming to guarantee that the embedding of the current sample is distinctly distant from all types of negative samples. However, when the number of negative samples is substantial, the model may encounter challenges in convergence or become susceptible to local optima. Moreover, the computational cost of N-Pair loss grows with the number of negative samples compared to triplet loss, since it requires computing the similarity score between the anchor sample and every negative sample.

Fig. 3 The framework of CLERE consists of an input layer, an embedding layer, a loss learning layer, and an inference layer

Fig. 4 An example of an input sentence using typed entity markers (punct)

CLERE

Overview

The model framework is illustrated in Fig. 3. The training and inference process of CLERE is divided into two steps. In step 1, the training process, the model input is divided into three parts: the sentence, a positive label, and a negative label. The positions of entities in the sentence are marked via entity mentions by placing special markers ("#" and "$") around the entities. The positive label is the original label of the instance, while the negative label is randomly selected from the label set; both are treated as text. PLMs are employed to encode these three parts. The encoded instance serves as the anchor, the positive label as the "pos" term, and the negative label as the "neg" term. The training objective is to minimize the distance between the anchor and the "pos" term while maximizing the distance between the anchor and the "neg" term. In step 2, the inference process, the model selects the label with the shortest distance to the anchor as the prediction.

Problem Descriptions

This work focuses on supervised RE at the sentence level. Specifically, given an instance that contains a sentence X together with the locations and entity types of the subject and object, the task is to determine which predefined relation the entity pair belongs to. In other words, this is a classification task that aims to select the most appropriate label from a set of predefined relation types.

Input Embedding

Sentence Embedding

In contrast to other NLP tasks, sentence embedding for RE focuses on maximizing the amount of information pertaining to the entities within the sentence. Building on previous work [10], the typed entity markers (punct) technique [8] is utilized to represent the entities. Specifically, the "#" marker denotes the subject entity and the "$" marker denotes the object entity. Additionally, the entity type information is inserted in textual form, with "\(*\)" and "\(\wedge \)" marking the positions of the subject and object types. To capture the sequence semantics of the sentence, the "[CLS]" and "[SEP]" tokens are also added. The final input sentence to the PLMs takes the following form: [CLS]... # \(*\) subj_type \(*\) SUBJ #... $ \(\wedge \) obj_type \(\wedge \) OBJ $.... [SEP] (Fig. 4).
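The sketch below illustrates the typed entity markers (punct) format described above; the helper name, the token-span convention, and the example sentence are assumptions made for illustration rather than the authors' preprocessing code.

```python
def mark_entities(tokens, subj_span, subj_type, obj_span, obj_type):
    """Insert typed entity markers (punct) around the subject ('#', '*') and the
    object ('$', '^') spans; spans are (start, end) token indices, end exclusive.
    A sketch of the input format only, not the paper's exact preprocessing."""
    ss, se = subj_span
    os_, oe = obj_span
    out = []
    for i, tok in enumerate(tokens):
        if i == ss:
            out += ["#", "*", subj_type, "*"]
        if i == os_:
            out += ["$", "^", obj_type, "^"]
        out.append(tok)
        if i == se - 1:
            out.append("#")
        if i == oe - 1:
            out.append("$")
    return ["[CLS]"] + out + ["[SEP]"]

tokens = "Bill Gates founded Microsoft in 1975".split()
print(" ".join(mark_entities(tokens, (0, 2), "PERSON", (3, 4), "ORGANIZATION")))
# [CLS] # * PERSON * Bill Gates # founded $ ^ ORGANIZATION ^ Microsoft $ in 1975 [SEP]
```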

Fig. 5 a Triplet loss: it seeks to push Neg outside the circle defined by the margin and pull Pos inside. b Loss function display

After pre-processing, the input sentences are fed into PLMs to obtain embeddings. Peng et al. [17] suggest that combining the sentence's contextual embedding with entity mentions provides sufficient embedding information for RE, whereas using only entity mentions may lead the model to take shortcuts. Therefore, this study combines the contextual embedding with the entity mentions. It is common practice to use the embedding of "[CLS]" as the sentence context. However, the contextual embedding obtained this way is flawed, as it cannot capture the complete semantics of the sentence; the advantages and disadvantages of this approach are discussed in Section "Experiments Analysis." In this paper, the preprocessed sentence is first input into the PLM to generate a set of hidden states X across the layers of the PLM as

$$\begin{aligned} X = PLM(sentence). \end{aligned}$$
(2)

Then, the weighted layer pooling [35] method is utilized to obtain contextual embeddings as

$$\begin{aligned} Seq = \frac{\sum _{i=1}^N \omega _{i} x_{i}}{\sum _{i=1}^N \omega _{i}}, \end{aligned}$$
(3)

where \(x_i\in X, i\in (1, N)\), and N refers to the number of layers selected from the PLM; in this paper, N = 4. The weight parameter \(\omega _{i}\) is a learnable parameter that is initialized as a random matrix drawn from a uniform distribution.
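A sketch of the weighted layer pooling in Eq. (3), assuming the hidden states come from a HuggingFace model called with output_hidden_states=True; the module name follows common usage, the uniform initialization follows the description above, and the remaining details are assumptions.

```python
import torch
import torch.nn as nn

class WeightedLayerPooling(nn.Module):
    """Eq. (3): a learnable weighted average over the last `num_layers` hidden states."""
    def __init__(self, num_layers: int = 4):
        super().__init__()
        self.num_layers = num_layers
        # One learnable weight per selected layer, initialized from a uniform distribution.
        self.layer_weights = nn.Parameter(torch.rand(num_layers))

    def forward(self, all_hidden_states):
        # all_hidden_states: tuple of (batch, seq_len, hidden) tensors from the PLM,
        # e.g. outputs.hidden_states when output_hidden_states=True.
        x = torch.stack(all_hidden_states[-self.num_layers:], dim=0)  # (N, B, L, H)
        w = self.layer_weights.view(-1, 1, 1, 1)
        return (w * x).sum(dim=0) / w.sum()                           # (B, L, H)
```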

After obtaining the contextual embedding of the sequence through weighted layer pooling, the embeddings \(H_1\) and \(H_2\) of the two entity mentions in the sentence are extracted and concatenated with the contextual embedding. The concatenated embedding is fed into a fully connected layer followed by an activation function, yielding the anchor An as

$$\begin{aligned} An = LeakyReLU(W_f(concat(Seq, H_1, H_2))), \end{aligned}$$
(4)

where \(W_f \in \textbf{R}^{d \times 3d}\) (d is the hidden state size of PLMs).
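A sketch of the anchor construction in Eq. (4). Mean pooling of the weighted-layer output into a single context vector and reading \(H_1\), \(H_2\) at the entity-marker positions are assumptions for illustration; the paper specifies the concatenation, the fully connected layer \(W_f\), and the LeakyReLU activation.

```python
import torch
import torch.nn as nn

class AnchorHead(nn.Module):
    """Eq. (4): concatenate the pooled context with the two entity embeddings,
    then project with a fully connected layer and a LeakyReLU activation."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.fc = nn.Linear(3 * hidden_size, hidden_size)  # W_f in R^{d x 3d}
        self.act = nn.LeakyReLU()

    def forward(self, seq_states, subj_idx, obj_idx, attention_mask):
        # seq_states: (B, L, H) output of weighted layer pooling.
        mask = attention_mask.unsqueeze(-1).float()
        seq = (seq_states * mask).sum(1) / mask.sum(1)        # assumed mean pooling -> (B, H)
        batch = torch.arange(seq_states.size(0), device=seq_states.device)
        h1 = seq_states[batch, subj_idx]                      # subject marker embedding H_1
        h2 = seq_states[batch, obj_idx]                       # object marker embedding H_2
        return self.act(self.fc(torch.cat([seq, h1, h2], dim=-1)))  # anchor An
```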

Label Embedding

During the training step, to obtain the embeddings of the label text, the label is treated as normal text, i.e., the special characters ("/" and "_") are removed from the labels. Regarding the selection of positive and negative label pairs, the label of the sample itself is selected as the positive label \(y \in Y\), and a label \(\tilde{y} \in Y/y\) is randomly selected as the negative label. The label pair [y, \(\tilde{y}\)] is then fed into PLMs to obtain the label embeddings as

$$\begin{aligned} Pos, Neg = PLMs(y, \tilde{y}). \end{aligned}$$
(5)
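A sketch of the label-embedding step in Eq. (5): the label texts are cleaned and encoded with the same PLM, with the gold label as Pos and a randomly drawn different label as Neg. Mean pooling over the last hidden state and the function signature are assumptions made for this sketch.

```python
import random

def encode_labels(tokenizer, plm, gold_label, label_set, device="cpu"):
    """Eq. (5): treat labels as plain text, strip '/' and '_', and encode the gold
    label as Pos and a randomly drawn different label as Neg with the same PLM.
    Mean pooling over the last hidden state is an assumption of this sketch."""
    clean = lambda s: s.replace("/", " ").replace("_", " ").strip()
    neg_label = random.choice([l for l in label_set if l != gold_label])
    texts = [clean(gold_label), clean(neg_label)]

    batch = tokenizer(texts, padding=True, return_tensors="pt").to(device)
    hidden = plm(**batch).last_hidden_state                  # (2, L, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(1) / mask.sum(1)            # (2, H)
    pos, neg = pooled[0], pooled[1]
    return pos, neg
```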

Training Objective

After the An, Pos, and Neg embedding matrices are obtained, the triplet loss function is utilized to minimize the semantic distance between the input and its positive label while increasing the distance between the input and the negative label in the embedding space. This enhances the semantic correlation between An and Pos during the inference step, as illustrated in Fig. 5a. The distance between An and Pos is referred to as pos_dist, while the distance between An and Neg is referred to as neg_dist. Figure 5b depicts the variation of the loss with respect to pos_dist and neg_dist; the loss is smallest when pos_dist is at its minimum and neg_dist is at its maximum.

For an anchor with its corresponding positive and negative instances, the loss function is formulated as in (7). The cosine similarity between An and Pos and the cosine similarity between An and Neg are computed as in (6):

$$\begin{aligned} \text {C(An, Pos)} = \frac{An\cdot Pos}{\Vert An\Vert \Vert Pos\Vert }, \text {C(An, Neg)} = \frac{An\cdot Neg}{\Vert An\Vert \Vert Neg\Vert }. \end{aligned}$$
(6)

A margin parameter is incorporated to regulate the degree to which the similarity between An and Pos must exceed that between An and Neg, thus preventing the model from taking shortcuts. During the training process, the objective of the model is to minimize the triplet loss and acquire the optimal embedding for the given data.

$$\begin{aligned} \mathcal {L}_{a, p, n} = \max (C(An, Neg) - C(An, Pos) + \text {margin}, 0). \end{aligned}$$
(7)

The comprehensive training objective of the model is demonstrated in (8):

$$\begin{aligned} \mathcal {L}_{all}=\frac{1}{N_{t}} \sum _{a \in A } \sum _{p \in P_{t}} \mathcal {L}_{a, p, n}, \end{aligned}$$
(8)

where A refers to the set of training instances, \(P_t\) denotes the set of positive labels, and \(N_t\) represents the total count of unique pairs comprising the training sentences and their corresponding positive labels.
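A sketch of the cosine-based triplet objective of Eqs. (6)-(8), written so that the positive label must be more similar to the anchor than the negative label by at least the margin, then averaged over the batch; the function name and default margin value are illustrative.

```python
import torch
import torch.nn.functional as F

def triplet_cosine_loss(anchor, pos, neg, margin: float = 0.15):
    """Eqs. (6)-(7): cosine similarities C(An, Pos) and C(An, Neg), followed by a
    hinge requiring the positive to be more similar than the negative by `margin`."""
    sim_pos = F.cosine_similarity(anchor, pos, dim=-1)   # C(An, Pos)
    sim_neg = F.cosine_similarity(anchor, neg, dim=-1)   # C(An, Neg)
    per_example = torch.clamp(sim_neg - sim_pos + margin, min=0.0)
    return per_example.mean()                            # Eq. (8): average over the batch
```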

In the inference phase, since the set of relations in the dataset is fixed, no relation appears in the validation or test sets that did not appear in the training set. Using the model's semantic similarity calculation, the candidate label closest to the anchor is taken as the prediction result. To better illustrate the steps of the CLERE task, Algorithm 1 summarizes the training and inference procedures of CLERE.

Algorithm 1 The training and inference process of CLERE
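The inference rule summarized in Algorithm 1 can be sketched as follows: encode each candidate label once, then return the label whose embedding is most similar to the anchor. The function name and the random stand-in tensors are assumptions.

```python
import torch
import torch.nn.functional as F

def predict_relation(anchor: torch.Tensor, label_names, label_embeddings: torch.Tensor) -> str:
    """Pick the candidate label closest (highest cosine similarity) to the anchor.
    `label_embeddings` holds one PLM-encoded vector per entry of `label_names`."""
    sims = F.cosine_similarity(anchor.unsqueeze(0), label_embeddings, dim=-1)
    return label_names[int(sims.argmax())]

# Usage with random stand-ins for the real encodings:
labels = ["no relation", "per title", "org founded by"]
prediction = predict_relation(torch.randn(768), labels, torch.randn(3, 768))
```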

Experiments

This section introduces the datasets used, alongside the experimental parameter configurations, metrics, and baseline models against which comparisons are made. Subsequently, the experimental results of the proposed approach are presented. Finally, the analysis of the observed outcomes is summarized.

Dataset

Three versions of the TACRED [36] dataset are used to evaluate CLERE: the original TACRED dataset, the TACREV [37] dataset, and the RE-TACRED [38] dataset. The particulars of these datasets are given in Table 1.

With 42 relations (including "no_relation"), the TACRED dataset is one of the most extensive datasets used for supervised RE, and it is worth noting that the vast majority of instances in the dataset are labeled "no_relation."

TACREV is a modified version of TACRED, in which some of the errors in the validation and test sets of the TACRED dataset have been corrected, while the training set remains unchanged. Forty-two relations are retained in TACREV.

RE-TACRED is another version of the TACRED dataset that addresses some of the shortcomings of the original version by reconstructing its training, validation, and test splits. RE-TACRED goes further and retains only 40 relations.

Table 1 Statistics of the different datasets

Baselines

To assess the effectiveness of CLERE, a diverse set of existing approaches is compared with CLERE, including CNN-based [22, 23], RNN-based, GCN-based, and Transformers-based methods. PA-LSTM [36] combines a bi-directional LSTM sequence model with entity location-aware attention. C-GCN [39] utilizes GCNs [40] to encode sentences with dependency structures and predicts relations based on them. The C-GCN shows that dependency-based and sequence-based models have a complementary role. SpanBERT [41] is a PLM that builds on BERT [6] by enhancing the masking process for contiguous entities, removing the next sentence prediction (NSP) task, and introducing the span boundary objective (SBO) training target. These improvements have resulted in a significant enhancement over BERT for extractive tasks. KnowBERT [28] improves upon BERT by integrating multiple real-world knowledge bases into its pre-training process, with the aim of enhancing its coverage of real-world knowledge. As a result, KnowBERT exhibits superior performance in downstream tasks such as entity extraction, RE, and disambiguation. LUKE [29] is a specialized representation designed for entity-related tasks that incorporate an entity-aware self-attention mechanism. This attention mechanism allows the model to focus more on the entities in the corpus, resulting in superior performance on downstream tasks. Both MTB [10] and RIB [8] demonstrate that the quality of representations generated by PLMs can be further improved by utilizing only the links between entities.

Table 2 Hyperparameters in experiments

Model Configuration and Metrics

To ensure a fair comparison with other models, the official publicly available code provided in the papers is utilized, while adhering to the recommended hyperparameters. CLERE is implemented using the Transformers package of HuggingFace, and training is performed using the Adam optimizer. All experimental results are reported as the average of 5 experiments using different random seeds. The details of the experimental hyperparameters are shown in Table 2.

To evaluate the performance of CLERE, micro F1 (11) is adopted as the metric, which is commonly used in previous works. Micro F1 takes into account the precision (9) and recall (10) of the classifier and is used to evaluate the overall performance of multi-class classification problems. Micro F1 aggregates the true positives, false positives, and false negatives over all classes before computing precision and recall, so that each instance has an equal impact on the overall result.

$$\begin{aligned} Precision= & \frac{\sum _i TP_i}{\sum _i (TP_i + FP_i)},\end{aligned}$$
(9)
$$\begin{aligned} Recall= & \frac{\sum _i TP_i}{\sum _i (TP_i + FN_i)},\end{aligned}$$
(10)
$$\begin{aligned} F1= & 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}. \end{aligned}$$
(11)
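A sketch of the micro-averaged metrics in Eqs. (9)-(11), pooling true positives, false positives, and false negatives over all classes. Excluding the "no_relation" class from scoring, shown here via the ignore_label argument, follows common TACRED evaluation practice and is an assumption of this sketch.

```python
def micro_prf(gold, pred, ignore_label="no_relation"):
    """Eqs. (9)-(11): micro precision/recall/F1, pooling TP/FP/FN over all classes.
    Excluding `ignore_label` from scoring mirrors common TACRED evaluation practice."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != ignore_label and p == g:
            tp += 1
        if p != ignore_label and p != g:
            fp += 1
        if g != ignore_label and p != g:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(micro_prf(["per:title", "no_relation", "org:founded_by"],
                ["per:title", "per:age", "no_relation"]))
```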
Table 3 Precision, recall, and F1 (in %) on the TACRED and TACREV datasets

Main Results

Table 3 shows the results of CLERE on the TACRED and TACREV datasets. Table 4 shows the results of CLERE on the RE-TACRED dataset.

CLERE outperforms these baselines on two of the datasets and matches the current state of the art on the third, attaining F1-scores of 74.9% on TACRED, 83.9% on TACREV, and 91.1% on RE-TACRED. Compared with Transformer-based methods such as SpanBERT, KnowBERT, and LUKE, CLERE still achieves superior performance without using any extended dataset or further pre-training steps. To validate the robustness of CLERE even with a modest-sized PLM, BERT-base is selected as the PLM and experiments are performed on the same datasets against baselines using the same PLM (including KnowBERT and SpanBERT); CLERE still outperforms them on TACRED and TACREV and achieves the highest recall on RE-TACRED.

In addition, the results show that CLERE has higher recall than the fine-tuned models when using the same PLM, which supports our choice of introducing label information into the training process and using contrastive learning to mitigate data imbalance. The models that use re-pre-training and the fine-tuned models achieve higher precision than CLERE, but lower recall and F1 scores, again illustrating the advantages of CLERE. Overall, the efficacy of CLERE is well demonstrated by the above experimental results.

Table 4 Precision, recall, and F1 (in %) on the RE-TACRED dataset

Experiments Analysis

In this section, an analysis is conducted to determine why the model's pooling strategy has a positive impact on performance. Furthermore, the performance of CLERE on imbalanced datasets and the selection of the margin parameter in the triplet loss function are discussed. Lastly, a case study is presented to illustrate the inference steps of CLERE.

Ablation Experiments

Transformers are frequently fine-tuned by incorporating an additional output layer for downstream tasks or models. Researchers typically use the final-layer representations of PLMs as the default input to downstream tasks or models. However, PLMs are multi-layer models, and different layers capture representations at different levels, exhibiting feature information at different granularities. A pivotal point in fine-tuning is therefore to obtain the most useful feature information from each level, which varies with the downstream task. Figure 6 depicts the self-attention distribution within several self-attention layers of a PLM (using BERT-base as an example). The illustration of attention distribution enables a refined scrutiny of the model's allocation of attention to various segments of the input text within each self-attention layer [42, 43], offering insight into the model's underlying mechanisms.

Fig. 6 Self-attention distribution of layers from BERT-base

The attention of individual tokens to each other is depicted in the diagram, with the thickness of the lines indicating the corresponding attention values, as shown in Fig. 6. Notably, a relatively even attention distribution among tokens is observed in the first layer. However, by the second layer, attention becomes notably concentrated on the "[CLS]" token, with attentional allocation shifting towards the "[SEP]" token by the seventh layer. In the final layer, the most significant attention is placed on three tokens, indicating a continual shift in attentional allocation throughout training. This observation highlights the inadequacy of relying solely on the last layer's hidden state as the contextual embedding of the sequence. In our experiments, it is found that considerable attention is placed on more than a single token in the last four layers of PLMs. This led us to weighted layer pooling, where the contextual representation of the sequence is obtained by combining multiple layers of hidden states, a choice that is validated experimentally. Table 5 shows that the weighted layer pooling strategy over the last 4 layers consistently outperforms the other three strategies across all three datasets, demonstrating the effectiveness of this approach for contextual embedding. The ablation experiments also highlight the importance of carefully selecting pooling strategies to achieve optimal performance in NLP tasks.
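The layer-wise attention distributions discussed above can be inspected directly with HuggingFace Transformers by requesting attention weights; the checkpoint, example sentence, and averaging over heads are illustrative choices rather than the exact visualization setup behind Fig. 6.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load BERT-base and ask it to return per-layer attention weights.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("Bill Gates founded Microsoft .", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# outputs.attentions: tuple of 12 tensors, each (batch, heads, seq_len, seq_len).
for layer in (0, 1, 6, 11):                      # first, second, seventh, last layer
    attn = outputs.attentions[layer][0].mean(0)  # average over heads -> (seq_len, seq_len)
    top = attn.sum(0).argmax()                   # token receiving the most total attention
    print(f"layer {layer + 1}: most attended token = {tokens[int(top)]}")
```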

Table 5 Ablation experiments on three datasets, using various pooling strategies (F1 scores in %)

Performance on Imbalanced Dataset

Consider the recall performance of the proposed model on the three datasets. Figure 7 provides a clear comparison: using BERT-base as the PLM, CLERE has better recall than most other models, and when using Roberta-large, CLERE has the highest recall among the compared models. This indicates that CLERE can successfully capture a significant portion of the target categories and can effectively identify the target samples, suggesting that CLERE is effective in handling imbalanced datasets. The model's robustness is also verified by evaluating its F1 score in conjunction with its recall.

Fig. 7 Recall of the models on three datasets. a Recall on the TACRED dataset. b Recall on the TACREV dataset. c Recall on the RE-TACRED dataset

Fig. 8 The sensitivity of the margin. Precision, recall, and F1-score on three datasets: a on the TACRED dataset, b on the TACREV dataset, c on the RE-TACRED dataset

Sensitivity of Margin

The sensitivity of the margin, an essential parameter of the triplet loss, is explored. It controls how strongly the model must differentiate between anchor, positive, and negative examples in order to make correct judgments during the inference step. From Fig. 8, it can be observed that the effect of the margin on the model is significant. As evidenced by the results on TACRED (Fig. 8a) and TACREV (Fig. 8b), the classification efficacy of the model diminishes considerably when the margin is increased to 0.35. This can be attributed to the inherent characteristics of the triplet loss function: larger margins impede the model's ability to differentiate between positive and negative instances. In contrast, for RE-TACRED (Fig. 8c), the issue of label imbalance is less pronounced than in the first two datasets, resulting in a diminished sensitivity of the model to variations in margin size. Even when the margin is increased to 0.5, the model maintains a satisfactory level of performance. Based on these observations, it can be concluded that CLERE is well-suited for handling imbalanced datasets and that employing a smaller margin facilitates the model's ability to distinguish between positive and negative instances. For datasets with a more balanced distribution of labels, the model exhibits significantly reduced sensitivity to changes in margin size.

Case Study

Several examples have been listed in Table 6 to provide a clear idea of CLERE. As can be observed from the second sentence, two identical inference results are given by the model, but their inference distances differ, indicating that shortcuts are not taken during the training process. It can also be observed that the model’s inference for similar relationships is much closer than that for dissimilar relationships. For instance, in sentence 2, the inference of “no_relation” is closer than that of labels of type “per” and type “org,” which is determined by the text semantics of the labels. This provides sufficient evidence for the feasibility of involving semantic information of the labels in the training of the classification task. Moreover, it is found that even if the inferred results are of the same type as the ground truth, such as both “per” types in sentence 1 and “org” types in sentence 3, the model’s distance calculation for them is distinguished by more significant differences, illustrating the clarity of CLERE’s recognition of label semantics. In summary, the information provided by the data itself has been fully utilized by CLERE.

Table 6 Case study
Table 7 Error analysis. The table marks the entities in each example, the ground truth label, and whether each model's inference is consistent with the ground truth

Error Analysis

Error analysis plays a critical role in identifying model weaknesses, enhancing dataset quality, refining model design, and ultimately improving overall model performance. Table 7 presents a subset of CLERE's inference results on the TACRED dataset, along with the predicted outcomes of MTB and RIB. For sentence #1, both MTB and RIB yield incorrect predictions, since the relationship between the subject and object is labeled as "no_relation" in the dataset. Notably, the candidate relations predominantly involve "person" entities, and MTB and RIB infer "per:age," likely due to the prevalent association between the entities involved. CLERE learns by discerning discrepancies among samples: in the training data, samples whose object type is "DURATION" are frequently labeled "no_relation," so when CLERE encounters analogous situations in the test dataset, the encoding of the test sample naturally exhibits a smaller semantic distance to "no_relation." In sentence #2, all three methods give incorrect predictions, for two possible reasons. Firstly, the subject_type and object_type are "ORGANISATION" and "PERSON," respectively, which have a high similarity to several of CLERE's candidate relations. Furthermore, from a human perspective, there is a relationship between "Herman Cain" and "National Restaurant Association," yet it is marked as "no_relation" in the TACRED dataset; the imperfections of the dataset could therefore contribute to the models' incorrect predictions. Secondly, based on the statistics of Zhang et al. [36], the proportion of samples with sentence lengths greater than 60 is 3.21%, and this particular sample has a length of 61. Consequently, the ability of the three models to understand long sentences on this dataset is limited, mainly due to the lack of training data for long sentences. Addressing this limitation is of significant practical importance for subsequent improvements, since in real-world application scenarios models often encounter a large number of long sentences. For sentence #3, the majority of sentence lengths in the TACRED dataset are concentrated between 20 and 42, so the models achieve their best predictions on samples falling within this range. Overall, based on the error analysis, future efforts should prioritize improving the models' ability to understand longer sentences. CLERE will continue to improve following the work of Zhuang et al. [44] and Wang et al. [45].

Conclusion

In this paper, the RE model is enhanced by improving the pooling strategy to achieve advanced contextual representations. Based on the idea of contrastive learning, the embedding of label semantic information is introduced into the model's learning process, alleviating the problems caused by label imbalance in the dataset. The reasons behind the effect of pooling strategies on contextual embeddings are scrutinized, and experiments are conducted to investigate their influence on model learning outcomes. The semantic similarity of the labels in the dataset is calculated, and it is found that different labels have significant semantic differences and can be strictly distinguished. The experimental results of the model are then analyzed, the attention distribution of different layers of PLMs is discussed, and ablation experiments are conducted on various pooling strategies. Furthermore, the impact of the margin on the model's performance is analyzed. Finally, the proposed method is illustrated through a case study. We hope that more researchers will explore the role of label semantics.