
1 Introduction

Relation extraction aims to predict the relation between entities based on their context. The extracted relational facts play a vital role in various natural language processing applications, such as knowledge base enrichment [5], web search [32], and question answering [12].

To improve the quality of extracted relational facts and benefit downstream tasks, many efforts have been devoted to this task. Supervised relation extraction is a representative paradigm built upon the closed world assumption [8]. Benefiting from artfully designed network architectures [14, 24, 36] and the valuable knowledge in pretrained language models [1, 6, 30, 31], models effectively capture semantically rich representations and achieve superior results. However, conventional supervised relation extraction suffers from the lack of large-scale labeled data. To tackle this issue, distantly supervised relation extraction has attracted much attention. Existing works mainly focus on how to alleviate the noise generated by automatic annotation. Common approaches include selecting informative instances [19], incorporating extra information [35], and designing sophisticated training strategies [22].

Fig. 1. Neural models tend to take the simplest path to satisfying the supervised objective (the shortcut phenomenon [9]), which leads to incorrect predictions on unseen relations. Hence, for unseen relations, we want neural models to reject prediction by learning sufficient features.

Although a supervised relation classifier achieves excellent performance on known relations, real-world inputs are often mixed with samples of unseen relations. A well-performing model can still confidently make arbitrarily wrong predictions when dealing with these unseen relations [25, 27]. This lack of robustness is rooted in the shortcut features [9] of neural networks: models optimized with a supervised objective do not actively learn features beyond the bare minimum necessary to discriminate between known relations. As shown in Fig. 1, if the only relation between Obama and the United States in the training data is president, the model tends to predict the president relation whenever it encounters this entity pair again. However, entities are not equivalent to relation definitions. Models severely biased toward extracting overly simplistic features can easily fail to discriminate between known and unseen relations. As shown in Table 1, when unseen relations appear in the test set, the supervised RE models' \(F_1\)-score drops by at least 30 points.

Table 1. Supervised RE models’ performance when encountering new relations. These models are from previous papers [15, 21, 26]. Ori: all relations in the test set are present in the training set. Mix: \(50\%\) of the relations in the test set do not appear in the training set.

In this work, we propose a robust relation extraction method for real-world settings. By integrating a rejection option, the classifier can effectively detect whether inputs express unseen relations instead of making arbitrarily bad predictions. Specifically, we introduce contrastive training techniques to achieve this goal. A set of carefully designed class-preserving transformations is used to learn sufficient features, which enhances the discriminability between known and unknown relation representations. The classifier built on the learned representations is confidence-calibrated, so samples of unseen relations are assigned a low confidence score and rejected. Off-the-shelf OpenRE methods can then be used to discover potential relations in these samples. In addition, we find that rejection can be further improved via readily available distantly supervised data. Experimental results show the effectiveness of our method in capturing discriminative representations for unseen relation rejection.

To summarize, the main contributions of our work are as follows: (1) We propose a relation extraction method with a rejection option, which remains robust when exposed to unseen relations. (2) We design a set of class-preserving transformations to learn sufficient features to discriminate between known and novel relations. In addition, we propose to use readily available distantly supervised data to enhance the discriminability. (3) Extensive experiments on two academic datasets prove the effectiveness of our method in capturing discriminative representations for unseen relation rejection.

2 Related Work

2.1 Relation Extraction

Relation extraction has been studied for more than two decades. Supervised and distantly supervised relation extraction are oriented toward predefined relation types. Researchers have explored different network architectures [36], training strategies [22], and external information [35], achieving superior results. Open relation extraction is oriented toward emerging unknown relations. Well-designed extraction forms (e.g., sequence labelling [7], clustering [38]) are used to deal with relations without pre-specified schemas. Different from these lines of work, we consider a more general scenario in which known and unknown relations are mixed in the input. We effectively separate them with a rejection option, which enables us to use the optimal paradigm for the corresponding relations.

2.2 Classification with Rejection Option

Most existing classification methods are based on the closed world assumption. However, in real-world applications inputs are often mixed with samples of unknown classes. The approaches used to handle this roughly fall into two groups. The first group calculates a confidence score based on the classifier output, which can be used to measure whether an input belongs to an unknown class. Maximum softmax probability (MSP) [11] is a representative method, and Liang et al. [17] further improve MSP by introducing temperature scaling. Furthermore, Shu et al. [29] build a multi-class classifier with a 1-vs-rest final layer of sigmoids to reduce the open space risk. The second group treats classification with rejection option as an outlier detection problem. Off-the-shelf outlier detection algorithms [2, 20, 28] are leveraged, and different optimization objectives such as large margin loss [18] and Gaussian mixture loss [33] are adopted to learn more discriminative representations that facilitate anomaly detection. Recently, Zhang et al. [34] propose to learn an adaptive decision boundary (ADB) that serves as the basis for judging outliers.

3 Approach

In this paper, we propose a robust relation extraction method for real-world settings. By integrating a rejection option, the classifier can effectively detect whether inputs express unseen relations instead of making arbitrarily bad predictions. Off-the-shelf OpenRE methods can be used to discover potential relations in these rejected samples.

The problem setting in this work is formally stated as follows. Let \(\mathcal {K}=\{\mathcal {R}_1,...,\mathcal {R}_k\}\) be a set of known relations and \(\mathcal {U}=\{\mathcal {R}_{k+1},...,\mathcal {R}_n\}\) be a set of unseen relations, where \(\mathcal {K} \cap \mathcal {U}=\emptyset \). Let \(\mathcal {X}\) be an input space. Given the training data \(\mathcal {D}^\ell =\{(x_i^\ell ,y_i^\ell )\}_{i=1,...,N}\) where \(x_i^\ell \in \mathcal {X}\), \(y_i^\ell \in \mathcal {K}\), we aim to construct a mapping rule \(f:\mathcal {X}\rightarrow \{\mathcal {R}_1,...,\mathcal {R}_k,\mathcal {R}^*\}\), where \(\mathcal {R}^*\) denotes the rejection option. Let \(\mathcal {D}^u=\{(x_i^u,y_i^u)\}_{i=1,...,M}\) be the test dataset where \(y_i^u \in \mathcal {K}\cup \mathcal {U}\). A desirable mapping rule f should meet the following objective as closely as possible:

$$ f(x_i^u)=\left\{ \begin{array}{cl} y_i^u & {y_i^u \in \mathcal {K}}\\ \mathcal {R}^* & {y_i^u \in \mathcal {U}}. \end{array} \right. $$

3.1 Method Overview

We approach the problem by introducing contrastive learning techniques. As illustrated in Fig. 2, the proposed method comprises four major components: relation representation encoder \(\boldsymbol{g}(\cdot )\), confidence-calibrated classifier \(\boldsymbol{\eta }(\cdot )\), class-preserving transformations \(\mathcal {T}\), and the OpenRE module.

Fig. 2. An overview of the proposed method. Three steps are included: (1) Contrastive training techniques and a set of class-preserving transformations are utilized to learn sufficient features. (2) The classifier extracts known relations and rejects samples of unseen relations according to these features. (3) An off-the-shelf OpenRE method (SelfORE) is incorporated to discover unseen relations in the rejected samples.

Our overview starts from the first two components. There is no doubt that an encoder and classifier are the basic components of a supervised relation extractor. However, the supervised training objective does not encourage the model to learn features beyond the bare minimum necessary to discriminate between known relations. Consequently, the classifier can misclassify unseen relations to known relations with high confidence.

In order to calibrate the confidence of the classifier, we introduce contrastive learning techniques. Given a training batch \(\mathcal {B}\), an augmented batch \(\widetilde{\mathcal {B}}\) is obtained by applying a random transformation \(t\in \mathcal {T}\) to mask partial features. The supervised contrastive learning objective then maximizes or minimizes representation agreement according to whether the relations of two samples are the same. By doing this, the model is forced to find more features to discriminate between relations, and the classifier can be calibrated. Based on the confidence-calibrated classifier, unknown relations are rejected if the maximum softmax probability of the classifier does not exceed a preset threshold \(\theta \).

In order to discriminate unknown relations rather than just detect their existence, we further integrate the off-the-shelf OpenRE method into our framework. The samples rejected by the classifier are sent to the OpenRE module to detect potential unknown relations.

3.2 Relation Representation Encoder

Given a relation instance \(x_i^\ell =(\boldsymbol{w}_i,h_i,t_i)\in \mathcal {D}^\ell \), where \(\boldsymbol{w}_i=\{w_1,w_2,...,w_n\}\) is the input sentence and \(h_i=(s^h,e^h)\), \(t_i=(s^t,e^t)\) mark the positions of the head and tail entities, the relation representation encoder \(\boldsymbol{g}(\cdot )\) aims to encode contextual relational information into a fixed-length representation \(\boldsymbol{r}_i=\boldsymbol{g}(x_i)\in \mathbb {R}^d\). We opt for simplicity and adopt the commonly used BERT [4] to obtain \(\boldsymbol{r}_i\), although other network architectures could be used without any constraints. Formally, the process of obtaining \(\boldsymbol{r}_i\) is:

$$\begin{aligned} \boldsymbol{h}_1,...,\boldsymbol{h}_n=\textrm{BERT}(w_1,...,w_n)\end{aligned}$$
(1)
$$\begin{aligned} \boldsymbol{h}_{ent}=\textrm{MAXPOOL}(\boldsymbol{h}_s,...,\boldsymbol{h}_e)\end{aligned}$$
(2)
$$\begin{aligned} \boldsymbol{r}_{i}=\left\langle \boldsymbol{h}_{head}|\boldsymbol{h}_{tail}\right\rangle , \end{aligned}$$
(3)

where \(\boldsymbol{h}_1,...,\boldsymbol{h}_n\) are the hidden states of the input sentence after BERT encoding, the subscripts s and e represent the start and end positions of an entity, \(\boldsymbol{h}_{ent}\) is the max-pooled representation of an entity span and is computed separately for the head entity \(\boldsymbol{h}_{head}\) and the tail entity \(\boldsymbol{h}_{tail}\), and \(\left\langle \cdot |\cdot \right\rangle \) is the concatenation operator.
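For concreteness, a minimal PyTorch sketch of this encoder is given below. The class name, the choice of bert-base-uncased, and the span format are our illustrative assumptions rather than the authors' released implementation.

```python
import torch
from transformers import BertModel

class RelationEncoder(torch.nn.Module):
    """Sketch of g(.): BERT encoding, max-pooling over entity spans, concatenation."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask, head_spans, tail_spans):
        # h_1, ..., h_n = BERT(w_1, ..., w_n)                         (Eq. 1)
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = []
        for spans in (head_spans, tail_spans):
            # h_ent = MAXPOOL(h_s, ..., h_e) for each entity span     (Eq. 2)
            pooled.append(torch.stack([
                hidden[b, s:e + 1].max(dim=0).values
                for b, (s, e) in enumerate(spans)
            ]))
        # r_i = <h_head | h_tail>                                     (Eq. 3)
        return torch.cat(pooled, dim=-1)
```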

3.3 Confidence-Calibrated Classifier

In order to alleviate overconfidence on unseen relations, we introduce contrastive learning techniques to calibrate the classifier. A well-calibrated classifier should not only accurately classify known relations, but also assign unseen relations a low confidence score, i.e., a low maximum softmax probability \(\max _y p(y|x)\).

Given a training batch \(\mathcal {B}=\{(x_i^\ell ,y_i^\ell )\}_{i=1}^B\), we obtain an augmented batch \(\widetilde{\mathcal {B}}=\{(\widetilde{x}_i^\ell ,y_i^\ell )\}_{i=1}^B\) by applying a random transformation \(t\in \mathcal {T}\) to \(\mathcal {B}\). For brevity, the superscript \(\ell \) is omitted in the rest of this section. For each labeled sample \((\widetilde{x}_i, y_i)\), \(\widetilde{\mathcal {B}}\) can be divided into two subsets \(\widetilde{\mathcal {B}}_{y_i}\) and \(\widetilde{\mathcal {B}}_{-y_i}\), where \(\widetilde{\mathcal {B}}_{y_i}\) contains the samples of relation \(y_i\) and \(\widetilde{\mathcal {B}}_{-y_i}\) contains the rest. The supervised contrastive learning objective is defined as follows:

$$\begin{aligned} \mathcal {L}^{sup}_{cts}(\mathcal {B}, \mathcal {T})=\frac{1}{2B}\sum _{i=1}^{2B}\mathcal {L}_{cts}(\widetilde{x}_i,\widetilde{\mathcal {B}}_{y_i}\backslash \{\widetilde{x}_i\},\widetilde{\mathcal {B}}_{-y_i})\end{aligned}$$
(4)
$$\begin{aligned} \mathcal {L}_{cts}(x,\mathcal {D}^+,\mathcal {D}^-)=-\frac{1}{|\mathcal {D}^+|}\log \frac{\sum _{x^\prime \in \mathcal {D}^+}q(x,x^\prime )}{\sum _{x^\prime \in \mathcal {D}^+\cup \mathcal {D}^-}q(x,x^\prime )}\end{aligned}$$
(5)
$$\begin{aligned} q(x,x^\prime )=\exp (\textrm{sim}(\boldsymbol{z}(x),\boldsymbol{z}(x^\prime ))/\tau ), \end{aligned}$$
(6)

where \(|\mathcal {D}|\) denotes the number of samples in \(\mathcal {D}\), \(\textrm{sim}(\cdot ,\cdot )\) denotes cosine similarity and \(\tau \) denotes a temperature coefficient. Following Chen et al. [3], we use an additional projection layer \(\boldsymbol{t}\) to obtain the contrastive feature \(\boldsymbol{z}(x)=\boldsymbol{t}(\boldsymbol{g}(x))\).
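The following is a minimal PyTorch sketch of Eqs. (4)–(6), assuming the projected features of a batch and its augmented views are stacked into a single tensor; the function name, the default temperature, and the handling of anchors without positives are our assumptions.

```python
import torch
import torch.nn.functional as F

def sup_contrastive_loss(z, y, tau=0.1):
    """Sketch of Eqs. (4)-(6). z: (2B, d) projected features z(x) for a batch and
    its augmented views; y: (2B,) relation labels aligned with the rows of z."""
    z = F.normalize(z, dim=-1)                        # cosine similarity via dot products
    q = torch.exp(z @ z.t() / tau)                    # Eq. (6)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (y.unsqueeze(0) == y.unsqueeze(1)) & ~eye   # D+: same relation, excluding self
    valid = pos.any(dim=1)                            # skip anchors with no positives
    # Eq. (5): -(1/|D+|) * log( sum over D+ / sum over D+ and D- )
    ratio = q.masked_fill(~pos, 0).sum(1) / q.masked_fill(eye, 0).sum(1)
    per_anchor = -torch.log(ratio[valid]) / pos.sum(1)[valid]
    return per_anchor.mean()                          # Eq. (4): average over the anchors
```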

Benefiting from contrastive training, the encoder \(\boldsymbol{g}(\cdot )\) learns rich features to discriminate between known and novel relations. Accordingly, we train a confidence-calibrated classifier \(\boldsymbol{\eta }(\cdot )\) on top of \(\boldsymbol{g}(\cdot )\) as follows:

$$\begin{aligned} \mathcal {L}=\mathbb {E}_{(x,y)\sim \mathcal {D}^\ell }[\mathcal {L}_{ce}(\boldsymbol{\eta } (\boldsymbol{g}(x)),y)], \end{aligned}$$
(7)

where \(\mathcal {L}_{ce}\) is the cross-entropy loss. In addition, we can easily obtain a large amount of training data \(\mathcal {D}^{dist}\) through distant supervision. None of the \(y_i^{dist}\) in \(\mathcal {D}^{dist}\) are known relations, that is, \(\{y_i^{dist}\}\cap \{y_j^\ell \}=\emptyset \). These data are only used as negative examples, so the noise in them is not a problem. We force the classifier output distribution on negative examples to approximate the uniform distribution by optimizing the cross-entropy between them. Using \(\mathcal {D}^{dist}\), we optimize the model with the following objective instead of Eq. 7:

$$\begin{aligned} \mathcal {L}^{dist}=\mathcal {L}+\lambda \mathbb {E}_{x\sim \mathcal {D}^{dist}}[\mathcal {L}_{ce}(\boldsymbol{\eta }(\boldsymbol{g}(x)),y_{uni})], \end{aligned}$$
(8)

where \(\mathcal {L}\) refers to the optimization objective of Eq. 7, \(\lambda \) is a hyperparameter that balances the known relation data and the distantly supervised data, and \(y_{uni}\) represents a uniform distribution. We can achieve good results simply by setting \(\lambda \) to 1 without further tuning.
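As a rough sketch of Eqs. (7)–(8), the calibration objective can be written as below, assuming PyTorch ≥ 1.10 (which accepts probability targets in cross_entropy); the function name and argument layout are our assumptions.

```python
import torch
import torch.nn.functional as F

def calibration_loss(logits_known, labels_known, logits_dist=None, lam=1.0):
    """Eq. (7): cross-entropy on known-relation samples.  Eq. (8): additionally push
    the classifier output on distantly supervised negatives toward y_uni (uniform)."""
    loss = F.cross_entropy(logits_known, labels_known)                 # Eq. (7)
    if logits_dist is not None:
        k = logits_dist.size(-1)                                       # number of known relations
        y_uni = torch.full_like(logits_dist, 1.0 / k)                  # uniform target distribution
        loss = loss + lam * F.cross_entropy(logits_dist, y_uni)        # Eq. (8), lambda = 1 by default
    return loss
```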

Based on the confidence-calibrated classifier, we specify the rejection rule \(f(\cdot )\) as follows:

$$\begin{aligned} f(x_i)=\left\{ \begin{array}{cl} \mathop {\arg \max }_{y} p(y|x_i) & {\max _y p(y|x_i)>\theta }\\ \mathcal {R}^* & {\text {otherwise}}, \end{array} \right. \end{aligned}$$
(9)

where \(\theta \) is a threshold hyperparameter, the posterior probability \(p(y|x_i)\) is the output of the classifier \(\boldsymbol{\eta }\), and \(\mathcal {R}^*\) denotes the rejection option.
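In code, Eq. (9) amounts to thresholding the maximum softmax probability; a minimal sketch follows, where encoding the rejection option \(\mathcal {R}^*\) as -1 is our convention.

```python
import torch

def predict_with_rejection(logits, theta=0.5, reject_id=-1):
    """Eq. (9): predict the most probable known relation if its probability exceeds
    theta, otherwise return the rejection option R* (encoded here as reject_id)."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    return torch.where(conf > theta, pred, torch.full_like(pred, reject_id))
```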

3.4 Class-Preserving Transformations

Transformations are a core component of contrastive learning. Our intuition in designing transformations is that masking features from different views forces the model to find more features to discriminate between known relations, and these new features can play a vital role in recognizing unseen relations. Why does this work? As shown in Fig. 1, due to the shortcut phenomenon, the model is inclined to memorize the relation between specific entities, so it makes mistakes when a new relation holds between the same entity pair. Intuitively, through the mask mechanism the model can mask out some features that belong to Obama and the United States, and it then has to find other features to distinguish president of from other relations. Therefore it does not learn the shortcut bias Obama + the United States = president of. In this work, we design the following three class-preserving transformations that mask partial features.

Token Mask. Token mask works in the process of sentence encoding. In this transformation, we randomly mask a certain proportion of tokens to generate a new view of relation representation.

Random Mask. Random mask also works in the process of sentence encoding. Instead of completely masking the representations of selected tokens, this transformation considers each dimension of each token representation independently.

Feature Mask. Feature mask works after sentence encoding. Given a relation instance \(x_i^\ell \in \mathcal {D}^\ell \), we first obtain its relation representation \(\boldsymbol{r}_i=\boldsymbol{g}(x_i)\). Then we randomly mask a certain proportion of feature dimensions of \(\boldsymbol{r}_i\) to generate a new view.
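A minimal sketch of the three transformations is given below; the mask ratios and the decision to leave special tokens and entity markers untouched are our assumptions, since masking them could break the class-preserving property.

```python
import torch

def token_mask(input_ids, mask_token_id, maskable, ratio=0.15):
    """Token mask: replace a random proportion of maskable tokens with [MASK]
    before encoding (maskable is a bool tensor excluding special/entity tokens)."""
    chosen = (torch.rand(input_ids.shape, device=input_ids.device) < ratio) & maskable
    return torch.where(chosen, torch.full_like(input_ids, mask_token_id), input_ids)

def random_mask(token_reprs, ratio=0.15):
    """Random mask: zero out each dimension of each token representation independently."""
    keep = (torch.rand_like(token_reprs) >= ratio).float()
    return token_reprs * keep

def feature_mask(relation_repr, ratio=0.15):
    """Feature mask: zero out a random proportion of the dimensions of r_i = g(x_i)."""
    keep = (torch.rand_like(relation_repr) >= ratio).float()
    return relation_repr * keep
```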

More complicated and diverse transformations would likely bring additional improvements; we leave this for future work.

3.5 OpenRE Module

We introduce the OpenRE module for the integrity of the framework, although it is not our main concern. Based on the rejection rule f described in Sect. 3.3, we can classify samples of known relations while rejecting unseen relations. In this section, we take a step further: by integrating an off-the-shelf OpenRE method, we try to discover the potential unseen relations in the rejected samples instead of only detecting their existence. We adopt SelfORE [13], a clustering-based OpenRE method, as the building block of our OpenRE module; various other methods can also be used as alternatives without any constraints. More details about OpenRE methods can be found in the related papers. Overall, the method proposed in this paper is detailed in Algorithm 1.

Algorithm 1

4 Experimental Setup

In this section, we describe the datasets for training and evaluating the proposed method. We also detail the baseline models for comparison. Finally, we clarify the implementation details.

4.1 Datasets

We conduct our experiments on two well-known relation extraction datasets. In addition, a distantly supervised dataset is used in an auxiliary way.

FewRel. The Few-Shot Relation Classification Dataset [10]. FewRel is a human-annotated dataset containing 80 relation types, each with 700 instances. We use the top 40 relations as known and the middle 20 relations as unseen. Since the relations of the FewRel dataset are exactly the same as those of FewRel-distant, we hold out the last 20 relations for use in distant supervision. The training set contains 25600 randomly selected samples of known relations. In order to evaluate rejection performance on the unseen relations, the test/validation set contains 3200/1600 samples composed of known and unseen relations.

TACRED. The TAC Relation Extraction Dataset [37]. TACRED is a human-annotated large-scale relation extraction dataset that covers 41 relation types. Similar to the setting of FewRel, we use the top 31 relations as known and the remaining 10 relations as unseen. The training set consists of 18113 randomly selected samples of known relations. The validation and test sets contain 900 and 1800 samples respectively, including both known and unseen relations. It should be noted that 50% of the unseen relation samples in the validation and test sets are no_relation.

FewRel-distant. FewRel-distant contains the distantly-supervised data obtained by the authors of FewRel before human annotation. We use this dataset as the distantly supervised data in our experiments.

4.2 Baselines and Evaluation Metrics

MSP [11]. MSP assumes that correctly classified examples tend to have greater maximum softmax probabilities than examples of unseen classes. The maximum softmax probability is therefore used as the confidence score for unseen class detection.

MSP-TC [17]. MSP-TC uses maximum softmax probabilities with temperature scaling and small perturbations to enhance the separability between known and unseen classes, allowing for more effective detection.

DOC [29]. DOC builds n 1-vs-rest sigmoid classifiers for the n known classes. The maximum probability of these binary classifiers is used as the confidence score for unseen class detection.

LMCL [18]. Large margin cosine loss (LMCL) aims to learn discriminative deep representations. It forces the model not only to classify correctly but also to maximize inter-class variance and minimize intra-class variance. Based on the learned representations, the local outlier factor (LOF) is used to detect unseen classes.

ADB [34]. Labeled samples of known classes are first used for representation learning. The learned representations are then utilized to learn an adaptive spherical decision boundary for each known class. Samples falling outside the hyperspheres are rejected.

Evaluation Metrics. We follow previous work [18, 34] and treat all the unseen relations as a single rejected class. Accuracy and macro \(F_1\) are used to evaluate unseen relation detection.
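For reference, a small sketch of this evaluation protocol, under our assumption that both the gold labels of unseen relations and the rejected predictions are mapped to a single extra class id:

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred, known_ids, reject_id=-1):
    """Collapse every unseen relation into one rejected class, then score accuracy
    and macro F1 over the known classes plus the rejected class."""
    y_true = [y if y in known_ids else reject_id for y in y_true]
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred, average="macro")
```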

4.3 Implementation Details

We use Adam [16] as the optimizer, with a learning rate of \(1e-4\) and a batch size of 100 for all datasets. If results on the validation set do not improve for 10 epochs, we stop training to avoid overfitting. All experiments are conducted on an NVIDIA GeForce RTX 3090 with 24 GB of memory.

5 Results and Analysis

In this section, we present the experimental results of our method on FewRel and TACRED datasets to demonstrate the effectiveness of our method.

5.1 Main Results

Table 2. Main results of unseen relation detection with different known class proportions (25%, 50% and 75%) on two relation extraction datasets. Compared with the best results of all baselines, our method improves the \(F_1\)-score by an average of 2.6% and 3.5% on the FewRel and TACRED datasets, respectively.
Table 3. Macro \(F_1\)-score of known relation classification with different proportions of known relations.

Our experiments in this section focus on the following three related questions.

Can the Proposed Method Effectively Detect Unseen Relations? To answer this question, we consider all the known relations as one predicted class and the remaining unseen relations as one rejected class. Table 2 reports model performance on the FewRel and TACRED datasets, showing that the proposed method achieves state-of-the-art results on unseen relation detection. Benefiting from the contrastive training objectives and the carefully designed transformations, the shortcut phenomenon is effectively alleviated and the model learns sufficient features to discriminate between known and unseen relations. Therefore, the proposed method consistently outperforms the compared baselines by a large margin across different mixing-ratio settings.

Does the Detection of Unseen Relations Impair the Extraction of Known Relations? Integrating the rejection option can make the classifier more robust in real applications. However, we do not want unseen relation detection to impair known relation classification, which is the basic function of the classifier. From Table 3 we can observe that the proposed model not only effectively detects unseen relations but also accurately classifies known relations. This demonstrates that the designed transformations do not affect the original relational semantics, so the rich features obtained by contrastive learning remain discriminative for the known relations.

Fig. 3. ROC curves on two datasets.

Can the Model Achieve Superior Performance Under Different Threshold Settings? We show the receiver operating characteristic (ROC) curves in Fig. 3. The area under the ROC curve (AUROC) summarizes the performance of a classifier at detecting unseen relations across different thresholds. From Fig. 3 we can observe that the AUROC of the proposed method is the largest. Therefore, the proposed method retains its advantage under different threshold settings.
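Assuming, as in Eq. (9), that the maximum softmax probability serves as the detection score (higher means more likely known), AUROC can be computed as follows; the function name and tensor/array layout are our assumptions.

```python
import torch
from sklearn.metrics import roc_auc_score

def auroc_known_vs_unseen(logits, is_known):
    """Threshold-free AUROC for detecting known vs. unseen relations from the
    maximum softmax probability; is_known is 1 for known samples, 0 for unseen."""
    scores = torch.softmax(logits, dim=-1).max(dim=-1).values
    return roc_auc_score(is_known, scores.cpu().numpy())
```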

5.2 Ablation Study

To understand the effects of each component of the proposed model, we conduct an ablation study and report the results (macro \(F_1\)) on the two datasets in Table 4. The results show that the detection of unseen relations is degraded if any transformation is removed. This indicates that (1) the transformations force the model to learn sufficient features through the mask mechanism applied from different views, and the learned features are beneficial for the detection of unseen relations; (2) since the transformations operate on different views, they can be superimposed to further enhance the detection of unseen relations. In addition, we find that distantly supervised data can significantly improve the detection of unseen relations. Because there are a large number of diverse relations in external knowledge bases, we can easily construct a large number of negative samples, so this improvement can be seen as a free lunch.

Table 4. Ablation study of our method.

5.3 Relation Representation Visualization

Fig. 4. Visualization of the relation representations after t-SNE dimensionality reduction. The representations are colored with their ground-truth relation labels; black triangles indicate unknown relations. The four panels, from top left to bottom right, show the relation representations at the initial state, after supervised optimization, after contrastive optimization, and after both.

To intuitively show the influence of the rich features learned through contrastive training, we visualize the relation representations with t-SNE [23]. We select five semantically similar known relations from the FewRel dataset and randomly select 40 samples for each of them. We also select 100 hard samples of unseen relations misclassified by the MSP method to show the superiority of our method. From the visualization results in Fig. 4, we can observe that before training (upper left) the relation representations are scattered in the semantic space. After supervised training (upper right), samples can be roughly divided by relation, but different relations are still close to each other; this is consistent with the shortcut features of neural networks, and samples of unseen relations are mixed with known relation samples. After contrastive training (lower left), the model learns sufficient features to discriminate unseen relations, so samples of unseen relations are effectively separated. Finally, the best relation representations are obtained by applying both supervised and contrastive optimization (lower right).
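A minimal sketch of how such a plot could be produced with scikit-learn and matplotlib; the perplexity, marker styles, and output path are our choices.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_relation_space(representations, labels, is_unseen, out_path="relation_tsne.png"):
    """Project relation representations (numpy array, shape (N, d)) to 2D with t-SNE;
    labels and is_unseen are numpy arrays; known samples are colored by relation label
    and unseen samples are drawn as black triangles."""
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(representations)
    plt.scatter(emb[~is_unseen, 0], emb[~is_unseen, 1], c=labels[~is_unseen], s=10)
    plt.scatter(emb[is_unseen, 0], emb[is_unseen, 1], c="black", marker="^", s=10)
    plt.savefig(out_path, dpi=300)
```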

5.4 A Case Study on OpenRE

Table 5. Extracted and golden surface-form relation names on TACRED.

For the samples rejected by the classifier, an off-the-shelf OpenRE method can be used to discover potential unseen relations. In this section, we provide a brief case study of the unseen relations discovered by SelfORE [13]. The OpenRE module outputs the cluster assignment of the rejected samples, and we extract relation names using the most frequent n-grams in each cluster; the extraction results are shown in Table 5. By integrating the OpenRE module, our method completes (1) the classification of known relations, (2) the rejection of unseen relations, and (3) the discovery of unseen relations. Based on the above process, robust relation extraction in real applications is realized.

6 Conclusions

In this work, we introduce a relation extraction method with a rejection option to improve robustness in real-world applications. The proposed method employs contrastive training techniques and a set of carefully designed transformations to learn sufficient features, with which known relations can be classified and unseen relations rejected. Unseen relations among the rejected samples can then be discovered by incorporating off-the-shelf OpenRE methods. Experimental results show that our method outperforms state-of-the-art methods on unseen relation rejection.