
1 Introduction

Relation extraction aims to predict the relation between entities based on their context. The extracted relational facts play a vital role in various natural language processing applications, such as knowledge base enrichment [5], web search [32], and question answering [12].

To improve the quality of extracted relational facts and benefit downstream tasks, many efforts have been devoted to this task. Supervised relation extraction is a representative paradigm built upon the closed world assumption [8]. Benefiting from artfully designed network architectures [14, 24, 36] and the valuable knowledge in pretrained language models [1, 6, 30, 31], models effectively capture semantically rich representations and achieve superior results. However, conventional supervised relation extraction suffers from the lack of large-scale labeled data. To tackle this issue, distantly supervised relation extraction has attracted much attention. Existing works mainly focus on how to alleviate the noise generated by automatic annotation. Common approaches include selecting informative instances [19], incorporating extra information [35], and designing sophisticated training strategies [22].

Fig. 1. Neural models tend to take the simplest path to satisfying the supervised objective (the shortcut phenomenon [9]), which leads to incorrect predictions on unseen relations. Hence, for unseen relations, we want neural models to reject prediction by learning sufficient features.

Although a supervised relation classifier achieves excellent performance on known relations, real-world inputs are often mixed with samples of unseen relations. A well-performing model can still confidently make arbitrarily wrong predictions when dealing with these unseen relations [25, 27]. This lack of robustness is rooted in the shortcut features [9] of neural networks: models optimized with a supervised objective do not actively learn features beyond the bare minimum necessary to discriminate between known relations. As shown in Fig. 1, if the only relation between Obama and the United States in the training data is president, the model tends to predict the president relation whenever it encounters this entity pair again. However, entities are not equivalent to relation definitions. Models severely biased toward extracting overly simplistic features can easily fail to discriminate between known and unseen relations. As shown in Table 1, when unseen relations appear in the test set, the supervised RE models' \(F_1\)-score drops by at least 30 points.

Table 1. Supervised RE models’ performance when encountering new relations. These models are from previous papers [15, 21, 26]. Ori: all relations in the test set are present in the training set. Mix: \(50\%\) of the relations in the test set do not appear in the training set.

In this work, we propose a robust relation extraction method for real-world settings. By integrating a rejection option, the classifier can effectively detect whether inputs express unseen relations instead of making arbitrarily bad predictions. Specifically, we introduce contrastive training techniques to achieve this goal. A set of carefully designed class-preserving transformations is used to learn sufficient features, which enhances the discriminability between known and unknown relation representations. The classifier built on the learned representations is confidence-calibrated, so samples of unseen relations are assigned a low confidence score and rejected. Off-the-shelf OpenRE methods can then be used to discover potential relations in these samples. In addition, we find that rejection can be further improved via readily available distantly supervised data. Experimental results show the effectiveness of our method in capturing discriminative representations for unseen relation rejection.

To summarize, the main contributions of our work are as follows: (1) We propose a relation extraction method with a rejection option, which remains robust when exposed to unseen relations. (2) We design a set of class-preserving transformations to learn sufficient features to discriminate between known and novel relations. In addition, we propose to use readily available distantly supervised data to enhance the discriminability. (3) Extensive experiments on two academic datasets prove the effectiveness of our method in capturing discriminative representations for unseen relation rejection.

2 Related Work

2.1 Relation Extraction

Relation extraction has been studied for more than two decades. Supervised and distantly supervised relation extraction are oriented toward predefined relation types. Researchers have explored different network architectures [36], training strategies [22], and external information [35], achieving superior results. Open relation extraction is oriented toward emerging unknown relations. Well-designed extraction forms (e.g., sequence labelling [7], clustering [38]) are used to deal with relations without pre-specified schemas. Different from these lines of work, we consider a more general scenario in which known and unknown relations are mixed in the input. We effectively separate them with a rejection option, which enables us to use the optimal paradigm for the corresponding relations.

2.2 Classification with Rejection Option

Most existing classification methods are based on the closed world assumption. However, in real-world applications inputs are often mixed with samples of unknown classes. The approaches used to handle this roughly fall into two groups. The first group calculates a confidence score based on the classifier output, which can be used to measure whether an input belongs to an unknown class. Maximum softmax probability (MSP) [11] is a representative method, and Liang et al. [17] further improve MSP by introducing temperature scaling. Furthermore, Shu et al. [29] build a multi-class classifier with a 1-vs-rest final layer of sigmoids to reduce the open space risk. The second group treats classification with rejection option as an outlier detection problem. Off-the-shelf outlier detection algorithms [2, 20, 28] are leveraged, and different optimization objectives such as large margin loss [18] and Gaussian mixture loss [33] are adopted to learn more discriminative representations that facilitate anomaly detection. Recently, Zhang et al. [34] propose to learn an adaptive decision boundary (ADB) that serves as the basis for judging outliers.

3 Approach

In this paper, we propose a robust relation extraction method for real-world settings. By integrating a rejection option, the classifier can effectively detect whether inputs express unseen relations instead of making arbitrarily bad predictions. Off-the-shelf OpenRE methods can be used to discover potential relations in these rejected samples.

The problem setting in this work is formally stated as follows. Let \(\mathcal {K}=\{\mathcal {R}_1,...,\mathcal {R}_k\}\) be a set of known relations and \(\mathcal {U}=\{\mathcal {R}_{k+1},...,\mathcal {R}_n\}\) be a set of unseen relations, where \(\mathcal {K} \cap \mathcal {U}=\emptyset \). Let \(\mathcal {X}\) be an input space. Given the training data \(\mathcal {D}^\ell =\{(x_i^\ell ,y_i^\ell )\}_{i=1,...,N}\) where \(x_i^\ell \in \mathcal {X}\), \(y_i^\ell \in \mathcal {K}\), we aim to construct a mapping rule \(f:\mathcal {X}\rightarrow \{\mathcal {R}_1,...,\mathcal {R}_k,\mathcal {R}^*\}\), where \(\mathcal {R}^*\) denotes the rejection option. Let \(\mathcal {D}^u=\{(x_i^u,y_i^u)\}_{i=1,...,M}\) be the test dataset where \(y_i^u \in \mathcal {K}\cup \mathcal {U}\). A desirable mapping rule f should meet the following objective as closely as possible:

$$ f(x_i^u)=\left\{ \begin{array}{cl} y_i^u & {y_i^u \in \mathcal {K}}\\ \mathcal {R}^* & {y_i^u \in \mathcal {U}}. \end{array} \right. $$

3.1 Method Overview

We approach the problem by introducing contrastive learning techniques. As illustrated in Fig. 2, the proposed method comprises four major components: relation representation encoder \(\boldsymbol{g}(\cdot )\), confidence-calibrated classifier \(\boldsymbol{\eta }(\cdot )\), class-preserving transformations \(\mathcal {T}\), and the OpenRE module.

Fig. 2. An overview of the proposed method. Three steps are included: (1) Contrastive training techniques and a set of class-preserving transformations are utilized to learn sufficient features. (2) The classifier extracts known relations and rejects samples of unseen relations according to these features. (3) An off-the-shelf OpenRE method (SelfORE) is incorporated to discover unseen relations in the rejected samples.

Our overview starts from the first two components. There is no doubt that an encoder and classifier are the basic components of a supervised relation extractor. However, the supervised training objective does not encourage the model to learn features beyond the bare minimum necessary to discriminate between known relations. Consequently, the classifier can misclassify unseen relations to known relations with high confidence.

In order to calibrate the confidence of the classifier, we introduce contrastive learning techniques. Given a training batch \(\mathcal {B}\), an augmented batch \(\widetilde{\mathcal {B}}\) is obtained by applying a random transformation \(t\in \mathcal {T}\) to mask partial features. The supervised contrastive learning objective then maximizes or minimizes representation agreement according to whether the relations of two samples are the same. By doing this, the model is forced to find more features to discriminate between relations, and the classifier can be calibrated. Based on the confidence-calibrated classifier, unknown relations are rejected if the maximum softmax probability of the classifier does not exceed a preset threshold \(\theta \).

In order to discriminate unknown relations rather than just detect their existence, we further integrate the off-the-shelf OpenRE method into our framework. The samples rejected by the classifier are sent to the OpenRE module to detect potential unknown relations.

3.2 Relation Representation Encoder

Given a relation instance \(x_i^\ell =(\boldsymbol{w}_i,h_i,t_i)\in \mathcal {D}^\ell \), where \(\boldsymbol{w}_i=\{w_1,w_2,...,w_n\}\) is the input sentence and \(h_i=(s^h,e^h)\), \(t_i=(s^t,e^t)\) mark the positions of the head and tail entities, the relation representation encoder \(\boldsymbol{g}(\cdot )\) aims to encode contextual relational information into a fixed-length representation \(\boldsymbol{r}_i=\boldsymbol{g}(x_i)\in \mathbb {R}^d\). We opt for simplicity and adopt the commonly used BERT [4] to obtain \(\boldsymbol{r}_i\), although other network architectures could be used without any constraints. Formally, the process of obtaining \(\boldsymbol{r}_i\) is:

$$\begin{aligned} \boldsymbol{h}_1,...,\boldsymbol{h}_n=\textrm{BERT}(w_1,...,w_n)\end{aligned}$$
(1)
$$\begin{aligned} \boldsymbol{h}_{ent}=\textrm{MAXPOOL}(\boldsymbol{h}_s,...,\boldsymbol{h}_e)\end{aligned}$$
(2)
$$\begin{aligned} \boldsymbol{r}_{i}=\left\langle \boldsymbol{h}_{head}|\boldsymbol{h}_{tail}\right\rangle , \end{aligned}$$
(3)

where \(\boldsymbol{h}_1,...,\boldsymbol{h}_n\) are the hidden states of the input sentence after BERT encoding, the subscripts s and e represent the start and end positions of an entity, \(\boldsymbol{h}_{ent}\) is the max-pooled representation of an entity span and is computed separately for the head entity \(\boldsymbol{h}_{head}\) and the tail entity \(\boldsymbol{h}_{tail}\), and \(\left\langle \cdot |\cdot \right\rangle \) is the concatenation operator.
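For concreteness, a minimal PyTorch sketch of this encoder is given below. The class name, the choice of bert-base-uncased, and the span format are our illustrative assumptions rather than the authors' released implementation.

```python
import torch
from transformers import BertModel

class RelationEncoder(torch.nn.Module):
    """Sketch of g(.): BERT encoding, max-pooling over entity spans, concatenation."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask, head_spans, tail_spans):
        # h_1, ..., h_n = BERT(w_1, ..., w_n)                         (Eq. 1)
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = []
        for spans in (head_spans, tail_spans):
            # h_ent = MAXPOOL(h_s, ..., h_e) for each entity span     (Eq. 2)
            pooled.append(torch.stack([
                hidden[b, s:e + 1].max(dim=0).values
                for b, (s, e) in enumerate(spans)
            ]))
        # r_i = <h_head | h_tail>                                     (Eq. 3)
        return torch.cat(pooled, dim=-1)
```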

3.3 Confidence-Calibrated Classifier

In order to alleviate overconfidence on unseen relations, we introduce contrastive learning techniques to calibrate the classifier. A well-calibrated classifier should not only accurately classify known relations, but also assign unseen relations a low confidence score, i.e., a low maximum softmax probability \(\max _y p(y|x)\).

Given a training batch \(\mathcal {B}=\{(x_i^\ell ,y_i^\ell )\}_{i=1}^B\), we obtain an augmented batch \(\widetilde{\mathcal {B}}=\{(\widetilde{x}_i^\ell ,y_i^\ell )\}_{i=1}^B\) by applying a random transformation \(t\in \mathcal {T}\) to \(\mathcal {B}\). For brevity, the superscript \(\ell \) is omitted in the rest of this section. For each labeled sample \((\widetilde{x}_i, y_i)\), \(\widetilde{\mathcal {B}}\) can be divided into two subsets \(\widetilde{\mathcal {B}}_{y_i}\) and \(\widetilde{\mathcal {B}}_{-y_i}\), where \(\widetilde{\mathcal {B}}_{y_i}\) contains the samples of relation \(y_i\) and \(\widetilde{\mathcal {B}}_{-y_i}\) contains the rest. The supervised contrastive learning objective is defined as follows:

$$\begin{aligned} \mathcal {L}^{sup}_{cts}(\mathcal {B}, \mathcal {T})=\frac{1}{2B}\sum _{i=1}^{2B}\mathcal {L}_{cts}(\widetilde{x}_i,\widetilde{\mathcal {B}}_{y_i}\backslash \{\widetilde{x}_i\},\widetilde{\mathcal {B}}_{-y_i})\end{aligned}$$
(4)
$$\begin{aligned} \mathcal {L}_{cts}(x,\mathcal {D}^+,\mathcal {D}^-)=-\frac{1}{|\mathcal {D}^+|}\log \frac{\sum _{x^\prime \in \mathcal {D}^+}q(x,x^\prime )}{\sum _{x^\prime \in \mathcal {D}^+\cup \mathcal {D}^-}q(x,x^\prime )}\end{aligned}$$
(5)
$$\begin{aligned} q(x,x^\prime )=\exp (\textrm{sim}(\boldsymbol{z}(x),\boldsymbol{z}(x^\prime ))/\tau ), \end{aligned}$$
(6)

where \(|\mathcal {D}|\) denotes the number of samples in \(\mathcal {D}\), \(\textrm{sim}(\cdot ,\cdot )\) denotes cosine similarity and \(\tau \) denotes a temperature coefficient. Following Chen et al. [3], we use an additional projection layer \(\boldsymbol{t}\) to obtain the contrastive feature \(\boldsymbol{z}(x)=\boldsymbol{t}(\boldsymbol{g}(x))\).
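The following is a minimal PyTorch sketch of Eqs. (4)–(6), assuming the projected features of a batch and its augmented views are stacked into a single tensor; the function name, the default temperature, and the handling of anchors without positives are our assumptions.

```python
import torch
import torch.nn.functional as F

def sup_contrastive_loss(z, y, tau=0.1):
    """Sketch of Eqs. (4)-(6). z: (2B, d) projected features z(x) for a batch and
    its augmented views; y: (2B,) relation labels aligned with the rows of z."""
    z = F.normalize(z, dim=-1)                        # cosine similarity via dot products
    q = torch.exp(z @ z.t() / tau)                    # Eq. (6)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (y.unsqueeze(0) == y.unsqueeze(1)) & ~eye   # D+: same relation, excluding self
    valid = pos.any(dim=1)                            # skip anchors with no positives
    # Eq. (5): -(1/|D+|) * log( sum over D+ / sum over D+ and D- )
    ratio = q.masked_fill(~pos, 0).sum(1) / q.masked_fill(eye, 0).sum(1)
    per_anchor = -torch.log(ratio[valid]) / pos.sum(1)[valid]
    return per_anchor.mean()                          # Eq. (4): average over the anchors
```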

Benefiting from contrastive training, the encoder \(\boldsymbol{g}(\cdot )\) learns rich features to discriminate between known and novel relations. Accordingly, we train a confidence-calibrated classifier \(\boldsymbol{\eta }(\cdot )\) on top of \(\boldsymbol{g}(\cdot )\) as follows:

$$\begin{aligned} \mathcal {L}=\mathbb {E}_{(x,y)\sim \mathcal {D}^\ell }[\mathcal {L}_{ce}(\boldsymbol{\eta } (\boldsymbol{g}(x)),y)], \end{aligned}$$
(7)

where \(\mathcal {L}_{ce}\) is the cross-entropy loss. In addition, we can easily obtain a large amount of training data \(\mathcal {D}^{dist}\) through distant supervision. None of the \(y_i^{dist}\) in \(\mathcal {D}^{dist}\) are known relations, that is, \(\{y_i^{dist}\}\cap \{y_j^\ell \}=\emptyset \). These data are only used as negative examples, so the noise in them is not a problem. We force the classifier output distribution on negative examples to approximate the uniform distribution by optimizing the cross-entropy between them. Using \(\mathcal {D}^{dist}\), we optimize the model with the following objective instead of Eq. 7:

$$\begin{aligned} \mathcal {L}^{dist}=\mathcal {L}+\lambda \mathbb {E}_{x\sim \mathcal {D}^{dist}}[\mathcal {L}_{ce}(\boldsymbol{\eta }(\boldsymbol{g}(x)),y_{uni})], \end{aligned}$$
(8)

where \(\mathcal {L}\) refers to the optimization objective of Eq. 7, \(\lambda \) is a hyperparameter that balances the known relation data and the distantly supervised data, and \(y_{uni}\) represents a uniform distribution. We can achieve good results simply by setting \(\lambda \) to 1 without further tuning.
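As a rough sketch of Eqs. (7)–(8), the calibration objective can be written as below, assuming PyTorch ≥ 1.10 (which accepts probability targets in cross_entropy); the function name and argument layout are our assumptions.

```python
import torch
import torch.nn.functional as F

def calibration_loss(logits_known, labels_known, logits_dist=None, lam=1.0):
    """Eq. (7): cross-entropy on known-relation samples.  Eq. (8): additionally push
    the classifier output on distantly supervised negatives toward y_uni (uniform)."""
    loss = F.cross_entropy(logits_known, labels_known)                 # Eq. (7)
    if logits_dist is not None:
        k = logits_dist.size(-1)                                       # number of known relations
        y_uni = torch.full_like(logits_dist, 1.0 / k)                  # uniform target distribution
        loss = loss + lam * F.cross_entropy(logits_dist, y_uni)        # Eq. (8), lambda = 1 by default
    return loss
```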

Based on the confidence-calibrated classifier, we specify the rejection rule \(f(\cdot )\) as follows:

$$\begin{aligned} f(x_i)=\left\{ \begin{array}{cl} \mathop {\arg \max }_{y} p(y|x_i) & {\max _y p(y|x_i)>\theta }\\ \mathcal {R}^* & {\text {otherwise}}, \end{array} \right. \end{aligned}$$
(9)

where \(\theta \) is a threshold hyperparameter, the posterior probability \(p(y|x_i)\) is the output of the classifier \(\boldsymbol{\eta }\), and \(\mathcal {R}^*\) denotes the rejection option.
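In code, Eq. (9) amounts to thresholding the maximum softmax probability; a minimal sketch follows, where encoding the rejection option \(\mathcal {R}^*\) as -1 is our convention.

```python
import torch

def predict_with_rejection(logits, theta=0.5, reject_id=-1):
    """Eq. (9): predict the most probable known relation if its probability exceeds
    theta, otherwise return the rejection option R* (encoded here as reject_id)."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    return torch.where(conf > theta, pred, torch.full_like(pred, reject_id))
```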

3.4 Class-Preserving Transformations

Transformations are a core component of contrastive learning. Our intuition in designing transformations is that masking features from different views forces the model to find more features to discriminate between known relations, and these new features can play a vital role in recognizing unseen relations. Why does this work? As shown in Fig. 1, due to the shortcut phenomenon, the model is inclined to memorize the relation between specific entities, so it makes mistakes when a new relation holds between the same entity pair. Intuitively, through the mask mechanism the model can mask out some features that belong to Obama and the United States, and it then has to find other features to distinguish president of from other relations. Therefore it does not learn the shortcut bias Obama + the United States = president of. In this work, we design the following three class-preserving transformations that mask partial features.

Token Mask. Token mask works in the process of sentence encoding. In this transformation, we randomly mask a certain proportion of tokens to generate a new view of relation representation.

Random Mask. Random mask also works in the process of sentence encoding. Instead of completely masking the representations of selected tokens, this transformation considers each dimension of each token representation independently.

Feature Mask. Feature mask works after sentence encoding. Given a relation instance \(x_i^\ell \in \mathcal {D}^\ell \), we first obtain its relation representation \(\boldsymbol{r}_i=\boldsymbol{g}(x_i)\). Then we randomly mask a certain proportion of feature dimensions of \(\boldsymbol{r}_i\) to generate a new view.
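A minimal sketch of the three transformations is given below; the mask ratios and the decision to leave special tokens and entity markers untouched are our assumptions, since masking them could break the class-preserving property.

```python
import torch

def token_mask(input_ids, mask_token_id, maskable, ratio=0.15):
    """Token mask: replace a random proportion of maskable tokens with [MASK]
    before encoding (maskable is a bool tensor excluding special/entity tokens)."""
    chosen = (torch.rand(input_ids.shape, device=input_ids.device) < ratio) & maskable
    return torch.where(chosen, torch.full_like(input_ids, mask_token_id), input_ids)

def random_mask(token_reprs, ratio=0.15):
    """Random mask: zero out each dimension of each token representation independently."""
    keep = (torch.rand_like(token_reprs) >= ratio).float()
    return token_reprs * keep

def feature_mask(relation_repr, ratio=0.15):
    """Feature mask: zero out a random proportion of the dimensions of r_i = g(x_i)."""
    keep = (torch.rand_like(relation_repr) >= ratio).float()
    return relation_repr * keep
```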

More complicated and diverse transformations would likely bring additional improvements; we leave this for future work.

3.5 OpenRE Module

We introduce the OpenRE module for the integrity of the framework, although it is not our main concern. Based on the rejection rule f described in Sect. 3.3, we can classify samples of known relations while rejecting unseen relations. In this section, we take a step further: by integrating an off-the-shelf OpenRE method, we try to discover the potential unseen relations in the rejected samples instead of only detecting their existence. We adopt SelfORE [13], a clustering-based OpenRE method, as the building block of our OpenRE module; various other methods can also be used as alternatives without any constraints. More details about OpenRE methods can be found in the related papers. Overall, the method proposed in this paper is detailed in Algorithm 1.

Algorithm 1

4 Experimental Setup

In this section, we describe the datasets for training and evaluating the proposed method. We also detail the baseline models for comparison. Finally, we clarify the implementation details.

4.1 Datasets

We conduct our experiments on two well-known relation extraction datasets. In addition, a distantly supervised dataset is used in an auxiliary way.

FewRel. The Few-Shot Relation Classification Dataset [10]. FewRel is a human-annotated dataset containing 80 relation types, each with 700 instances. We use the top 40 relations as known and the middle 20 relations as unseen. Since the relations of the FewRel dataset are exactly the same as those of FewRel-distant, we hold out the last 20 relations for use in distant supervision. The training set contains 25600 randomly selected samples of known relations. In order to evaluate rejection performance on the unseen relations, the test/validation set contains 3200/1600 samples composed of known and unseen relations.

TACRED. The TAC Relation Extraction Dataset [37]. TACRED is a human-annotated large-scale relation extraction dataset that covers 41 relation types. Similar to the setting of FewRel, we use the top 31 relations as known and the remaining 10 relations as unseen. The training set consists of 18113 randomly selected samples of known relations. The validation and test sets contain 900 and 1800 samples respectively, including both known and unseen relations. It should be noted that 50% of the unseen relation samples in the validation and test sets are no_relation.

FewRel-distant. FewRel-distant contains the distantly-supervised data obtained by the authors of FewRel before human annotation. We use this dataset as the distantly supervised data in our experiments.

4.2 Baselines and Evaluation Metrics

MSP [11]. MSP assumes that correctly classified examples tend to have greater maximum softmax probabilities than examples of unseen classes. The maximum softmax probability is therefore used as the confidence score for unseen class detection.

MSP-TC [17]. MSP-TC uses maximum softmax probabilities with temperature scaling and small perturbations to enhance the separability between known and unseen classes, allowing for more effective detection.

DOC [29]. DOC builds n 1-vs-rest sigmoid classifiers for the n known classes. The maximum probability of these binary classifiers is used as the confidence score for unseen class detection.

LMCL [18]. Large margin cosine loss (LMCL) aims to learn discriminative deep representations. It forces the model not only to classify correctly but also to maximize inter-class variance and minimize intra-class variance. Based on the learned representations, the local outlier factor (LOF) is used to detect unseen classes.

ADB [34]. Labeled samples of known classes are first used for representation learning. The learned representations are then utilized to learn an adaptive spherical decision boundary for each known class. Samples falling outside the hyperspheres are rejected.

Evaluation Metrics. We follow previous work [18, 34] and treat all the unseen relations as a single rejected class. Accuracy and macro \(F_1\) are used to evaluate unseen relation detection.
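For reference, a small sketch of this evaluation protocol, under our assumption that both the gold labels of unseen relations and the rejected predictions are mapped to a single extra class id:

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred, known_ids, reject_id=-1):
    """Collapse every unseen relation into one rejected class, then score accuracy
    and macro F1 over the known classes plus the rejected class."""
    y_true = [y if y in known_ids else reject_id for y in y_true]
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred, average="macro")
```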

4.3 Implementation Details

We use Adam [16] as the optimizer, with a learning rate of \(1e-4\) and a batch size of 100 for all datasets. If results on the validation set do not improve for 10 epochs, we stop training to avoid overfitting. All experiments are conducted on an NVIDIA GeForce RTX 3090 with 24 GB of memory.

5 Results and Analysis

In this section, we present the experimental results of our method on FewRel and TACRED datasets to demonstrate the effectiveness of our method.

5.1 Main Results

Table 2. Main results of unseen relation detection with different known class proportions (25%, 50% and 75%) on two relation extraction datasets. Compared with the best results of all baselines, our method improves the \(F_1\)-score by an average of 2.6% and 3.5% on the FewRel and TACRED datasets, respectively.
Table 3. Macro \(F_1\)-score of known relation classification with different proportions of known relations.

Our experiments in this section focus on the following three related questions.

Can the Proposed Method Effectively Detect Unseen Relations? To answer this question, we consider all the known relations as one predicted class and the remaining unseen relations as one rejected class. Table 2 reports model performance on the FewRel and TACRED datasets, showing that the proposed method achieves state-of-the-art results on unseen relation detection. Benefiting from the contrastive training objectives and the carefully designed transformations, the shortcut phenomenon is effectively alleviated and the model learns sufficient features to discriminate between known and unseen relations. Therefore, the proposed method consistently outperforms the compared baselines by a large margin across different mixing-ratio settings.

Does the Detection of Unseen Relations Impair the Extraction of Known Relations? Integrating the rejection option can make the classifier more robust in real applications. However, we do not want unseen relation detection to impair known relation classification, which is the basic function of the classifier. From Table 3 we can observe that the proposed model not only effectively detects unseen relations but also accurately classifies known relations. This demonstrates that the designed transformations do not affect the original relational semantics, so the rich features obtained by contrastive learning remain discriminative for the known relations.

Fig. 3. ROC curves on two datasets.

Can the Model Achieve Superior Performance Under Different Threshold Settings? We show the receiver operating characteristic (ROC) curves in Fig. 3. The area under the ROC curve (AUROC) summarizes the performance of a classifier at detecting unseen relations across different thresholds. From Fig. 3 we can observe that the AUROC of the proposed method is the largest. Therefore, the proposed method retains its advantage under different threshold settings.
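Assuming, as in Eq. (9), that the maximum softmax probability serves as the detection score (higher means more likely known), AUROC can be computed as follows; the function name and tensor/array layout are our assumptions.

```python
import torch
from sklearn.metrics import roc_auc_score

def auroc_known_vs_unseen(logits, is_known):
    """Threshold-free AUROC for detecting known vs. unseen relations from the
    maximum softmax probability; is_known is 1 for known samples, 0 for unseen."""
    scores = torch.softmax(logits, dim=-1).max(dim=-1).values
    return roc_auc_score(is_known, scores.cpu().numpy())
```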

5.2 Ablation Study

To understand the effects of each component of the proposed model, we conduct an ablation study and report the results (macro \(F_1\)) on the two datasets in Table 4. The results show that the detection of unseen relations is degraded if any transformation is removed. This indicates that (1) the transformations force the model to learn sufficient features through the mask mechanism applied from different views, and the learned features are beneficial for the detection of unseen relations; (2) since the transformations operate on different views, they can be superimposed to further enhance the detection of unseen relations. In addition, we find that distantly supervised data can significantly improve the detection of unseen relations. Because there are a large number of diverse relations in external knowledge bases, we can easily construct a large number of negative samples, so this improvement can be seen as a free lunch.

Table 4. Ablation study of our method.

5.3 Relation Representation Visualization

Fig. 4. Visualization of the relation representations after t-SNE dimensionality reduction. The representations are colored with their ground-truth relation labels; black triangles indicate unknown relations. The four panels, from top left to bottom right, show the relation representations at the initial state, after supervised optimization, after contrastive optimization, and after both.

To intuitively show the influence of the rich features learned through contrastive training, we visualize the relation representations with t-SNE [23]. We select five semantically similar known relations from the FewRel dataset and randomly select 40 samples for each of them. We also select 100 hard samples of unseen relations misclassified by the MSP method to show the superiority of our method. From the visualization results in Fig. 4, we can observe that before training (upper left) the relation representations are scattered in the semantic space. After supervised training (upper right), samples can be roughly divided by relation, but different relations are still close to each other; this is consistent with the shortcut features of neural networks, and samples of unseen relations are mixed with known relation samples. After contrastive training (lower left), the model learns sufficient features to discriminate unseen relations, so samples of unseen relations are effectively separated. Finally, the best relation representations are obtained by applying both supervised and contrastive optimization (lower right).
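A minimal sketch of how such a plot could be produced with scikit-learn and matplotlib; the perplexity, marker styles, and output path are our choices.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_relation_space(representations, labels, is_unseen, out_path="relation_tsne.png"):
    """Project relation representations (numpy array, shape (N, d)) to 2D with t-SNE;
    labels and is_unseen are numpy arrays; known samples are colored by relation label
    and unseen samples are drawn as black triangles."""
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(representations)
    plt.scatter(emb[~is_unseen, 0], emb[~is_unseen, 1], c=labels[~is_unseen], s=10)
    plt.scatter(emb[is_unseen, 0], emb[is_unseen, 1], c="black", marker="^", s=10)
    plt.savefig(out_path, dpi=300)
```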

5.4 A Case Study on OpenRE

Table 5. Extracted and golden surface-form relation names on TACRED.

For the samples rejected by the classifier, an off-the-shelf OpenRE method can be used to discover potential unseen relations. In this section, we provide a brief case study of the unseen relations discovered by SelfORE [13]. The OpenRE module outputs the cluster assignment of the rejected samples, and we extract relation names using the most frequent n-grams in each cluster; the extraction results are shown in Table 5. By integrating the OpenRE module, our method completes (1) the classification of known relations, (2) the rejection of unseen relations, and (3) the discovery of unseen relations. Based on the above process, robust relation extraction in real applications is realized.

6 Conclusions

In this work, we introduce a relation extraction method with a rejection option to improve robustness in real-world applications. The proposed method employs contrastive training techniques and a set of carefully designed transformations to learn sufficient features, with which known relations can be classified and unseen relations rejected. Unseen relations among the rejected samples can then be discovered by incorporating off-the-shelf OpenRE methods. Experimental results show that our method outperforms state-of-the-art methods on unseen relation rejection.