
1 Introduction

Relation extraction is fundamental to building large-scale knowledge bases such as semantic networks and knowledge graphs [1,2,3]. However, conventional relation extraction methods such as semi-supervised and distantly supervised approaches generally handle pre-defined relations and cannot reliably identify emerging relations in the real world.

Against this background, OpenRE has been widely studied for its ability to mine novel relations from massive text data. At present, OpenRE is mainly based on unsupervised methods, which can be divided into two categories. The first group consists of pattern extraction models [4,5,6], which usually use sentence analysis tools, combined with linguistic and domain knowledge, to construct hand-crafted rules based on lexical, syntactic and semantic features. When performing relation extraction, different relation types are obtained by matching these rules against the preprocessed text. However, as the set of relational patterns expands, the complexity of the system grows greatly, making it difficult to apply widely in the open domain. The second group discovers relation types through unsupervised methods [7,8,9]. This line of work optimizes the representation of relations to improve the accuracy of unsupervised clustering while overcoming the instability of unsupervised training. Recently, some RE methods have begun to study better utilization of hand-crafted features, using only named entities to induce relation types [10]. The hierarchy information in relation types has been further exploited for better novel relation extraction [11].

However, much research has shown that complex linguistic information requires high-dimensional embeddings for the meaning of the text to become clear [12]. This complex information may include local syntactic [13] and semantic structures [14]. Therefore, positions and relative distances in the high-dimensional vector space are not fully consistent with relational semantic similarity. Especially before model training starts, even with deep neural networks, different classes may still overlap in high-dimensional space [15].

In this work, we propose a relational instance-based clustering method with contrastive learning. To let the model better mine the information of the relation instances themselves and produce better clustering results, the nonlinear mapping is optimized using the difference information from the constructed contrastive view of the relation instances and the distribution information of the original instance dataset. High-dimensional relational instance features carrying complex information are thereby transformed into relation-oriented low-dimensional representations. Specifically, we pull together instances representing the same relation while pushing apart those from different ones by jointly optimizing a distribution loss and a contrastive loss, so that the learned representation is cluster-friendly. In addition, the proposed method obtains supervision from the data itself and its corresponding augmented dataset and iteratively learns better feature representations for relation classification, which improves the quality of supervision and in turn increases cluster purity and the separation between different clusters.

Overall, our work makes the following contributions: (1) we propose a self-supervised framework which can fine-tune pretrained MLMs into capable universal relational encoders and learn to cluster relational data; (2) we show how to use contrastive learning to learn and improve representations of relation instances in a self-supervised manner.

2 Related Work

Self-supervised learning has recently achieved excellent results on multiple tasks in the image and text domains, and many studies have been further developed thanks to its effectiveness in feature representation. The quality of the learned representations is supported by a theoretical framework for contrastive learning [16], which learns features from unlabeled data and formalizes the concept of semantic similarity through latent classes to improve the performance of classification tasks. Hu et al. [9] propose an adaptive clustering algorithm and use pseudo-labels of relations as self-supervised signals to optimize their semantic representations. Recently, there has been increasing interest in contrastive learning over individual raw sentences based on PLMs [15, 17, 18].

Meanwhile, inspired by research on contrastive learning in computer vision [19, 20], we utilize “multi-view” contrastive learning for relation extraction. Previous work mainly uses sentences as the smallest unit of text input, builds augmented datasets by randomly masking characters or replacing words, and uses semantic similarity as the training objective. In contrast, our work takes entity word pairs as the minimum granularity of semantic representation, abstracts various types of relations, and obtains their vector representations with the help of clustering. It not only retains the advantage of unsupervised learning, which can deal with undefined relation types, but also exploits the advantage of supervised learning, which provides strong guidance for relational feature learning.

Fig. 1. Overall architecture of RICL

3 Methodology

In this work, we propose a simple and effective approach to relation clustering, which exploits relation instance distribution information in unlabeled data and semantic information from pretrained models, enabling the model to optimize the representation of relations.

In order to alleviate the overlap of different relation clusters in the representation space, we improve the clustering of unlabeled data by contrastive learning to promote better separation. The proposed method is shown in Fig. 1.

We build a “multi-view” of the training corpus, gradually optimize the representation of relation instances in a joint learning manner, aggregate them to generate pseudo-labels, and fine-tune the pre-trained language model through classification. As shown in Fig. 1, we iteratively perform the following steps:

(1) First, we use pretrained BERT as the encoder of relational instances \({\{{h}_{i}\}}_{i=1,\dots ,N}\); each relational instance \({h}_{i}\) is the output vector composed of an entity pair. However, the high-dimensional representation \(h\) contains too much information (structural features, semantic information, etc.), and directly clustering these high-dimensional vectors does not align well with the relations corresponding to the instances.

(2) In order to better reflect the semantic similarity between instances through distances in the representation space, we transform the high-dimensional representations \({h}_{i}\) into low-dimensional representations \({h}_{i}^{\prime}\) through a non-linear mapping \(g\). However, the quality of pseudo-labels produced by direct clustering is not high, which is not conducive to downstream classification tasks.

(3) In order to reduce the negative impact of pseudo-label errors, we apply different dropouts under the same pre-trained model to construct a positive set, and treat the other data in the same batch as the negative set. During training, the representation of relation instances is optimized toward aggregating similar relational instances and separating different ones, improving the quality of the pseudo-labels produced by clustering. The pseudo-labels serve as prior knowledge of the dataset and are finally used for supervised relation classification. The above steps are executed iteratively until the clustering result stabilizes.

3.1 Relational Instance Encoder

The relational instance encoder extracts the semantic relation representation between two arbitrary given entities in a sentence. We utilize a large pretrained language model to efficiently encode entity pairs and their contextual information.

For a sentence \(S=[{s}_{1},\dots ,{s}_{n}]\), we introduce two pairs of special identifiers \([E1\backslash ], [\backslash E1], [E2\backslash ], [\backslash E2]\) to mark entities and inject them into \(S=[{s}_{1},\dots ,[E1\backslash ],{s}_{i},\dots ,{s}_{k},[\backslash E1],\dots ,[E2\backslash ],{s}_{m},\dots ,{s}_{j},[\backslash E2],\dots ,{s}_{n}]\). We adopt BERT [21] as our encoder \(l(\bullet )\) due to its strong performance and wide application in extracting semantic information. Formally:

$$ v_{1} ,...,v_{n} = l\left( {s_{1} ,...,s_{n} } \right) $$
(1)
$$ h = \left[ {v_{[E1\backslash ]} ,v_{[E2\backslash ]} } \right] $$
(2)

where \({v}_{i}\) is the word vector generated by BERT. We use the concatenation of \({v}_{[E1\backslash ]}\) and \({v}_{[E2\backslash ]}\) as the representation of the relational instance. This form of relational representation has been widely used in previous RE methods [9, 22, 23].
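To make the encoding step concrete, the following is a minimal sketch of the relational instance encoder, assuming the HuggingFace Transformers library; the marker strings (written here in the common [E1]…[/E1] form, corresponding to the markers above), the function name, and the example sentence are illustrative rather than taken from the original implementation.

```python
# Minimal sketch of the relational instance encoder (Sect. 3.1), assuming the
# HuggingFace Transformers library. Marker strings and example are illustrative.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

# Register the two pairs of entity markers as special tokens.
markers = ["[E1]", "[/E1]", "[E2]", "[/E2]"]
tokenizer.add_special_tokens({"additional_special_tokens": markers})
encoder.resize_token_embeddings(len(tokenizer))

def encode_instance(marked_sentence: str) -> torch.Tensor:
    """Return h = [v_[E1]; v_[E2]], the concatenated start-marker vectors (Eq. 2)."""
    enc = tokenizer(marked_sentence, return_tensors="pt",
                    truncation=True, max_length=128)
    out = encoder(**enc).last_hidden_state.squeeze(0)            # (seq_len, 768)
    ids = enc["input_ids"].squeeze(0)
    e1 = (ids == tokenizer.convert_tokens_to_ids("[E1]")).nonzero()[0]
    e2 = (ids == tokenizer.convert_tokens_to_ids("[E2]")).nonzero()[0]
    return torch.cat([out[e1].squeeze(0), out[e2].squeeze(0)])   # (1536,)

h = encode_instance("[E1] Tegel Airport [/E1] is located in [E2] Berlin [/E2] .")
```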

3.2 Instance-Relational Contrastive Learning

We use the distribution information of relation instances and their own feature information to build a joint model to achieve deep clustering. As shown in Fig. 1, our joint learning model is composed of two components, \(f(\bullet )\) and \(g(\bullet )\), using clustering loss and contrastive loss, respectively. We describe the specific structure of the model in Sect. 4.

Dropout Noise as Data Augmentation. We use different dropout masks to obtain different vector representations of the same text. Specifically, for each batch \(B={\{{h}_{i}\}}_{i=1}^{M}\), we generate a new vector representation for each relation instance in \(B\) and obtain an augmented batch \({B}^{a}={\{{h}_{i},{\tilde{h }}_{i}\}}_{i=1}^{M}\). The positive pair \({h}_{i}\), \({\tilde{h }}_{i}\) consists of exactly the same relational instance, and their embeddings differ only in dropout masks, while the other \(2M-2\) instances are treated as the negative set \(N\) of this positive pair. Here the dropout rate \(p\) is 0.1.

Given a batch of data \({B}^{a}\), \(\tau \) denotes a temperature parameter. We leverage the standard InfoNCE loss [24] to aggregate the positive pairs together and separate the negative pairs in the embedding space:

$$ L_{a} = - \sum\nolimits_{i = 1}^{M} \log \frac{\exp (\cos (g(h_{i}), g(\tilde{h}_{i}))/\tau )}{\sum\nolimits_{h_{j} \in N} \exp (\cos (g(h_{i}), g(h_{j}))/\tau )} $$
(3)
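The snippet below sketches the dropout-based augmentation together with the InfoNCE objective of Eq. (3). It assumes PyTorch; for self-containedness the dropout sits inside a toy projection head, whereas in RICL the two views come from two forward passes through the same pretrained encoder. The temperature value and the NT-Xent formulation (positive kept in the denominator, mean over anchors) are assumptions rather than details stated in the paper.

```python
# Sketch of dropout augmentation plus the InfoNCE loss of Eq. (3); assumptions noted above.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(z: torch.Tensor, z_tilde: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """z, z_tilde: (M, d) projections of the two dropout views of the same M instances."""
    m = z.size(0)
    reps = torch.cat([z, z_tilde], dim=0)                                    # (2M, d)
    sim = F.cosine_similarity(reps.unsqueeze(1), reps.unsqueeze(0), dim=-1) / tau
    sim.fill_diagonal_(float("-inf"))                                        # drop self-pairs
    targets = torch.arange(2 * m, device=z.device).roll(m)                   # positive of i is i+M (mod 2M)
    return F.cross_entropy(sim, targets)

g = nn.Sequential(nn.Linear(1536, 512), nn.ReLU(), nn.Dropout(0.1),
                  nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 256))
g.train()                                   # keep dropout active so the two views differ
h = torch.randn(32, 1536)                   # a batch of relation instances (M = 32)
loss_a = info_nce(g(h), g(h))               # each call samples a different dropout mask
```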

3.3 Clustering

Different from contrastive learning, clustering focuses on the similarity between different instances, encodes abstract semantic information into representations of relation instances, and finally aggregates instances of the same relation.

The known dataset consists of \(K\) relation classes. The centroid representation of each class is denoted as \({u}_{k}\), \(k\in \{1,...,K\}\). We compute the probability of assigning \({h}_{i}\) to the \({k}^{th}\) cluster using Student's t-distribution [25]:

$$ q_{ik} = \frac{(1 + ||h_{i} - u_{k}||_{2}^{2}/\alpha )^{-\frac{\alpha + 1}{2}}}{\sum\nolimits_{k^{\prime} = 1}^{K} (1 + ||h_{i} - u_{k^{\prime}}||_{2}^{2}/\alpha )^{-\frac{\alpha + 1}{2}}} $$
(4)

Here α denotes the degrees of freedom of the Student's t-distribution, and \({q}_{ik}\) can be regarded as the probability of the cluster assignment. Following Maaten and Hinton [25], we set α = 1.
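The soft assignment of Eq. (4) can be written compactly as in the following PyTorch sketch; the function name and shape conventions are ours, and the centroid matrix u is presumably held by the linear layer \(f(\bullet )\) introduced next.

```python
# PyTorch sketch of the soft assignment in Eq. (4) with alpha = 1.
import torch

def soft_assign(h: torch.Tensor, u: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """h: (M, d) low-dimensional instances, u: (K, d) centroids -> q: (M, K)."""
    dist_sq = torch.cdist(h, u).pow(2)                     # squared Euclidean distances
    q = (1.0 + dist_sq / alpha).pow(-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)                  # rows sum to 1
```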

A linear layer \(f(\bullet )\) is used to fit the centroid of each relation cluster, which is then iteratively refined via the auxiliary distribution proposed by Xie et al. [26]. Concretely, we define \({p}_{ik}\) as the auxiliary probability:

$$ p_{ik} = \frac{q_{ik}^{2}/f_{k}}{\sum\nolimits_{k^{\prime}} q_{ik^{\prime}}^{2}/f_{k^{\prime}}} $$
(5)

where \({f}_{k}={\sum }_{i=1}^{M}{q}_{ik},k=1,\dots ,K\) is the cluster frequency within a batch. The purpose is to encourage learning from high-confidence cluster assignments while improving low-confidence ones and counteracting the bias caused by imbalanced clusters, resulting in better clustering performance.

We optimize the KL divergence loss between the cluster assignment probability and the target distribution:

$$ L_{b} = KL(P||Q) = \sum\nolimits_{i} {\sum\nolimits_{k} {p_{ik} \log \frac{{p_{ik} }}{{q_{ik} }}} } $$
(6)
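Eqs. (5) and (6) amount to sharpening the soft assignments and pulling Q toward the sharpened target P. A PyTorch sketch is given below, operating on the (M, K) assignment matrix q from the previous snippet; detaching P treats it as a fixed target, which follows common DEC-style practice and is our assumption rather than a stated detail of RICL.

```python
# Sketch of the target distribution in Eq. (5) and the KL loss in Eq. (6).
import torch
import torch.nn.functional as F

def target_distribution(q: torch.Tensor) -> torch.Tensor:
    f = q.sum(dim=0)                               # per-cluster frequency within the batch
    p = q.pow(2) / f                               # sharpen high-confidence assignments
    return p / p.sum(dim=1, keepdim=True)

def clustering_loss(q: torch.Tensor) -> torch.Tensor:
    p = target_distribution(q).detach()
    return F.kl_div(q.log(), p, reduction="sum")   # KL(P || Q) summed over i and k
```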

In conclusion, our overall objective is,

$$ L = (1 - \varepsilon )L_{a} + \varepsilon L_{b} $$
(7)

\(\varepsilon \), which balances the clustering loss and the contrastive loss of RICL, is set to 0.65. Note that \({L}_{b}\) is only optimized on the initial data; when minimizing \(L\), only the parameters of \(f(\bullet )\) and \(g(\bullet )\) are updated, while the parameters of \(l(\bullet )\) are left unchanged.

Finally, we obtain \({\{{h}_{i}^{\prime}\}}_{i=1}^{M}\) using the optimized \(g(\bullet )\) and \(f(\bullet )\), and then generate pseudo-labels \({y}^{\prime}\) with the k-means algorithm:

$$ y^{\prime} = Kmeans(h^{\prime}) $$
(8)
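The joint objective in Eq. (7) and the pseudo-labelling step in Eq. (8) are sketched below, assuming PyTorch and scikit-learn; loss_a and loss_b stand for the contrastive and clustering losses from the sketches above, and the k-means settings are illustrative.

```python
# Sketch of Eq. (7) (joint objective) and Eq. (8) (k-means pseudo-labels).
import torch
from sklearn.cluster import KMeans

def joint_loss(loss_a: torch.Tensor, loss_b: torch.Tensor, eps: float = 0.65) -> torch.Tensor:
    return (1 - eps) * loss_a + eps * loss_b       # Eq. (7) with the weight from Sect. 3.3

def pseudo_label(h_prime: torch.Tensor, k: int) -> torch.Tensor:
    """Eq. (8): k-means over the optimized low-dimensional representations h'."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(h_prime.detach().cpu().numpy())
    return torch.as_tensor(labels)
```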

3.4 Relation Classification

Based on the pseudo-labels \({y}^{\prime}\) generated by clustering, we can train the classifier with supervised learning and refine the relational instance representation \(h\) to encode more relational semantic information:

$$ l_{n} = \mu_{\tau } (l_{\theta } (S)) $$
(9)
$$ L_{C} = \mathop {\min }\limits_{\theta ,\tau } \frac{1}{M}\sum\nolimits_{n = 1}^{M} {loss(l_{n} ,one\_hot(y^{\prime}_{n} ))} $$
(10)

where \({\mu }_{\tau }\) denotes the relation classification module parameterized by \(\tau \), and \({l}_{n}\) is a probability distribution over the \(K\) pseudo-labels for the original data. To find the best-performing parameters \(\theta \) of the relational instance encoder and \(\tau \) of the relation classification module, we minimize the above classification loss.
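In practice the classification step of Eqs. (9)-(10) reduces to a standard cross-entropy objective over the pseudo-labels. A minimal PyTorch sketch follows; the single dropout-plus-linear head, the 1536-dimensional input (two concatenated BERT-base vectors), and the placeholder tensors are illustrative assumptions.

```python
# Minimal sketch of the relation classification step in Eqs. (9)-(10).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationClassifier(nn.Module):
    def __init__(self, in_dim: int = 1536, num_classes: int = 10, p: float = 0.1):
        super().__init__()
        self.head = nn.Sequential(nn.Dropout(p), nn.Linear(in_dim, num_classes))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.head(h)                          # logits over the K pseudo-labels

classifier = RelationClassifier()
h = torch.randn(32, 1536)                            # relation instances from l(.)
pseudo = torch.randint(0, 10, (32,))                 # pseudo-labels y'
loss_c = F.cross_entropy(classifier(h), pseudo)      # Eq. (10); class indices replace one-hot targets
```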

4 Experimental Setup

We first introduce publicly available datasets for training and evaluation. Then we briefly introduce the baseline models used for comparison. Finally, we elaborate on the hyperparameter configuration and implementation details of RICL.

4.1 Datasets

We conduct experiments and comparisons on three open-domain datasets.

FewRel. The Few-Shot Relation Classification Dataset is derived from Wikipedia and annotated by humans [27]. FewRel contains 80 relation types, each with 700 instances. Following [7], we use all instances of 64 relations as the training set, and the test set of FewRel consists of 16 randomly selected relations with 1,600 instances.

T-REx SPO and T-REx DS. Both come from the T-REx dataset [28], which is generated by aligning the Wikipedia corpus with Wikidata. We first preprocess each sentence in the dataset: if a sentence contains multiple entity pairs, it is retained once for each distinct entity pair. We then build the two datasets, T-REx SPO and T-REx DS, following Hu et al. [9]. In both datasets, 80% of the sentences are used for model training, and the remaining 20% are split between validation and testing.
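The entity-pair duplication described above is a simple expansion step; a hypothetical sketch is shown below, where the field names ("text", "entity_pairs") are illustrative and not taken from the T-REx release.

```python
# Sketch: duplicate each sentence once per annotated entity pair (field names are assumptions).
from typing import Dict, List

def expand_sentences(corpus: List[Dict]) -> List[Dict]:
    """Each input item holds a sentence and the list of entity pairs found in it."""
    expanded = []
    for item in corpus:
        for pair in item["entity_pairs"]:
            expanded.append({"text": item["text"], "entity_pair": pair})
    return expanded
```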

4.2 Baseline and Evaluation Metrics

We use standard unsupervised evaluation metrics to compare with six baseline algorithms. For all models, we assume the number of target relation classes is known in advance, but no human annotations are available for extracting relations from the open-domain data. We set the number of clusters to the number of ground-truth classes and evaluate performance with B³, V-measure, and ARI [8, 9, 29]. To evaluate the effectiveness of our method, we select the following state-of-the-art OpenRE models for comparison.

VAE [30] consists of a classifier that predicts relations and a factorization model that reconstructs arguments. The model is jointly optimized by reconstructing entities from paired entities and predicted relations.

UIE [8] trains a discriminative relation extraction model by introducing a skewness loss and a distribution distance loss, which make the model predict each relation confidently and encourage an even average prediction over all relations.

SelfORE [9] uses an adaptive clustering algorithm to obtain relation sets based on a large pretrained language model and then uses the pseudo-labels of relations as self-supervised signals to optimize their semantic representations.

EI_ORE [29] conducts Element Intervention, which intervenes on the context and entities respectively to obtain their underlying causal effects, in order to address spurious correlations from entities and context to the relation type.

RW-HAC [31] reconstructs word embeddings and uses single feature reduction to alleviate the feature sparsity problem for relation extraction through clustering.

Etype + [10] consists of two regularization methods and a link predictor and uses only named entity types to induce relation types.

4.3 Implementation Details

Following the settings used in previous work [8, 9, 29, 30], on the T-REx SPO and T-REx DS datasets RICL is trained with 10 relation classes. Although this is lower than the number of real relations in the datasets, it still reveals important insights because the distribution of relations is highly imbalanced over the 10 relation classes used for training and testing.

For Relational Instance Encoder, we use the default tokenizer in BERT to preprocess all datasets and set the max length of a sentence as 128. We use the BERT-base-uncased model to initialize parameters for \(l(\bullet )\) and use BertAdam to optimize the loss.

For Instance-relational Contrastive Learning, we use an MLP \(g(\bullet )\) with fully connected layers of dimensions \({\mathbb{R}}^{d}\)–512–512–256. We randomly initialize the weights following Xie et al. [26]. For Clustering, we use a linear layer \(f(\bullet )\) of size \(256\times K\), with \(K\) indicating the number of clusters, and initialize the cluster centers with the k-means algorithm.
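Under these settings, \(g(\bullet )\) and \(f(\bullet )\) could look like the following PyTorch sketch; the ReLU activations between layers and the bias-free cluster layer are our assumptions, and d = 1536 corresponds to two concatenated 768-dimensional BERT-base vectors.

```python
# Possible realization of g(.) and f(.) under the dimensions described above.
import torch.nn as nn

d, K = 1536, 16                                      # K = number of clusters (e.g., 16 for FewRel)
g = nn.Sequential(nn.Linear(d, 512), nn.ReLU(),
                  nn.Linear(512, 512), nn.ReLU(),
                  nn.Linear(512, 256))
f = nn.Linear(256, K, bias=False)                    # f.weight (K x 256) holds the cluster centroids
```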

For Relation Classification, we use a fully connected layer as \({\mu }_{\tau }\) and set the dropout rate to 10%, the learning rate to 5e-5, and the warm-up rate to 0.1. In the process of fine-tuning BERT, we freeze its first 8 layers. All experiments are conducted on an NVIDIA GeForce RTX 3090 with 24 GB of memory.
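Freezing the first 8 layers can be done as in the sketch below, assuming a HuggingFace BertModel; the paper does not say whether the embedding layer is also frozen, so only the transformer layers are touched here.

```python
# Sketch: freeze the first 8 BERT transformer layers during fine-tuning.
from transformers import BertModel

encoder = BertModel.from_pretrained("bert-base-uncased")
for layer in encoder.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False
```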

5 Results and Analysis

In this section, we present the experimental results of RICL on three open-domain datasets and verify the rationality of the framework through ablation experiments. Finally, we demonstrate its effectiveness by combining data characteristics with visual analysis.

Table 1. Main results on three relation extraction datasets.

5.1 Main Results

Table 1 reports model performance on the T-REx SPO, T-REx DS, and FewRel datasets and shows that the proposed method achieves state-of-the-art results on the OpenRE task. Benefiting from the rich information in the pretrained model, RICL exploits the relation distribution in unlabeled data and optimizes the relation representation through contrastive learning so as to achieve a better clustering effect, thus greatly surpassing previous cluster-based baselines.

Table 2. Ablation results on T-REx SPO and FewRel

5.2 Ablation Study

In order to study the effect of each component in the proposed framework, we conduct ablation experiments on two datasets; the results are presented in Table 2. The results show that model performance degrades if \({L}_{a}\) is removed, indicating that instance-relational contrastive learning produces superior relation embeddings from unlabeled data. It is worth noting that clustering plays an important role in RICL: it prevents excessive separation of instances of the same relation in the space and avoids the collapse of the relational semantic space. At the same time, it provides guidance for downstream relation classification and optimizes the representation of relation instances. In addition, jointly optimizing the clustering and contrastive objectives is also very important: while the overlap of different relation classes in the representation space is alleviated, different instances under the same class are aggregated.

Fig. 2. Visualization of feature embeddings on FewRel-5

5.3 Visualization and Analysis

To further explore the performance of RICL and the rationality of its design, we randomly select 5 relation types from the FewRel dataset and visualize the embedded features from BERT-base-uncased (left) and RICL (right) with t-SNE in Fig. 2. This makes it convenient to observe changes in the class distribution.
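The visualization itself is straightforward to reproduce; a sketch using scikit-learn's t-SNE and matplotlib is given below, where the feature matrix and labels are placeholders for the embedded instances of the five sampled relations.

```python
# Sketch of the t-SNE visualization; features and labels are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.randn(3500, 256)          # e.g., 5 classes x 700 instances
labels = np.repeat(np.arange(5), 700)
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=4, cmap="tab10")
plt.show()
```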

In the initial distribution, we observe that classes 2, 3, and 4 have high purity, but they are not tightly clustered and overlap slightly at the boundaries. The relation instances of classes 1 and 5 heavily overlap in space. Analyzing the relation classes and their instances, class 1 describes the “located in” relation between an airport and the place it belongs to, and class 5 describes the “located in” relation between a regional locality and its city or country. These two classes are affected by factors such as relational semantics and entity types [10], and some of their relation instances are closely distributed in space.

From a global perspective, RICL achieves better separation of each class in space, resolves the problem of blurred boundaries, ensures overall consistency, and opens the possibility of further subdividing categories under the same class. While classes 2, 3, and 4 are aggregated, they are separated from the other classes as much as possible in space to ensure semantic consistency. When dealing with the overlap between classes 1 and 5, RICL locally aggregates the discretely distributed class-5 instances and separates them from class 1 while preserving relational consistency, thereby improving class purity as much as possible.

6 Conclusions

In this paper, we propose a novel self-supervised learning framework for open-domain relation extraction, namely RICL. It aims to enable the neural network to obtain better relation-oriented representation encodings and to better handle relational instances in the open domain in a self-supervised manner. We utilize instance distribution information and contrastive learning to promote better aggregation and relational representation, improving clustering accuracy and reducing error propagation, thus benefiting downstream classification. Moreover, we iteratively improve the robustness of the neural encoder by using pseudo-labels as self-supervised signals for relation classification. Our experiments show that RICL performs more efficient and accurate relation extraction on open-domain corpora than previous methods and constructs a representation space better suited for semantic tasks.