
1 Introduction

Relation extraction plays a vital role in natural language processing by identifying and extracting the relationships between entities mentioned in text. It has significant applications in various downstream tasks, including text mining [25], knowledge graphs [18] and social networking [21]. The problem is commonly approached as a supervised classification task. However, labeling large datasets of sentences for relation extraction is both time-consuming and expensive. The resulting scarcity of labeled training data hinders progress on this task. To address this challenge, researchers have turned to few-shot relation extraction (FSRE). FSRE is viewed as a paradigm aligned with human learning and is considered the most promising method for tackling this task. Existing studies [6, 19, 22, 23] have achieved remarkable results and surpassed human performance in general tests. Some researchers [2, 11], however, observe that most proposed models are ineffective in real applications and are investigating specific causes such as task difficulty. Most task-agnostic and task-specific models [16] struggle to handle FSRE tasks of varying difficulty adaptively. Figure 1 depicts an FSRE process augmented by relation instances. Owing to the high similarity of the relations and the limited number of training instances, existing methods struggle to identify the relation mother between the entity pair (Isabella Jagiellon, Bona Sforza) for the query instance.

Fig. 1.

A novel 5-way 1-shot relation extraction pipeline. First, use the relation set (left) and then enhance the relation representation with the support set (right). Finally, self-adapt to obtain true value for similar relationships in difficult tasks.

The FSRE task is commonly tackled through two prominent approaches: meta-learning based methods and metric-learning based methods [15, 32]. Meta-learning, with its emphasis on learning to learn, trains a task-agnostic model on various subtasks so that it can learn how to predict scores for novel classes. In contrast, metric-based approaches aim to discover a better metric for quantifying distribution discrepancies and discerning different categories. Among them, the prototype network [11, 27] is widely used and has demonstrated remarkable performance in FSRE scenarios. Compared with meta-learning based methods, metric-based methods in FSRE usually train task-specific models that adapt less well to new tasks. Conversely, meta-learning based methods may struggle to capture the discriminative features of relations between entity pairs during training, leading to underwhelming performance. This can be attributed primarily to the scarcity of training instances for each subtask in FSRE scenarios. As shown in Fig. 1, when only the common features from relation instances are used as the metric index, it is difficult to distinguish the pair (Isabella Jagiellon, Bona Sforza) in the query sentence among the five similar relations. Once the discriminative features of the support instances are introduced, the distinctions become more obvious, and the score of mother is higher than those of the other relations.

To address these challenges in few-shot relation extraction, we present a novel adaptive prototype network representation, called AdaProto, that incorporates both relation-meta features and various instance features. First, we introduce relation-meta learning to obtain the common features of the relations so that the model does not deviate excessively from its relational classes, which gives the model better adaptability. Second, we propose entity pair rectification and instance representation learning to obtain the discriminative features of the task, thus increasing the discriminability of the relations. Finally, we employ a multi-task training strategy to obtain a higher-quality model.

We summarize our main contributions in three aspects: 1) We present a relation-meta learning approach that extracts common features from relation instances, resulting in enhanced adaptability of the model. 2) Our model combines common features with discriminative features learned from the support set to make relations easier to distinguish during classification. 3) We evaluate the proposed model on the widely used FewRel datasets with the multi-task training strategy, and the experimental results confirm the model’s excellent performance.

Fig. 2.

The overall framework of AdaProto. The relation-meta learner exploits relation embeddings and relation meta to transfer the relative common features. To capture the discriminative features, \(E_{t}\) and \(X_{t}\) represent the entity pair rectification and instance embeddings, respectively. Finally, multi-task training adaptively fuses the above two features into different tasks.

2 Related Work

2.1 Few-Shot Relation Extraction

Relation extraction is a fundamental task in natural language processing (NLP) with a wide range of applications, including but not limited to text mining [25] and knowledge graphs [18]. Its primary objective is to identify the latent associations among entities explicitly mentioned within sentences. Few-shot relation extraction (FSRE) is a specialized setting in which models trained on a restricted set of annotated examples must predict novel relations. Han et al. [13] introduced FewRel, a large-scale benchmark designed specifically for FSRE. Building on this work, Gao et al. [9] presented FewRel 2.0, an extended version of the dataset that introduces real-world challenges, including domain adaptation and none-of-the-above detection. These additions make FewRel 2.0 a more realistic and effective benchmark for the comprehensive evaluation of FSRE techniques.

Meta-learning based and metric-learning based methods are the two predominant approaches to the FSRE task. Meta-learning, with its emphasis on learning to learn, trains a task-agnostic model that acquires knowledge and makes predictions for novel classes through exposure to diverse subtasks. Finn et al. [7] introduced the model-agnostic meta-learning (MAML) algorithm, which relies on gradient descent-based training and is therefore applicable to a broad spectrum of learning scenarios. Dong et al. [5] introduced a meta-information guided meta-learning framework that uses semantic notions of classes to enhance the initialization and adaptation stages of meta-learning, resulting in improved performance and robustness across various tasks and domains. Lee et al. [15] proposed a domain-agnostic meta-learning algorithm that targets the challenging problem of cross-domain few-shot classification and acts as an optimization strategy for improving the generalization ability of learning models. However, meta-learning based approaches can be hindered by the lack of sufficient training instances for each subtask, which makes it hard for them to capture the discriminative features of the relationships between entity pairs. This limitation can negatively impact the performance of the model on the FSRE task.

The primary objective of metric-based approaches is to identify a more effective metric for measuring the discrepancy between distributions, allowing better differentiation between categories. Among them, the prototypical network [29] is widely used as one of the metric-based approaches to few-shot learning and has demonstrated its efficacy on the FSRE task. In this approach, every example is encoded as a vector by the feature extractor. The prototype representation of each relation is obtained by averaging all the exemplar vectors belonging to that relation. To classify a query example, its representation is compared with the prototype representations of the candidate relations, and it is assigned a class by the nearest neighbor rule. Many previous works have built on prototypical networks to address different problems in the FSRE task. Ye and Ling [36] introduced a modified version of the prototypical network that incorporates multi-level matching and aggregation techniques. Gao et al. [8] proposed hybrid attention-based prototypical networks, which aim to mitigate the impact of noise and improve the overall performance of the system. Yang et al. [34] enhanced the prototypical network by incorporating relation and entity descriptions. Yang et al. [35] introduced the inherent concepts of entities to offer supplementary cues for relation extraction, thereby improving the overall effectiveness of relation prediction. Ren et al. [26] introduced a two-stage prototype network that incorporates prototype attention alignment and triple loss, aiming to enhance model performance in complex classification tasks. Han et al. [11] devised a method that leverages supervised contrastive learning and a task-adaptive focal loss, focusing specifically on hard tasks in the FSRE scenario. Brody et al. [2] conducted a comprehensive analysis of the effectiveness of robust few-shot models and investigated the reliance of FSRE models on entity type information. Despite this notable progress, the task adaptability of FSRE models is still under-explored. In this study, we propose an adaptive prototype network representation with relation-meta features and various instance features, which enhances the adaptability of the model and yields superior performance on relations with high similarity.
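The prototypical-network procedure described above (averaging the support vectors of each relation, then assigning a query to the nearest prototype) can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation of [29] or of AdaProto; the two-dimensional vectors stand in for encoder outputs, and the helper names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_prototypes(support):
    """support maps a relation name to its list of encoded support vectors."""
    return {rel: mean_vector(vecs) for rel, vecs in support.items()}

def classify(query, prototypes):
    """Nearest neighbor rule: pick the relation with the closest prototype."""
    return min(prototypes, key=lambda rel: euclidean(query, prototypes[rel]))

# Toy 2-way 2-shot episode; the vectors are made-up stand-ins for encodings.
support = {
    "mother": [[1.0, 0.0], [0.8, 0.2]],
    "spouse": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(support)
prediction = classify([0.9, 0.2], prototypes)
```

In this toy episode the query lies closest to the prototype of the first relation, so `prediction` is `"mother"`.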

2.2 Adaptability in Few-Shot Learning

Few-shot learning (FSL) is a prominent approach for training models to recognize new categories from a limited amount of labeled data. However, the small number of samples available in the N-way-K-shot setting of FSL makes it difficult for both task-agnostic and task-specific models to handle FSL tasks adaptively. Consequently, the performance of FSL models can vary substantially across environments, particularly for metric-based approaches, which typically train a task-specific model. One of the challenges in FSL is to enable models to adapt quickly to dynamic environments and previously unseen categories. Simon et al. [28] proposed a framework that incorporates dynamic classifiers constructed from a limited set of samples, which makes the few-shot learning model more robust against perturbations. Lai et al. [14] proposed a meta-learning approach that learns a task-adaptive classifier-predictor, enabling the generation of customized classifier weights for few-shot classification tasks. Xiao et al. [33] presented an adaptive mixture mechanism that enhances the generation of interactive class prototypes. They also introduced a loss function for joint representation learning, which adapts the encoding process to each support instance. Their approach is also built upon the prototypical framework. Han et al. [12] proposed an adaptive instance revaluing network to tackle the biased representation problem. Their method involves an enhanced bilinear instance representation and the integration of two original structural losses, resulting in more robust and accurate regulation of the instance revaluation process. Inspired by these works, we introduce relation-meta learning to obtain common features that are used to rectify the biased representation, so that the model generalizes better and adapts more readily.

3 Methodology

The overall architecture of the proposed AdaProto model is depicted in Fig. 2. The model is designed to extract relations between entity pairs in few-shot scenarios and consists of four modules: relation-meta learning, entity-pair rectification learning, instance representation learning, and multi-task joint training.

3.1 Relation-Meta Learning

To obtain relation-meta information from relation instances, we introduce a relation-meta learner, inspired by MetaR [3], that learns the relative common features and thereby maintains the task-specific weights for the relational classes. Unlike the previous method, our approach obtains the relation-meta directly from the relation set rather than from the entity pairs, and transfers the common features to the entity pairs to rectify the relation representations. First, we build a template, called the relation set, by combining each relation name and description as “name: description” \((n;d)\), and feed it into the encoder to produce the feature representation. To enhance the relation representation, we take the mean of the hidden states of the sentence tokens and concatenate it with the corresponding [CLS] token to represent the relation classes \(\{r_{i} \in \mathbb {R}^{2d};\ i=1,\ldots ,N\}\). Next, the relation-meta learner compares all N relation representations obtained from the encoder and generates a unique representation specific to each relational label in the task. Finally, to represent the common features, we design a nonlinear two-layer fully connected neural network and a mutual information mechanism, and explore the feature representations by training the relation-meta learner.

$$\begin{aligned} R_{m}=GELU(H_{r_{i}}W_{1} + b_{1})W_{2} + b_{2} \end{aligned}$$
(1)

where \(R_m\) is the relation-meta representation, \(H_{r_{i}}\) is the encoder output corresponding to \(r_{i}\), and \(W_1\), \(W_2\), \(b_1\), \(b_2\) are learnable parameters. The hidden embeddings are further processed by the two-layer fully connected network with the GELU(.) activation function. We assume that the ideal relation-meta \(R_m\) retains only the information specific to the \(i\text {-}th\) class \(r_{i}\) and is orthogonal to the other relations. Therefore, for all relational categories R, the classes should satisfy \(MI(R_{i}, R_{j})=0\) whenever \(i \ne j\), so that the relative discriminative representation of each relation is independent. To achieve this goal, we design a training strategy based on mutual information: minimizing the mutual information loss constrains the relative common feature representation of each relation to contain only the information specific to its relational class.

$$\begin{aligned} L_m=\ \sum _{1 \le i,j \le N,i \ne j} MI\left( R_i,R_j\right) \end{aligned}$$
(2)
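Equations (1) and (2) can be sketched as follows. This is an illustrative plain-Python version under stated assumptions, not the authors' code: the weights are toy values, and because the text does not specify how \(MI(R_i, R_j)\) is estimated, `squared_cosine` is a hypothetical stand-in dependence measure used only to show the pairwise structure of the loss.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """Affine map x W + b, with W given as a list of d_in rows of length d_out."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def relation_meta(h, W1, b1, W2, b2):
    """Eq. (1): R_m = GELU(h W1 + b1) W2 + b2."""
    hidden = [gelu(v) for v in linear(h, W1, b1)]
    return linear(hidden, W2, b2)

def squared_cosine(a, b):
    """Hypothetical stand-in for an MI estimate (zero for orthogonal vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return (dot / (na * nb)) ** 2 if na and nb else 0.0

def pairwise_loss(reps, dependence):
    """Eq. (2): sum the dependence measure over all ordered pairs with i != j."""
    n = len(reps)
    return sum(dependence(reps[i], reps[j])
               for i in range(n) for j in range(n) if i != j)

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
r_m = relation_meta([1.0, -1.0], identity, zeros, identity, zeros)
loss_orthogonal = pairwise_loss([[1.0, 0.0], [0.0, 1.0]], squared_cosine)
loss_identical = pairwise_loss([[1.0, 0.0], [1.0, 0.0]], squared_cosine)
```

With the stand-in measure, orthogonal relation-meta vectors incur no penalty while identical ones are penalized once per ordered pair, which mirrors the independence goal stated above.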

3.2 Entity-Pair Rectification Learning

To transfer the common features and obtain discriminative features, we follow TransE [1] and design a score function that evaluates the interaction between the entity pairs and the common features learned by the relation-meta learner. The function is formulated as the interaction between the head entity, the tail entity, and the relation, and is used to generate the rectified relation representation:

$$\begin{aligned} s\left( h_i,t_i\right) =\parallel R_m + h_i - t_i \parallel \end{aligned}$$
(3)

where \(R_m\) represents the relation-meta, \(h_i\) is the head entity embedding, \(t_i\) is the tail entity embedding, and \(\parallel \cdot \parallel \) denotes the \(L_2\) norm. For the relational classes in the given task, the entity-pair rectification learner evaluates the relevance of each instance in the support set. To evaluate the entity pairs efficiently, inspired by MetaR [3], we design a loss function that updates the learned relation-meta \(R_m\) using the scores of the entity pairs and the common features:

$$\begin{aligned} L_s(S_r)= \sum _{(h_i,t_i) \in S_r} \left( \lambda + s\left( h_i,t_i\right) -s\left( h_i,t'_i\right) \right) \end{aligned}$$
(4)

where \(\lambda \) is the margin hyperparameter, e.g., \(\lambda =1\), \(t'_i\) is a negative example for \(t_i\), and \(s(h_i, t'_i)\) is the score of the negative instance corresponding to the current positive entity pair \((h_i,t_i) \in S_r\). The rectified relations \(E_r\) are then given by:

$$\begin{aligned} E_r= R_m - \nabla _{R_m}L_s(S_r) \end{aligned}$$
(5)

where \(\nabla _{R_m}\) denotes the gradient with respect to \(R_m\).
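The score, loss, and rectification step of Eqs. (3) to (5) can be worked through on a toy example. The sketch below is plain Python with an analytic gradient; it follows the equations as written (no hinge, unit step size), while the vectors and the helper names are illustrative assumptions rather than the authors' implementation.

```python
import math

def score(r, h, t):
    """Eq. (3): L2 norm of r + h - t."""
    return math.sqrt(sum((ri + hi - ti) ** 2 for ri, hi, ti in zip(r, h, t)))

def score_grad(r, h, t):
    """Gradient of the score with respect to r: (r + h - t) / ||r + h - t||."""
    u = [ri + hi - ti for ri, hi, ti in zip(r, h, t)]
    norm = math.sqrt(sum(x * x for x in u))
    if norm == 0.0:
        return [0.0] * len(u)  # subgradient choice at the non-differentiable point
    return [x / norm for x in u]

def rectification_loss(r, triples, margin=1.0):
    """Eq. (4): sum of margin + s(h, t) - s(h, t') over (h, t, t') triples."""
    return sum(margin + score(r, h, t) - score(r, h, t_neg)
               for h, t, t_neg in triples)

def rectify(r, triples):
    """Eq. (5): E_r = R_m - gradient of L_s with respect to R_m."""
    grad = [0.0] * len(r)
    for h, t, t_neg in triples:
        g_pos = score_grad(r, h, t)
        g_neg = score_grad(r, h, t_neg)
        grad = [g + p - n for g, p, n in zip(grad, g_pos, g_neg)]
    return [ri - gi for ri, gi in zip(r, grad)]

# One support pair (h, t) with a corrupted tail t'; all values are made up.
r_m = [0.0, 0.0]
triples = [([0.0, 0.0], [1.0, 0.0], [0.0, 1.0])]
loss_before = rectification_loss(r_m, triples)
e_r = rectify(r_m, triples)
loss_after = rectification_loss(e_r, triples)
```

On this example the rectified vector moves away from the corrupted tail, so the loss evaluated at `e_r` is lower than at `r_m`.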

3.3 Instance Representation Learning

Textual context is the main source of evidence for relation classification, but the instances in a randomly sampled set vary considerably, which leads to discrepancies in relation features, especially for tasks with similar relations. Because keywords play a vital role in discriminating between relations in a sentence, we design an attention mechanism that distinguishes relations on the basis of the instance representations. We assign different weights to the tokens according to the similarity between the instances and the relations. The \(k \text {-}th\) relational feature representation \({R_i^k}\) is obtained by computing the similarity between each instance embedding \(s_k^i\) and the relation embedding \(r_i\).

$$\begin{aligned} R_i^k=\sum _{n=1}^{l_r} \alpha _r^n r_i^n \end{aligned}$$
(6)

where \(k=1,\ldots ,K\) indexes the instances, \(l_r\) is the length of \(r_i\), and n denotes the \(n\text {-}th\) row of the matrices \(r_i\) and \(s_k\). The attention weights are computed as follows:

$$\begin{aligned} \alpha _r^n=\text {softmax}(\text {sum}(r_i(s_k^n)^T)) \end{aligned}$$
(7)

The discriminative feature is obtained by averaging the K relational features:

$$\begin{aligned} X_r=\frac{1}{K}\sum _{k=1}^{K}R_i^k \end{aligned}$$
(8)
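One possible reading of Eqs. (6) to (8) is sketched below in plain Python. It assumes that \(r_i\) is an \(l_r \times d\) matrix of relation token embeddings, that each instance \(s_k\) is a matrix of token embeddings, that "sum" aggregates the similarity of a relation token over all instance tokens, and that the softmax runs over the relation tokens. These are interpretive assumptions, and the toy embeddings are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relation_view(r, s):
    """Eqs. (6)-(7): weight relation tokens by their similarity to instance s."""
    # One logit per relation token: its dot products summed over instance tokens.
    logits = [sum(dot(r_row, s_row) for s_row in s) for r_row in r]
    alpha = softmax(logits)
    dim = len(r[0])
    return [sum(alpha[n] * r[n][j] for n in range(len(r))) for j in range(dim)]

def discriminative_feature(r, instances):
    """Eq. (8): average the K instance-specific relation views."""
    views = [relation_view(r, s) for s in instances]
    k = len(views)
    return [sum(col) / k for col in zip(*views)]

relation_tokens = [[1.0, 0.0], [0.0, 1.0]]   # l_r = 2, d = 2
support_instances = [[[1.0, 0.0]]]           # K = 1 instance with one token
x_r = discriminative_feature(relation_tokens, support_instances)
```

Here the instance token matches the first relation token, so that token receives the larger attention weight and dominates `x_r`.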

3.4 Multi-task Joint Training

We concatenate the common and discriminative features to form the hybrid prototype representation by relation-meta \(R_t\), entity-pair rectification \(E_t\), and instance representation \(X_t\).

$$\begin{aligned} C_j=[R_t^j;E_t^j;X_t^j] \end{aligned}$$
(9)

where t denotes the task, \(R_t^j\) represents the \(j\text {-}th\) relation from the relation-meta learner, \(E_t^j\) represents the updated \(j\text {-}th\) relation from the entity-pair rectification learner, and \(X_t^j\) represents the contextualized \(j\text {-}th\) relation from the instance learner at the task level. The model then computes the classification probability of the query Q from the query and the prototype representations of the N relations.

$$\begin{aligned} P(y=j| Q,S)=\exp (d(C_j,\ Q))/\sum _{k=1}^{N}\exp (d(C_k,Q)) \end{aligned}$$
(10)

where N is the number of classes and d(.,.) is the Euclidean distance. The training objective is to minimize the cross-entropy loss, i.e., the negative log-likelihood of the labeled instances:

$$\begin{aligned} L_c\left( Q,S\right) =-\text {log}P\left( y=t|Q,S\right) \end{aligned}$$
(11)

where t denotes the ground-truth relation label. Each task contains only a few instances, and the tasks vary in difficulty, so training with cross-entropy alone rarely yields satisfactory results. To tackle this issue, we employ a multi-task joint training strategy. Following previous work [4], we use the mutual information loss and relax its constraints because of its complexity.
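Equations (9) to (11) can be illustrated with the sketch below. It is plain Python with toy vectors, and it makes one assumption that should be stated loudly: the softmax is taken over negative Euclidean distances, the usual prototypical-network convention, so that a closer prototype receives a higher probability. The query is also assumed to live in the same concatenated space as the prototypes.

```python
import math

def hybrid_prototype(r_meta, e_rect, x_inst):
    """Eq. (9): concatenate the three feature vectors of one relation."""
    return list(r_meta) + list(e_rect) + list(x_inst)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def class_probabilities(query, prototypes):
    """Eq. (10), assuming the score of class j is -d(C_j, Q)."""
    logits = [-euclidean(c, query) for c in prototypes]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Eq. (11): negative log-likelihood of the ground-truth relation."""
    return -math.log(probs[target])

# Two toy relations; each feature part is a made-up 2-dimensional vector.
c0 = hybrid_prototype([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
c1 = hybrid_prototype([0.0, 1.0], [0.0, 1.0], [0.0, 1.0])
query = list(c0)
probs = class_probabilities(query, [c0, c1])
loss_correct = cross_entropy(probs, 0)
loss_wrong = cross_entropy(probs, 1)
```

Because the query coincides with the first prototype, the first relation receives the higher probability and the lower loss.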

Next, to fully exploit the interaction between the relation-meta and the entity pairs, we introduce the rectification loss. Furthermore, we replace \(L_c\) with the focal loss \(L_f\) [17] to balance tasks of different difficulty. The focal loss is defined as:

$$\begin{aligned} L_f\left( Q,S\right) =-\left( 1-P\left( y=t|Q,S\right) \right) ^\gamma \text {log}P\left( y=t|Q,S\right) \end{aligned}$$
(12)

where the factor \(\gamma \ge 0\) regulates the rate at which easy examples are down-weighted. Finally, the training loss is constructed as follows:

$$\begin{aligned} L=L_f+\alpha L_m+\beta L_s \end{aligned}$$
(13)

where the hyperparameters \(\alpha \) and \(\beta \) control the relative importance of \(L_m\) and \(L_s\), respectively; e.g., \(\beta =0\) means that no rectified relations are used. Following previous similar work [30], we simply set \(\alpha =0.7\) and \(\beta =0.3\), which yielded better performance; other settings can be explored in the future.
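The focal loss of Eq. (12) and the weighted sum of Eq. (13) are simple enough to check numerically. The sketch below is plain Python; the default \(\gamma = 2\) is a hypothetical value chosen for illustration, since the text does not report the value used, while \(\alpha = 0.7\) and \(\beta = 0.3\) are the settings stated above.

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Eq. (12): -(1 - p)^gamma * log(p), with p the probability of the true label.

    gamma=2.0 is an illustrative default, not a value taken from the paper.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

def total_loss(l_f, l_m, l_s, alpha=0.7, beta=0.3):
    """Eq. (13): L = L_f + alpha * L_m + beta * L_s."""
    return l_f + alpha * l_m + beta * l_s

easy = focal_loss(0.9)          # confident, correct prediction
hard = focal_loss(0.2)          # low probability on the true label
plain_ce = focal_loss(0.9, gamma=0.0)
combined = total_loss(l_f=hard, l_m=0.5, l_s=0.5)
```

With \(\gamma = 0\) the focal loss reduces to the cross-entropy of Eq. (11), and for \(\gamma > 0\) the easy example contributes far less than the hard one, which is the balancing effect described above.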

4 Experiments

4.1 Dataset and Evaluation Metrics

Datasets. Our AdaProto model is evaluated on the FewRel 1.0 [13] and FewRel 2.0 [9] datasets, which are publicly available, large-scale datasets for the FSRE task. These datasets contain 100 relations, each consisting of 700 labeled instances derived from Wikipedia. To ensure fairness, we follow the official benchmarks and divide the 100 relations into three distinct subsets with a ratio of 64:16:20 for training, validation and testing, respectively. The relation set is typically provided in the auxiliary dataset. Although the labels of the test set are not public, we can still assess our model’s performance by submitting our predictions and using the official test script to obtain test scores.

Evaluation. Evaluation on the FewRel dataset is usually simulated with the N-way-K-shot (N-w-K-s) setting. Under the typical N-w-K-s setting, each evaluation episode samples N relations, each with K labeled instances, along with additional query instances. The models must classify the query instances into the N sampled relations on the basis of the \(N\) \(\times \) \(K\) labeled instances. Accuracy is employed as the performance metric in the N-w-K-s scenario; it measures the proportion of query instances that are predicted correctly. Following previous baselines, we select N values of 5 and 10 and K values of 1 and 5, resulting in four distinct scenarios.
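The N-way-K-shot protocol described above can be sketched as an episode sampler plus an accuracy function. This is a minimal plain-Python illustration on a made-up corpus; the function names, the number of query instances per relation, and the toy data are assumptions and do not reproduce the official FewRel evaluation script.

```python
import random

def sample_episode(data, n_way, k_shot, n_query, rng):
    """Sample N relations, K support and n_query query instances per relation.

    data maps a relation name to its list of instances.
    """
    relations = rng.sample(sorted(data), n_way)
    support, query = {}, []
    for label, rel in enumerate(relations):
        picked = rng.sample(data[rel], k_shot + n_query)
        support[label] = picked[:k_shot]
        query.extend((inst, label) for inst in picked[k_shot:])
    return relations, support, query

def accuracy(predictions, labels):
    """Proportion of query instances whose predicted label is correct."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy corpus: 8 relations with 10 placeholder sentences each.
corpus = {f"rel{i}": [f"rel{i}_sent{j}" for j in range(10)] for i in range(8)}
rng = random.Random(0)
relations, support, query = sample_episode(corpus, n_way=5, k_shot=1,
                                           n_query=2, rng=rng)
toy_accuracy = accuracy([0, 1, 1], [0, 1, 0])
```

A 5-way 1-shot episode therefore contains five support instances in total, and the reported accuracy is averaged over the query instances of many such episodes.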

4.2 Implementation Details

To capture the contextual information of each entity and learn a robust representation for instances, we follow Han et al. [11] and use the uncased \(BERT_{base}\) as the encoder, which is a 12-layer Transformer with a hidden size of 768. To generate the representation of each instance, we merge the hidden states of the start tokens of both entity mentions. To enhance the representation of each entity pair, we further incorporate their word embedding, entity id embedding, and entity type embedding. In addition, we use the label name of each relation and its corresponding description to generate a template. To optimize our model, we use the AdamW optimizer [20] with a learning rate of 0.00002. Our model is implemented in PyTorch and deployed on a server equipped with two NVIDIA RTX A6000 GPUs.

4.3 Comparison to Baselines

We assess the performance of our model by comparing it to a set of strong baseline models:

Proto: Snell et al. [29] proposed the original prototype network algorithm.

GNN: Garcia et al. [10] introduced a meta-learning method based on graph neural networks.

MLMAN: Ye and Ling [36] introduced a modified version of the prototypical network that incorporates multi-level matching and aggregation.

REGRAB: Qu et al. [24] proposed a Bayesian meta-learning method for relation extraction that incorporates global relational descriptions.

BERT-PAIR: Gao et al. [9] proposed a novel approach to measure the similarity between pairs of sentences.

ConceptFERE: Yang et al. [35] put forward a methodology that incorporates the inherent concepts of entities, introducing additional cues derived from the entities involved, to enhance relation prediction and elevate the overall effectiveness of relation classification.

DRK: Wang et al. [31] proposed an innovative method for knowledge extraction based on discriminative rules.

HCRP: Han et al. [11] designed a hybrid prototype representation learning method based on contrastive learning, considering task difficulty.

Most existing models use BERT as the encoder, and we follow this convention.

4.4 Main Results

Overall Results. We compare our proposed AdaProto model against a set of strong baselines on the FewRel dataset, and the results are presented in Tables 1 and 2. As shown in Table 1, our model improves prediction accuracy and outperforms the strong baselines in every setting on FewRel 1.0, demonstrating better generalization ability. In addition, the performance gains over the second-best method (i.e., HCRP) are larger in the 5-shot settings than in the 1-shot settings. This observation may suggest that having only one instance per relation class restricts the extraction of discriminative features, and that the semantic features of a single instance are more prone to deviating from the common features shared by its relational class. We attribute the performance mainly to three factors. 1) Relation-meta learning obtains common features while adapting to different tasks. 2) Entity pair rectification and instance representation learning equip the prototype network with discriminative features. 3) Joint training also plays an important role.

Table 1. Main results from the validation and testing of the FewRel 1.0. The evaluation metric used in this study is accuracy(%).
Table 2. The performance on the domain adaptation test set of FewRel 2.0.

Performance on FewRel 2.0 Dataset. To assess the adaptability of our proposed model, we performed cross-domain experiments on the FewRel 2.0 dataset. The results in Table 2 illustrate the strong domain adaptability of our model. In particular, our model obtains better results in the 5-shot settings, probably because the local features of the task play a dominant role compared with the common features.

Table 3. The performance on random, easy, and hard task.

Performance on Hard Tasks in Few-Shot Scenarios. To showcase the adaptability and efficacy of our model on difficult tasks, we use the setup of Han et al. [11] to assess the overall capabilities of the models on the validation set of the FewRel 1.0 dataset. We consider three distinct 3-way-1-shot scenarios and report the main results in Table 3. Easy denotes tasks with easily distinguishable relations, Random comprises 10,000 tasks randomly sampled from the validation relations, and Hard denotes tasks with similar relations. We choose Mother, Child and Spouse as the three difficult relations because they not only have similar relation descriptions but also share the same entity types. As shown in Table 3, the performance degrades sharply on the difficult tasks, which indicates that hard tasks remain a great challenge and also illustrates the significant advantage of our AdaProto model.

4.5 Ablation Study

To assess the efficacy of the various modules in our AdaProto model, we conducted an ablation study where we disabled each module individually and assessed the impact on the model’s overall performance. Table 4 illustrates the outcomes of the ablation study, showcasing the performance decline observed for each module when disabled.

Table 4. Ablation study of the AdaProto model on FewRel 2.0.

Experiments were performed on the validation set of the FewRel 2.0 dataset for domain adaptation. Table 4 presents the detailed experimental results, and its second part reports the results after removing individual modules. Every module affects performance, especially the relation-meta learner. For the common features, we remove the relation-meta learner, which reduces the model to discriminative features without relation descriptions; the performance degrades substantially, indicating that common features contribute more to the FSRE task. For the discriminative features, we remove the entity pair rectification and the instance representation in turn, and the results demonstrate that textual context is still the main source of the discriminative features. In addition, removing entity pair rectification also leads to a significant decrease in performance.

5 Conclusion

In this study, we propose AdaProto, an adaptive prototype network representation that incorporates relation-meta features and various instance features to address the challenges in FSRE tasks. Adaptability to a task relies heavily on the effectiveness of the embedding space. To enhance the model’s generalization, we introduce relation-meta learning to learn common features and better capture the class-centered prototype representation. Additionally, we propose entity pair rectification and instance representation learning to identify discriminative features that differentiate similar relations. Our approach also uses a multi-task training strategy to obtain a high-quality model. Furthermore, the model can adapt to different tasks by leveraging multiple feature extractors. Through comparative studies on FewRel in various few-shot settings, we demonstrate that AdaProto, along with each of its components, improves both the classification performance and the robustness of FSRE models.