1 Introduction

In the modern age, the amount of data being generated has grown rapidly, and most of it is stored in electronic form. The most common form is text, in which information is stored in an unstructured manner. For this data to be useful, we should be able to retrieve the most important information from such text seamlessly.

In this work, we propose a deep-learning-based approach to extract relational triples from text. Relational triples are entities in a sentence that are related in the form subject - predicate - object. These relational triples can then be used for knowledge engineering applications.

The relational triples are extracted from unstructured text using a DistilBERT [1, 2] based transformer language model. One of the major highlights of our transformer-based model is that it is able to capture dependencies in long sentences. Our model is also capable of extracting triples from sentences with overlapping entities, i.e., cases where triples share the same entities or relations. This scenario is explained in detail in the following section. The final important aspect of our model is joint entity-relation extraction: in earlier models, entities and relations were learned separately in a pipelined manner, which resulted in error propagation from one stage to the next.

1.1 Overlapping Entities

Earlier models for this task were not able to handle sentences with overlapping entities, a scenario where entities are shared by multiple triples in the same sentence. This can be categorized into two main types: Single Entity Overlap (SEO) and Entity Pair Overlap (EPO). Single Entity Overlap occurs when multiple triples share the same entity as subject or object. Entity Pair Overlap occurs when multiple triples share the same entity pair. A visual representation of these scenarios is given in Fig. 1.

Fig. 1. Categories of triples in a sentence. Subject, Predicate and Object suffixes are added to the entities

1.2 Transformers

Most of the current state-of-the-art models in the Natural Language Processing domain use transformer-based language models. Transformers are a type of deep learning model [3] that uses an attention mechanism to find global dependencies between input and output. The transformer processes a sentence in a non-sequential manner, which allows it to process the sentence as a whole rather than word by word. These features were not present in earlier deep neural network-based models for the same task. However, most state-of-the-art transformers have a very large number of layers and parameters.

In our work, we mainly use the encoder part of the transformer for language modelling. Our model can be summarized as follows. First, the raw textual input is converted into tokens using a tokenizer. These tokens, along with attention masks, are fed into the encoder module to obtain embeddings. Since we use a DistilBERT-based model, the output embeddings are contextual in nature. These embeddings, along with the embeddings of the triplet labels, are fed into the model for training. The loss is computed with respect to the head and tail positions of the subject and object in the sentence. The final output is the head and tail positions of the subject and object in the sentence, along with the relation.
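
As a minimal sketch of the tokenization and encoding steps described above, the snippet below uses the Hugging Face transformers library; the checkpoint name and variables are illustrative and do not reproduce our exact training code:

```python
# Minimal sketch: tokenize a sentence and obtain contextual token embeddings
# from a pre-trained DistilBERT encoder (checkpoint name is an assumption).
import torch
from transformers import DistilBertTokenizerFast, DistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-cased")
encoder = DistilBertModel.from_pretrained("distilbert-base-cased")

sentence = "Barack Obama was born in Hawaii."
inputs = tokenizer(sentence, return_tensors="pt")   # token ids + attention mask
with torch.no_grad():
    outputs = encoder(**inputs)

# Contextual embedding for every token: shape (1, num_tokens, 768).
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```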

2 Related Works

In the Information Extraction and Relation Extraction domain, one of the earlier notable works is [4], which extracts features using Support Vector Machines. Later, [5] approached the problem with a two-step solution: first finding all entities using Named Entity Recognition (NER) and then classifying all the extracted entity pairs using relation classification (RC). These pipeline-based approaches, however, suffered from the error propagation problem. To address this issue, joint models [6] have been proposed which learn entities and relations together. The earlier works, however, did not address the problem of overlapping entities in a sentence, i.e. multiple triples in the same sentence sharing the same entities. This problem was only recently addressed using deep neural network-based models in the work of [7], which is based on sequence-to-sequence learning with a copy mechanism using a Bi-directional LSTM. Later, the evaluation scores were improved by [8] using Graph Convolutional Networks and Bi-LSTMs. The recent works of [9] and [10] further improve the evaluation scores using BERT-based transformer language models. Other recent works involving the usage of transformers in knowledge extraction include [11,12,13].

3 Dataset

For training and testing of our relation extraction framework, we use two public datasets, the New York Times (NYT) dataset and the WebNLG dataset. The original NYT dataset [14] was created with a distant supervision approach, and the WebNLG dataset [15] was created for Natural Language Generation. These datasets have been modified as per the requirements of [7]. The resulting NYT dataset consists of 24 relation classes, 56195 training samples, 5000 validation samples and 5000 test samples. The WebNLG dataset consists of 5019 training samples, 500 validation samples and 703 test samples. Detailed information is given in Table 1.

The full training set of each dataset is used for training; testing, however, is done on individual subsets. The test data can be classified into three types: Normal, Entity Pair Overlap and Single Entity Overlap. The test data can be further classified based on the number of relational triples in a single sentence. All the test data without categorization is marked as Main. Tabulated information on the datasets is given in Table 2. In Table 2, for the rows 'Triple-i', i denotes the number of triples in a single sentence.
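
For illustration, the sketch below (not part of the released code) assigns a sentence's gold triples to one of these categories following the definitions in Sect. 1.1; when a sentence would qualify as both EPO and SEO, this simple rule reports EPO:

```python
# Illustrative helper for splitting test sentences into Normal / EPO / SEO.
def overlap_category(triples):
    """triples: list of (subject, relation, object) tuples from one sentence."""
    pairs = [(s, o) for s, _, o in triples]
    entities = [e for pair in pairs for e in pair]
    if len(set(pairs)) < len(pairs):          # same (subject, object) pair reused
        return "EPO"
    if len(set(entities)) < len(entities):    # a single entity reused across triples
        return "SEO"
    return "Normal"

print(overlap_category([("Obama", "born_in", "Hawaii"),
                        ("Obama", "president_of", "USA")]))  # -> SEO
```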

Table 1. Dataset information
Table 2. Categorization of testing data

4 Model Architecture

For the relational extraction model, we followed the work of [9, 10], which was implemented using a BERT-based encoder and a Graph Neural Network. We optimized the size of this model by using a DistilBERT-based transformer framework and by dropping the Graph Network layer of the baseline, since the accuracy gains from the Graph Neural Network were negligible in our experiments when considering the number of trainable parameters it added. This allowed us to significantly reduce the number of trainable parameters, and thus the model training time, without compromising much accuracy.

Our relation extraction framework (Fig. 2) consists of two parts: first, encoding the words of the input sentence into vector embeddings and encoding each relation into a vector; second, relational triple extraction based on subject and object taggers.

The problem can be formulated as follows. Given a sentence x and the set of all triplets (s, r, o) in the training set T, our goal is to maximize the data likelihood of the training set. This can be defined mathematically as shown in Eq. 1:

Fig. 2. Architecture of our Relation Extraction model

$$ \begin{aligned} & \prod\limits_{(s,r,o) \in T} p((s,r,o)|x) \\ & = \prod\limits_{s \in T} p(s|x)\prod\limits_{(r,o) \in T{\mid}s} p((r,o)|x,s) \\ & = \prod\limits_{s \in T} p(s|x)\prod\limits_{r \in T{\mid}s} p(o|x,s,r)\prod\limits_{r \in R\backslash T{\mid}s} p\left( o_{\emptyset} |x,s,r \right) \\ \end{aligned} $$
(1)

where T | s is the set of triplets in T with s as the subject. Similarly, (r,o) ∈ T | s is the set of all relation-object pairs in the triplets of T with subject s. R is the set of all relations, and R\T | s denotes all relations except those appearing in T | s. \(o_{\emptyset}\) represents a null object, meaning that relations not in T | s have no corresponding object.

First, for a given input sentence, a pre-trained DistilBERT encoder is used to obtain a token representation for each word, and for each predefined relation an embedding is created, as shown in Eq. 2.

$$ \begin{aligned} \left[ h_{1}, h_{2}, \ldots, h_{n} \right] & = E_{D}\left( \left[ w_{1}, w_{2}, \ldots, w_{n} \right] \right) \\ \left[ p_{1}, p_{2}, \ldots, p_{m} \right] & = W_{r} E\left( \left[ r_{1}, r_{2}, \ldots, r_{m} \right] \right) + b_{r} \\ \end{aligned} $$
(2)

where \(w_i\) is a word from the input sentence and \(h_i\) is the corresponding output token representation from the DistilBERT encoder \(E_D\). Similarly, \(p_i\) is the output after the relation embedding matrix \(E\) embeds the predefined relation \(r_i\). \(W_r\) and \(b_r\) are trainable parameters.
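
A minimal sketch of the relation-embedding part of Eq. 2 is shown below; the hidden size of 768 corresponds to DistilBERT, the 24 relations correspond to the modified NYT dataset, and all variable names are ours:

```python
# Sketch of p_1..p_m in Eq. 2: each predefined relation id is embedded by E and
# projected with trainable parameters W_r, b_r (h_1..h_n come from DistilBERT).
import torch
import torch.nn as nn

hidden_size, num_relations = 768, 24

relation_embedding = nn.Embedding(num_relations, hidden_size)   # E
relation_proj = nn.Linear(hidden_size, hidden_size)             # W_r, b_r

relation_ids = torch.arange(num_relations)
p = relation_proj(relation_embedding(relation_ids))             # shape (m, hidden_size)
```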

For relation extraction, subject taggers and object taggers are used. The subject tagger, defined in Eq. 4, identifies all possible subjects among the word tokens. More specifically, it tags the head and tail positions of the subject using the sigmoid function defined in Eq. 3.

$$ \sigma (x) = \frac{1}{{1 + e^{ - x} }} $$
(3)

The sigmoid function maps values to the range between 0 and 1.

$$ \begin{aligned} P_{i}^{s\_head} & = \sigma \left( W_{s\_head} {\text{Tanh}}\left( h_{i}^{o} \right) + b_{s\_head} \right) \\ P_{i}^{s\_tail} & = \sigma \left( W_{s\_tail} {\text{Tanh}}\left( h_{i}^{o} \right) + b_{s\_tail} \right) \\ \end{aligned} $$
(4)

where \(P_{i}^{s\_head}\) and \(P_{i}^{s\_tail}\) are the probabilities of identifying the ith word as the head and tail position of the subject respectively, calculated by the sigmoid function σ. The values \(W_{s\_head}, W_{s\_tail}, b_{s\_head}, b_{s\_tail}\) are trainable weights, and \(h_{i}^{o}\) is the encoded representation of the word from the previous stage.
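
A minimal PyTorch sketch of the subject tagger in Eq. 4 is shown below; the module and layer names are illustrative and the hidden size of 768 corresponds to DistilBERT:

```python
# Sketch of the subject tagger (Eq. 4): two sigmoid-activated linear layers
# predict, for every token, the probability of being a subject head or tail.
import torch
import torch.nn as nn

class SubjectTagger(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        self.head_linear = nn.Linear(hidden_size, 1)   # W_s_head, b_s_head
        self.tail_linear = nn.Linear(hidden_size, 1)   # W_s_tail, b_s_tail

    def forward(self, h):                              # h: (batch, seq_len, hidden)
        h = torch.tanh(h)
        p_head = torch.sigmoid(self.head_linear(h)).squeeze(-1)   # P_i^{s_head}
        p_tail = torch.sigmoid(self.tail_linear(h)).squeeze(-1)   # P_i^{s_tail}
        return p_head, p_tail
```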

Similarly, the object tagger, defined in Eq. 5, uses an encoded word token representation that differs from the one used by the subject tagger.

$$ \begin{aligned} P_{i}^{o\_head} & = \sigma \left( W_{o\_head} \overline{h}_{ijk} + b_{o\_head} \right) \\ P_{i}^{o\_tail} & = \sigma \left( W_{o\_tail} \overline{h}_{ijk} + b_{o\_tail} \right) \\ \end{aligned} $$
(5)

where \(P_{i}^{o\_head}\) and \(P_{i}^{o\_tail}\) are the probabilities of identifying the ith word as the head and tail position of the object respectively, calculated by the sigmoid function σ. The values \(W_{o\_head}, W_{o\_tail}, b_{o\_head}, b_{o\_tail}\) are trainable weights. The term \(\overline{h}_{ijk}\) is the encoded word token representation, which is defined as

$$ \overline{h}_{ijk} = {\text{Tanh}} \left( {W_{h} \left[ {s_{k} ;p_{j}^{0} ;h_{i}^{0} } \right] + b_{h} } \right) $$
(6)

where \(s_k\) is the representation of the kth candidate subject, and \(p_{j}^{0}\) and \(h_{i}^{0}\) are the encoded representations of the pre-defined relation and the word token respectively.
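
For illustration, a minimal sketch of the object tagger of Eqs. 5 and 6 is given below; the fusion of the candidate subject representation, relation embedding and token embedding follows Eq. 6, while the class and layer names are our own:

```python
# Sketch of the object tagger (Eqs. 5-6): fuse subject, relation and token
# representations, then predict object head/tail probabilities per token.
import torch
import torch.nn as nn

class ObjectTagger(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        self.fuse = nn.Linear(3 * hidden_size, hidden_size)    # W_h, b_h in Eq. 6
        self.head_linear = nn.Linear(hidden_size, 1)            # W_o_head, b_o_head
        self.tail_linear = nn.Linear(hidden_size, 1)            # W_o_tail, b_o_tail

    def forward(self, h_i, p_j, s_k):
        # h_i: (batch, seq_len, hidden); p_j, s_k: (batch, hidden)
        seq_len = h_i.size(1)
        p = p_j.unsqueeze(1).expand(-1, seq_len, -1)
        s = s_k.unsqueeze(1).expand(-1, seq_len, -1)
        h_bar = torch.tanh(self.fuse(torch.cat([s, p, h_i], dim=-1)))   # Eq. 6
        p_head = torch.sigmoid(self.head_linear(h_bar)).squeeze(-1)     # Eq. 5
        p_tail = torch.sigmoid(self.tail_linear(h_bar)).squeeze(-1)
        return p_head, p_tail
```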

Therefore, in line with Eq. 1, we can define the subject tagger and object tagger as Eqs. 7 and 8 respectively:

$$ p_{\theta_{s}}(s|x) = \prod\limits_{t \in \{ s\_head,\, s\_tail \}} \prod\limits_{i = 1}^{N} \left( P_{i}^{t} \right)^{{\text{I}}\{ y_{i}^{t} = 1 \}} \left( 1 - P_{i}^{t} \right)^{{\text{I}}\{ y_{i}^{t} = 0 \}} $$
(7)
$$ p_{\theta_{o}}(o|x,s,r) = \prod\limits_{t \in \{ o\_head,\, o\_tail \}} \prod\limits_{i = 1}^{N} \left( P_{i}^{t} \right)^{{\text{I}}\{ y_{i}^{t} = 1 \}} \left( 1 - P_{i}^{t} \right)^{{\text{I}}\{ y_{i}^{t} = 0 \}} $$
(8)

where θs and θo are the parameters of the subject tagger and object tagger respectively, and I{z} = 1 if z is true and 0 otherwise. \(y_{i}^{s\_head}, y_{i}^{s\_tail}\) and \(y_{i}^{o\_head}, y_{i}^{o\_tail}\) are the binary tags of the subject's and object's head and tail respectively for the ith word in x. For the null object \(o_{\emptyset}\) in Eq. 1, \(y_{i}^{o_{\emptyset}\_head} = y_{i}^{o_{\emptyset}\_tail} = 0\) for all i.

Taking the logarithm of Eq. 1, we get the objective function defined in Eq. 9:

$$ \begin{aligned} L = & \log \prod\limits_{(s,r,o) \in T} p((s,r,o)|x) \\ = & \sum\limits_{s \in T} \log p_{\theta_{s}}(s|x) + \sum\limits_{r \in T{\mid}s} \log p_{\theta_{o}}(o|x,s,r) + \sum\limits_{r \in R\backslash T{\mid}s} \log p_{\theta_{o}}\left( o_{\emptyset} |x,s,r \right) \\ \end{aligned} $$
(9)

The log-likelihood function is then maximized using Stochastic Gradient Descent during training. The learning rate is set to 0.1 for both datasets.
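
In practice, maximizing this log-likelihood amounts to minimizing a binary cross-entropy loss over the head/tail tag sequences. A hedged sketch of such a training step is shown below; the model, data loading and tag construction are omitted, and all names are illustrative:

```python
# Sketch of the training objective implied by Eqs. 7-9: binary cross-entropy
# over predicted head/tail probabilities against the binary gold tags.
import torch
import torch.nn.functional as F

def tagging_loss(p_head, p_tail, y_head, y_tail):
    # p_*: predicted probabilities, shape (batch, seq_len); y_*: binary gold tags.
    return (F.binary_cross_entropy(p_head, y_head.float())
            + F.binary_cross_entropy(p_tail, y_tail.float()))

# Typical update with the SGD optimizer and learning rate described above:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# loss = tagging_loss(sub_head_p, sub_tail_p, sub_head_y, sub_tail_y) \
#        + tagging_loss(obj_head_p, obj_tail_p, obj_head_y, obj_tail_y)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```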

5 Evaluation Metrics

We used precision, recall and F1-score as evaluation metrics, following the baseline approach. A triplet is considered correct only if its predicate and its corresponding subject and object are all correct. Additionally, we also used the number of trainable parameters in the transformer model for comparison, as it helps us assess the efficiency of the model with respect to neural network size and gives an idea of the model training time.
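
A minimal sketch of this exact-match evaluation is given below; it assumes the predicted and gold triples are provided as (subject, relation, object) tuples:

```python
# Illustrative computation of precision, recall and F1 under exact triple matching:
# a predicted triple counts only if subject, relation and object all match a gold triple.
def triple_prf(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```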

6 Implementation Details and Results

The model is implemented with the PyTorch library along with CUDA 11. The base DistilBERT model is obtained from Huggingface [16] with transformers library version 4.12. For both datasets, the models are set to run for a maximum of 60 epochs with an early stopping mechanism, which is triggered if there is no improvement in the score for 15 consecutive epochs. Both datasets use the Stochastic Gradient Descent optimizer with a learning rate of 0.1. The training data is further split into training and validation data, and the hyperparameters are determined on this validation data.
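
A hedged sketch of this training schedule is shown below; train_one_epoch and evaluate stand in for the actual training and validation routines, which are not shown:

```python
# Sketch of the schedule above: at most 60 epochs, early stopping with a
# patience of 15 epochs on the validation score.
best_score, patience, stale_epochs = 0.0, 15, 0
for epoch in range(60):
    train_one_epoch(model, optimizer, train_loader)
    score = evaluate(model, val_loader)   # e.g. validation F1
    if score > best_score:
        best_score, stale_epochs = score, 0
    else:
        stale_epochs += 1
        if stale_epochs >= patience:
            break
```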

We were able to significantly reduce the number of trainable parameters. A comparison of trainable parameters with other transformer-based models is given in Table 3.

Table 3. Comparison of trainable parameters

The detailed results of our model on the different categories of test data are tabulated in Table 4. It can be observed that our model performed fairly well in all triple-category scenarios. The slight drop in score on the WebNLG dataset compared with the NYT dataset may be attributed to the fact that the WebNLG Main category has most of its data in SEO and EPO form. For the NYT dataset, our model performed best when there were 4 triples in the sentence, and for the WebNLG dataset, the model performed best when there were 3 triples in the sentence. These results show that our transformer-based model is capable of handling complex scenarios in relational triple extraction.

Table 4. Evaluation results on NYT and WebNLG dataset

7 Conclusion and Future Scope

In this paper, we proposed a lightweight transformer-based model for Relation Extraction based on a joint entity-relation extraction framework. Our model performed well in all triplet overlapping scenarios, such as Entity Pair Overlap (EPO) and Single Entity Overlap (SEO), and can extract multiple triplets from the same sentence while reducing the number of trainable parameters in the transformer. In the future, we aim to reduce the number of trainable parameters further while improving the performance.