1 Introduction

Developing automatic tools that can process large amounts of text efficiently is a long-standing research topic. Natural language processing (NLP) is the technology that enables machines to process human language, and it has evolved rapidly with the development of artificial intelligence. With the fast growth of the Web, the scale of unstructured text online is increasing at an astonishing speed. Machine learning, which learns automatically from data, has therefore been applied to this problem, and applications such as information extraction and machine translation have emerged. Many innovative techniques have been proposed to implement such applications. Among them, named entity recognition is a key task, and it is the basis of text comprehension and processing.

However, traditional named entity recognition only considers how to identify entities belonging to a small set of types. The task of entity type classification, which aims to assign semantic types to entity mentions in their context, has therefore attracted growing research attention. MUC-7 (Krupka and Hausman 1998) defined three common types: Person, Organization, and Location. CoNLL03 added a miscellaneous type (Sang and Meulder 2003). Recent work also suggests that a larger set of fine-grained types can improve NLP applications.

Existing fine-grained type classification systems obtain the local context of an entity through a fixed window. However, local context with a fixed window may lead to ambiguity because it provides inadequate external information. To solve this problem, this paper proposes a novel fine-grained entity type classification method for unstructured text based on adaptive context information, which uses not only global information but also the local context in a sliding window. The proposed method locates the context information by analyzing the sentence structure, so the context of an entity can be accessed more efficiently. By mining the text of the paragraph in which the entity is located and using an automatic summarization technique to obtain the global information of the text, the proposed method reduces the ambiguity of entity classification and improves its accuracy. Experiments were conducted on two public datasets, FIGER and OntoNotes, and comparison with related methods confirms the validity of the proposed approach: the loose micro-F score of our method is 75.35% on the FIGER dataset and 65.35% on the OntoNotes dataset.

The remainder of this paper is organized as follows: Sect. 2 reviews related research on fine-grained entity type classification, Sect. 3 describes the details of our proposed method, Sect. 4 presents the experimental results, and Sect. 5 draws conclusions.

2 Related work

Most named entity recognition systems support only a small set of types, which is far from adequate for many NLP tasks. For example, in question answering we need to know the exact type of a candidate answer, such as Event, Tools, or Product. The task of fine-grained type classification has therefore been widely studied in the literature.

Recently, many studies have focused on classifying entity mentions in text into fine-grained types. Although researchers have proposed ways to deal with vague knowledge (Singh and Kumar 2015), most current studies still assume the knowledge is well defined. Fleischman and Hovy (2002) classified mentions into eight subtypes of the Person type with a decision tree based on local contextual word features and WordNet synonyms. Giuliano and Gliozzo (2008) proposed a method to further classify entities into 21 subtypes of the Person type. To the best of our knowledge, the first work on fine-grained entity type classification was by Lee et al. (2006), who defined 147 fine-grained types. Their main purpose was to apply the types in question answering systems, and they trained and evaluated a conditional random field model on a manually annotated Korean dataset. Sekine (2009) defined 200 coarse types that could serve as primitives for fine-grained types and emphasized the necessity of a large type set for entity type classification. Rahman and Ng (2010) defined a type system containing 29 types and 92 subtypes.

Xiao and Weld (2012b) derived 112 types from Freebase and automatically created training data from Wikipedia by the distant supervision method (Mintz et al. 2009). In addition, they created FIGER, a training and evaluation dataset of newspaper articles, and demonstrated that fine-grained types could improve the accuracy of a relation extraction system. Nevertheless, it has been argued that fine-grained types should be organized in a hierarchical taxonomy. Yosef et al. (2015) thus organized 505 types from YAGO (Hoffart et al. 2013) into such a taxonomy, with the deepest branch reaching 9 levels, and showed that this type set improved results on the FIGER dataset. They also developed a multi-label hierarchical classification system. Corro et al. (2015) used a similar method to build the most fine-grained entity type classification system to date, covering more than 16,000 types in the WordNet hierarchy.

Most of the above methods assumed that type classification could be done independently of the context of the entity mention. Gillick et al. (2016) first introduced fine-grained classification with context information, limiting type labels to those that could be deduced from the entity mention's context. Moreover, they introduced a new manually annotated evaluation dataset derived from OntoNotes and addressed the label noise induced by distant supervision. Ren et al. (2016) proposed a method to further reduce label noise, improving performance on the FIGER and OntoNotes datasets. Yogatama et al. (2015) proposed mapping manually crafted features and type labels to embeddings so that information can be shared between related types and features. Munkhdalai et al. (2015) proposed an active co-training algorithm for biomedical named-entity recognition, which efficiently exploits a large amount of unlabeled data by selecting a small number of examples with useful information and comprehensive patterns. Viswanathan and Krishnamurthi (2012) presented a modified bidirectional breadth-first search algorithm for finding paths between two entities that pass through intermediate entities, ranking the paths according to users' needs. Vijayarajan et al. (2016) proposed a generic framework for ontology-based information retrieval and image retrieval in web data.

Different from previous models that relied on manually crafted features, Dong et al. (2015) first introduced a two-part hybrid neural model, without manually crafted features, to classify entity mentions into a wide-coverage set of 22 types derived from DBpedia. They used recurrent neural networks to recursively obtain the entity mention representation and a multi-layer perceptron to obtain the context representation; the model did not use any external resources. After that, Shimaoka et al. (2016, 2017) used recursive neural networks to compose context representations and employed an attention mechanism to let the model focus on relevant expressions. On this basis, Shimaoka et al. (2017) combined learnt and manually crafted features and used a hierarchical encoding of labels that enables parameter sharing between labels in the same hierarchy. Recently, Gotti and Langlais (2016) described a recall-oriented open information extraction system designed to extract knowledge from French corpora, the first work to focus on such a cross-domain, recall-oriented approach in open information extraction. Cui et al. (2017) proposed a hybrid neural network model for type classification of entity mentions with a fine-grained taxonomy and reported state-of-the-art performance on the FIGER dataset. Barua and Patel (2017) proposed using a search engine's query suggestions as an external knowledge source, instead of gazetteers, for named entity classification in NER systems; their experiments on the MSM Challenge dataset show that QS-NEC is effective for classifying entity mentions.

3 The model for fine-grained entity type classification with adaptive context

3.1 Overall model

We propose a novel LSTM-based model for fine-grained type classification.

We first define an entity mention as follows:

$$\begin{aligned} E_i \in E \, \left( {1\le i\le E_\mathrm{num}}\right) \end{aligned}$$
(1)

where E represents a set of entities and \(E_\mathrm{num}\) is the size of the set. We then obtain the local context of the entity mention through the sentence structure. The context is defined as follows:

$$\begin{aligned}&l_i \in L \, \left( {1\le i\le K}\right) \end{aligned}$$
(2)
$$\begin{aligned}&r_i \in R \, \left( {1\le i\le T}\right) \end{aligned}$$
(3)

where L represents the word set of the left context of the entity mention and K is the size of L; R represents the word set of the right context of the entity mention and T is the size of R.

After that, we take the document containing the sentence as the global context and obtain its abstract through an automatic summarization technique. Each word in the abstract is defined as follows:

$$\begin{aligned} g_i \in G \, \left( {1\le i\le G_\mathrm{num}}\right) \end{aligned}$$
(4)

where G represents the set of words in the abstract and \(G_\mathrm{num}\) is the size of the set. In addition, we obtain the manually crafted features of each entity mention, which are described in detail in the following section.

Fig. 1 Model of fine-grained entity type classification

Table 1 Manually crafted features

Finally, we merge the following four parts, feed them into two dense layers, and pass the outputs of the dense layers to the softmax layer to compute the probability:

  • entity mention representation \(r_\mathrm{e} \),

  • local context representation \(r_\mathrm{c}\),

  • global context representation \(r_\mathrm{g}\) and

  • manually crafted feature representation \(r_\mathrm{f}\).

Each input \(x_i\) has a unique label \(y_i\):

$$\begin{aligned}&\left\{ {\left( {x_1, y_1}\right) , \left( {x_2, y_2} \right) ,\ldots ,\left( {x_n, y_n}\right) }\right\} y_i \in \left\{ {1,2,\ldots ,D}\right\} \end{aligned}$$
(5)
$$\begin{aligned}&x=\left[ {r_\mathrm{e}\,r_\mathrm{c}\, r_\mathrm{g}\, r_\mathrm{f}}\right] \end{aligned}$$
(6)

where n is the number of inputs and D is the size of the label set. Given an input x, we compute the probability \(p(y=d\,|\,x)\) for each type d:

$$\begin{aligned} h(x_i )=\left[ \begin{array}{c} p\left( {y_i =1\,|\,x_i ;W_y}\right) \\ p\left( {y_i =2\,|\,x_i ;W_y}\right) \\ \vdots \\ p\left( {y_i =D\,|\,x_i ;W_y}\right) \end{array}\right] =\frac{1}{\sum _{d=1}^D {e^{W_{yd} x_i}}}\left[ \begin{array}{c} e^{W_{y1} x_i} \\ e^{W_{y2} x_i} \\ \vdots \\ e^{W_{yD} x_i} \end{array}\right] \end{aligned}$$
(7)

We first assign to the entity the type d with the maximum probability. We then additionally assign every type d whose probability exceeds a threshold \(\gamma \). The overview of our model is shown in Fig. 1.
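To make the decoding rule concrete, the following is a minimal Python/NumPy sketch of the softmax-and-threshold step; the logit values and the threshold are illustrative assumptions, not the trained model's actual parameters:

```python
import numpy as np

def predict_types(logits, gamma=0.5):
    """Keep the highest-probability type, then add every other
    type whose probability exceeds the threshold gamma."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                  # softmax over D types
    best = int(np.argmax(probs))
    extra = {d for d, p in enumerate(probs) if p > gamma}
    return sorted({best} | extra)

# Toy example with D = 4 types
print(predict_types(np.array([2.0, 0.1, 1.8, -1.0]), gamma=0.3))  # [0, 2]
```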

3.2 Manually crafted features

For each entity mention, there is a binary feature indicator vector \(f(e)\in \left\{ {0,1}\right\} ^{D_f \times 1}\), which is fed to the softmax layer together with the other three representations. The manually crafted features are shown in Table 1; the example used in Table 1 to extract them is "... the person who [Barack H. Obama] first picked ...". The features we use are similar to those of Gillick et al. (2016) and Yogatama et al. (2015), and the same as those of Shimaoka et al. (2017). We use the clustering method of Brown et al. (1992), which is widely used and whose clusters are publicly available. In addition, we use LDA (Blei et al. 2003) to learn a set of 15 topics.

We map the vector f(e) to a low-dimensional projection to compute the manually crafted feature representation \(r_\mathrm{f} \in R^{D_l \times 1}\):

$$\begin{aligned} r_\mathrm{f} =W_\mathrm{f} f(e) \end{aligned}$$
(8)

where \(W_\mathrm{f} \in R^{D_l \times D_f}\) is the mapping matrix.

3.3 Entity mention representation

Given an entity mention, we compute the average of the embeddings of all words in the mention, because an entity mention may consist of more than one word; for example, "New York" consists of two words. Formally, let the words of the entity mention be \(e_i\) \((1\le i\le e_\mathrm{num})\), where \(e_\mathrm{num}\) is the number of words in the mention. We then compute the entity mention representation as follows:

$$\begin{aligned} r_\mathrm{e} =\frac{1}{e_\mathrm{num}}\sum _{i=1}^{e_\mathrm{num}} {u(e_i)} \end{aligned}$$
(9)

where u maps a word to its embedding and \(r_\mathrm{e} \in R^{D_e \times 1}\) is the mention representation. We use this simple averaging because a more complex composition method may lead to overfitting.
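As a minimal illustration of Eq. (9), the following Python sketch averages word embeddings; the tiny embedding table and 2-dimensional vectors are toy assumptions, whereas the paper uses 300-dimensional GloVe vectors:

```python
import numpy as np

# Toy embedding table standing in for the mapping u
u = {"new": np.array([0.1, 0.3]), "york": np.array([0.5, -0.1])}

def mention_representation(words, u, dim=2):
    """Average the embeddings of all words in the entity mention."""
    vecs = [u.get(w.lower(), np.zeros(dim)) for w in words]
    return np.mean(vecs, axis=0)

print(mention_representation(["New", "York"], u))  # [0.3  0.1]
```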

3.4 Local context representation

Given an entity mention, we need its local context information to predict its type. In previous work, most type classification methods used a fixed window size on both the left and the right of the entity mention to obtain the context. However, local context obtained this way may miss key information when the sentence is long. To solve this problem, our method employs a sliding-window mechanism that adapts the window size: we determine the window size for the local context from the sentence boundaries. Formally, let the left-side context be \(l_1,\ldots ,l_K\) and the right-side context be \(r_1, \ldots ,r_T\), where K is the window size of the left context and T is the window size of the right context within the sentence. The specific steps to obtain the adaptive context are shown in Algorithm 1, and a sketch in code follows it.

Algorithm 1 Steps to obtain the adaptive local context
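Since Algorithm 1 is given as a figure, the following Python sketch reconstructs its intent from the description above; the token list, the mention indices, and the use of explicit "BEG"/"END" boundary tokens (see Sect. 4.3) are simplifying assumptions:

```python
def adaptive_context(tokens, start, end):
    """Return the left context l_1..l_K and right context r_1..r_T of
    the mention tokens[start:end], bounded by the enclosing sentence
    instead of a fixed-size window."""
    left = tokens[tokens.index("BEG") + 1:start]
    right = tokens[end:tokens.index("END")]
    return left, right

sent = ["BEG", "I", "saw", "New", "York", "at", "night", "END"]
print(adaptive_context(sent, 3, 5))  # (['I', 'saw'], ['at', 'night'])
```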

We use a combination of bidirectional LSTMs (Graves 2012) and an attention mechanism to compute the local context representation. Compared with a unidirectional LSTM, bidirectional LSTMs capture more information because they process the words of the sentence in both directions, and the attention mechanism lets the model focus on relevant expressions. The computation is as follows:

First, the outputs of the bidirectional LSTMs are \(\overrightarrow{h_1^l},\overleftarrow{h_1^l},\ldots ,\overrightarrow{h_K^l},\overleftarrow{h_K^l}\) (left context) and \(\overrightarrow{h_1^r},\overleftarrow{h_1^r},\ldots ,\overrightarrow{h_T^r},\overleftarrow{h_T^r}\) (right context). For each output, we use a two-layer feed-forward neural network with weight matrices \(W_d \in R^{D_a \times 2D_h}\) and \(W_a \in R^{1\times D_a}\) to compute a hidden representation \(v_i^l \in R^{D_a \times 1}\) and a scalar value \(\tilde{a}_i^l\):

$$\begin{aligned} v_i^l= & {} \tanh \left( W_d \left[ \begin{array}{c} \overrightarrow{h_i^l} \\ \overleftarrow{h_i^l} \end{array}\right] \right) \end{aligned}$$
(10)
$$\begin{aligned} \tilde{a}_i^l= & {} \exp \left( W_a v_i^l \right) \end{aligned}$$
(11)

Then, we normalize the scalar values so that the sum is 1:

$$\begin{aligned} a_i^l =\frac{\tilde{a}_i^l}{\sum _{j=1}^K {\tilde{a}_j^l} +\sum _{j=1}^T {\tilde{a}_j^r}} \end{aligned}$$
(12)

The normalized scalar values \(a_i\) are called attentions; the attentions for the right context are computed in the same way. Finally, we compute the attention-weighted sum of the bidirectional LSTM outputs as the local context representation:

$$\begin{aligned} r_\mathrm{c} =\sum _{i=1}^K {a_i^l \left[ \begin{array}{c} \overrightarrow{h_i^l} \\ \overleftarrow{h_i^l} \end{array}\right] } +\sum _{i=1}^T {a_i^r \left[ \begin{array}{c} \overrightarrow{h_i^r} \\ \overleftarrow{h_i^r} \end{array}\right] } \end{aligned}$$
(13)
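A NumPy sketch of the attention computation in Eqs. (10)-(13), assuming the bidirectional LSTM outputs have already been computed; all dimensions and the random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D_h, D_a, K, T = 4, 3, 2, 3            # toy dimensions
W_d = rng.normal(size=(D_a, 2 * D_h))  # attention weight matrices
W_a = rng.normal(size=(1, D_a))

# Concatenated forward/backward LSTM outputs, one row per position
H_left = rng.normal(size=(K, 2 * D_h))
H_right = rng.normal(size=(T, 2 * D_h))

def unnormalized_attention(H):
    v = np.tanh(H @ W_d.T)             # Eq. (10)
    return np.exp(v @ W_a.T).ravel()   # Eq. (11)

s_l = unnormalized_attention(H_left)
s_r = unnormalized_attention(H_right)
z = s_l.sum() + s_r.sum()              # Eq. (12): joint normalization
a_l, a_r = s_l / z, s_r / z

r_c = a_l @ H_left + a_r @ H_right     # Eq. (13): weighted sum
print(r_c.shape)                       # (8,), i.e. 2 * D_h
```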

3.5 Global context representation

Given an entity mention and the sentence containing it, the document containing that sentence is traditionally treated as the global context. However, a document often contains redundant information or noise. To avoid this problem, we apply automatic summarization to the document to extract the most relevant information.

The automatic summarization algorithm used in this paper augments a traditional rule-based algorithm with semantic information and simplified processing steps to improve the accuracy and conciseness of the abstract. The flowchart of the algorithm is shown in Fig. 2.

Fig. 2 Flowchart of the automatic summarization algorithm

First, we need to create a word graph. A node of the word graph can be a word, a sentence, or even a document; in this paper we select words as nodes and use vocabulary co-occurrence to construct the edges: if two words occur in the same sentence, we connect their nodes. However, with too many words in the document this produces too many edges, and two distant words in the same sentence often have no relation, resulting in many interfering edges. To address this problem, we refine the construction by introducing a dependency parser; the two methods are compared in Table 2. We can then construct the word graph. The example sentence used for the word graph in Fig. 3 is "Alice, who had been reading about Spacy, saw Bob in the library." After constructing the word graph, we need to compute the importance of words. This paper uses the classical graph algorithm HITS (Kleinberg 1999), whose basic idea is that the two scores mutually reinforce each other under the following assumptions:

Table 2 Comparison of co-occurrence method and dependency parser method to construct a word graph
Fig. 3 An example of a word graph

  • Authority Score: a node is important if it is pointed to by many important nodes.

  • Hub Score: a node is important if it points to many important nodes.

The formulas of the two assumptions are as follows:

$$\begin{aligned} \hbox {HITS}_A (V_i )= & {} \sum _{V_j \in In(V_i)} {\hbox {HITS}_H (V_j)} \end{aligned}$$
(14)
$$\begin{aligned} \hbox {HITS}_H (V_i )= & {} \sum _{V_j \in Out(V_i)} {\hbox {HITS}_A (V_j)} \end{aligned}$$
(15)

We take the sum of the authority score and the hub score as the importance score of a word, and then compute the importance score of a sentence as the sum of the importance scores of its words:

$$\begin{aligned} \hbox {Score}(S_t )=\sum _{w_i \in S_t} {\hbox {Score}(w_i)} \end{aligned}$$
(16)
Fig. 4 An illustration of hierarchical label encoding

Finally, we select the top-ranked sentences as the abstract of the document.
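The following Python sketch ties the summarization steps together: build a word graph, run HITS (Eqs. 14-15), and score sentences (Eq. 16). For brevity it uses simple co-occurrence edges on toy sentences; the actual system builds edges from dependency relations as described in Sect. 4.3:

```python
import itertools
import numpy as np

sentences = [["alice", "reads", "about", "spacy"],
             ["bob", "reads", "in", "the", "library"]]

# Undirected co-occurrence word graph as an adjacency matrix
words = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(words)}
A = np.zeros((len(words), len(words)))
for s in sentences:
    for u, v in itertools.combinations(set(s), 2):
        A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1

# Power iteration for HITS authority and hub scores
auth = hub = np.ones(len(words))
for _ in range(50):
    auth = A.T @ hub                      # Eq. (14)
    hub = A @ auth                        # Eq. (15)
    auth /= np.linalg.norm(auth)
    hub /= np.linalg.norm(hub)

# Word importance = authority + hub; sentence score = sum over words
word_score = {w: auth[i] + hub[i] for w, i in idx.items()}
ranked = sorted(sentences, key=lambda s: sum(word_score[w] for w in s),
                reverse=True)
print(ranked[0])   # top-ranked sentence becomes the abstract
```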

3.6 Hierarchical label encoding

Fine-grained entity type classification differs from traditional classification in that its types tend to form a forest of type hierarchies. For example, teacher is a subtype of education, and education is a subtype of person. Hierarchical label encoding enables parameter sharing, because labels that co-occur become closer in the encoded space. For instance, if the candidate labels of an entity mention are person, artist, and location, then person and artist are closer to each other. Concretely, we compute the weight matrix \(W_y\) for the softmax layer from a learnt weight matrix \(V_y\) and a constant sparse binary matrix S:

$$\begin{aligned} W_y^T =V_y S \end{aligned}$$
(17)

The hierarchical label encoding is illustrated in Fig. 4. Each type is mapped to a unique column in S. For example, the column for /person is encoded as [1, 0, 0, 0, ...], /person/education as [1, 1, 0, 0, ...], and /person/education/student as [1, 1, 1, 0, ...]. In this way, parameters are shared between labels in the same hierarchy.
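A small Python sketch of constructing the binary matrix S from type paths; the four-type inventory and the row ordering are toy assumptions for illustration:

```python
import numpy as np

types = ["/person", "/person/education",
         "/person/education/student", "/location"]

def prefixes(t):
    """All ancestor paths of a type, including the type itself."""
    parts = t.strip("/").split("/")
    return ["/" + "/".join(parts[:k]) for k in range(1, len(parts) + 1)]

# One row per distinct path prefix; a column switches on one bit per
# hierarchy level, so /person and /person/education share a bit and
# W_y^T = V_y S ties their parameters together.
rows = sorted({p for t in types for p in prefixes(t)})
row_idx = {p: i for i, p in enumerate(rows)}
S = np.zeros((len(rows), len(types)), dtype=int)
for j, t in enumerate(types):
    for p in prefixes(t):
        S[row_idx[p], j] = 1

print(rows)
print(S)
```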

4 Experiment results and analysis

4.1 Overview

The overall flowchart of our experiment is shown in Fig. 5. The experiment can be divided into several parts as follows:

  1. After preprocessing the corpus, we obtain the entity mention representation and the local context representation.

  2. We obtain the global context representation by the automatic summarization technique.

  3. The four representations are fed into the model as input; we then tune the parameters to obtain the best model. Finally, we evaluate the model on the test set.

In our experiment, we use Ubuntu 16.04 as the experimental environment, Python 2.7 as the development language, and Keras 2.0.4 with TensorFlow 0.11 as the framework. We employ this older version of TensorFlow for better compatibility with the FIGER and OntoNotes datasets.

Fig. 5 Flowchart of the experiment

4.2 Dataset and word embedding

To train and evaluate our model, we use two publicly available datasets. One is OntoNotes (Gillick et al. 2016), which consists of 13,109 news documents, 77 of which are manually annotated test documents. The other is FIGER (Xiao and Weld 2012a), which contains 112 fine-grained types. The type inventories of the two datasets are shown in Figs. 6 and 7, respectively.

We use 300-dimensional cased word embeddings trained on 840 billion tokens with the GloVe algorithm (Pennington et al. 2014). Pre-trained word embeddings let the model converge more quickly, saving training time and improving accuracy. For words absent from the pre-trained embeddings, we use the embedding of the "unk" token.
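A minimal sketch of loading the pre-trained vectors with an "unk" fallback; the file name follows the standard GloVe distribution and is an assumption about the local setup:

```python
import numpy as np

def load_glove(path="glove.840B.300d.txt", dim=300):
    """Read GloVe vectors from a text file of 'word v1 ... v_dim' lines."""
    emb = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) == dim + 1:          # skip malformed lines
                emb[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return emb

def lookup(emb, word, dim=300):
    """Fall back to the 'unk' vector for out-of-vocabulary words."""
    return emb.get(word, emb.get("unk", np.zeros(dim, dtype=np.float32)))
```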

4.3 Entity mention and context

The entity mention and the local context can both be obtained from one sentence, so we extract them jointly. First, we tag the location of the entity mention and extract the manually crafted features; the entity mention itself is obtained from its location. Then, we insert the token "BEG" at the beginning of the sentence and the token "END" at the end to mark the sentence boundaries, from which we obtain the local context of the entity mention.

As for the global context, we first find the document in which the sentence occurs. Then, after segmenting the document text into sentences, we construct the word graph with the dependency parser for each sentence. After that, we obtain the abstract using the method described above.

One of the most important tasks when processing English text is to interpret punctuation marks correctly. The full stop "." often marks an abbreviation: in "a.m." (meaning morning), the "." does not mark the end of a sentence. A related example is the sentence "I like U.S.A.", in which the last "." marks both the end of the sentence and the abbreviation "U.S.A.". In this paper, we use the Punkt sentence tokenizer of NLTK to perform segmentation. It detects sentence boundaries without semantic information and handles abbreviations well.
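For example, with NLTK's pre-trained Punkt model (a one-time model download is assumed):

```python
import nltk
nltk.download("punkt", quiet=True)  # one-time model download

text = "I arrived at 9 a.m. yesterday. I like U.S.A."
for sent in nltk.sent_tokenize(text):   # sent_tokenize uses Punkt
    print(sent)
# Expected:
#   I arrived at 9 a.m. yesterday.
#   I like U.S.A.
```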

Fig. 6 Types in OntoNotes dataset

Fig. 7 Types in FIGER dataset

Then, we use the Stanford dependency parser through NLTK to perform dependency parsing. An example of a dependency parse is shown in Table 3.
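A sketch of obtaining dependency triples through NLTK; it assumes a Stanford CoreNLP server is running locally (started separately, see the CoreNLP documentation), which is the currently supported way to call the Stanford parser from NLTK:

```python
from nltk.parse.corenlp import CoreNLPDependencyParser

parser = CoreNLPDependencyParser(url="http://localhost:9000")
sentence = "Alice, who had been reading about Spacy, saw Bob in the library."
parse, = parser.raw_parse(sentence)
for governor, relation, dependent in parse.triples():
    print(governor, relation, dependent)
# e.g. ('saw', 'VBD') nsubj ('Alice', 'NNP')
```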

After screening various dependency relations, we select several classical relations to construct the graph model; the selected relations are shown in Table 4.

Then we use the HITS algorithm to compute the weight of each sentence. After ranking the sentences by their weights, we select the top-two sentences as the abstract.

4.4 Model parameter settings

The parameter settings of our proposed model (Fig. 1) are shown in Tables 5 and 6: Table 5 lists the settings for the entire network, and Table 6 lists the settings for the individual network layers.

Table 3 An example of dependency parser
Table 4 Dependency relations in this paper
Table 5 Parameter settings 1
Table 6 Parameter settings 2

A loss function evaluates the degree of inconsistency between the predicted value and the true value. In this paper, we adopt the cross-entropy loss. Formally, each input \(x_i\) has a unique label \(y_i\):

$$\begin{aligned}&\left\{ {\left( {x_1, y_1}\right) , \left( {x_2, y_2}\right) ,\ldots ,\left( {x_n, y_n}\right) }\right\} y_i \in \left\{ {1,2,\ldots ,D}\right\} \end{aligned}$$
(18)
$$\begin{aligned}&x=\left[ {r_\mathrm{e}\, r_\mathrm{c}\, r_\mathrm{g}\, r_\mathrm{f}}\right] \end{aligned}$$
(19)

where n is the number of inputs and D is the size of the label set. We then compute the loss function L:

$$\begin{aligned} L=-\frac{1}{n}\left[ {\sum _{i=1}^n {\sum _{d=1}^D {1\left\{ {y_i =d}\right\} \log \frac{e^{W_{yd} x_i}}{\sum _{j=1}^D {e^{W_{yj} x_i}}}}}}\right] \end{aligned}$$
(20)

where \(1\left\{ {y_i =d}\right\} \) is an indicator function: it equals 1 when \(y_i =d\) is true and 0 otherwise.
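A NumPy sketch of Eq. (20) with the usual log-sum-exp stabilization; the input sizes and random values are illustrative:

```python
import numpy as np

def cross_entropy(X, y, W):
    """Mean cross-entropy loss of Eq. (20).
    X: (n, d) inputs, y: (n,) integer labels in [0, D), W: (D, d)."""
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
y = rng.integers(0, 3, size=5)
W = rng.normal(size=(3, 8))
print(cross_entropy(X, y, W))
```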

In addition, the choice of optimizer is important. This paper uses Adam (Kingma and Ba 2015), which improves on SGD: it adapts an appropriate learning rate for each parameter and uses a momentum-like decay method.

Let each parameter \(\theta _i\) in the model share the same base learning rate \(\eta \), and let \(g_t\) be the gradient of the objective function with respect to the parameters at step t:

$$\begin{aligned} m_t= & {} \beta _1 m_{t-1} +(1-\beta _1 )g_t \end{aligned}$$
(21)
$$\begin{aligned} v_t= & {} \beta _2 v_{t-1} +(1-\beta _2 )g_t^2 \end{aligned}$$
(22)

where \(\beta _1, \beta _2\) are the decay rates, \(m_t\) is the exponentially weighted average of the gradient, and \(v_t\) is the exponentially weighted average of the squared gradient. Both \(m_t\) and \(v_t\) are initialized to zero, which biases them toward zero during the early steps, especially when \(\beta _1\) and \(\beta _2\) are close to 1. To correct this bias, we apply the following corrections to \(m_t\) and \(v_t\):

$$\begin{aligned} m_t^{\prime }= & {} \frac{m_t}{1-\beta _1^t} \end{aligned}$$
(23)
$$\begin{aligned} v_t^{\prime }= & {} \frac{v_t}{1-\beta _2^t} \end{aligned}$$
(24)

The update equation is as follows:

$$\begin{aligned} \theta _{t+1} =\theta _t -\frac{\eta }{\sqrt{v_t^{\prime }}+\varepsilon }m_t^{\prime } \end{aligned}$$
(25)

Kingma and Ba (2015) pointed out that Adam performs better with this bias correction, particularly when gradients are sparse during convergence.
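A NumPy sketch of one Adam update implementing Eqs. (21)-(25); the hyperparameter values are the common defaults, not necessarily those used in our experiments:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g                       # Eq. (21)
    v = beta2 * v + (1 - beta2) * g ** 2                  # Eq. (22)
    m_hat = m / (1 - beta1 ** t)                          # Eq. (23)
    v_hat = v / (1 - beta2 ** t)                          # Eq. (24)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # Eq. (25)
    return theta, m, v

# Minimize f(theta) = theta^2 starting from theta = 1.0
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, eta=0.01)
print(theta)   # close to 0
```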

4.5 Evaluation criteria

Three measures are used to evaluate performance: strict, loose macro, and loose micro. We denote the true set of types of instance i as \(T_i\) and the predicted set as \(T_i^{\prime }\); N is the number of instances. The three ways of computing P (precision) and R (recall) are as follows:

(1) Strict

$$\begin{aligned} P=R=\frac{1}{N}\sum _{i=1}^N {\delta \left( T_i^{\prime }=T_i\right) } \end{aligned}$$
(26)

(2) Loose macro

$$\begin{aligned} P= & {} \frac{1}{N}\sum _{i=1}^N {\frac{\left| {T_i^{\prime } \cap T_i}\right| }{\left| {T_i^{\prime }}\right| }} \end{aligned}$$
(27)
$$\begin{aligned} R= & {} \frac{1}{N}\sum _{i=1}^N {\frac{\left| {T_i^{\prime } \cap T_i}\right| }{\left| {T_i}\right| }} \end{aligned}$$
(28)

(3) Loose micro

$$\begin{aligned} P= & {} \frac{\sum _{i=1}^N {\left| {T_i^{\prime }\cap T_i}\right| }}{\sum _{i=1}^N {\left| {T_i^{\prime }}\right| }} \end{aligned}$$
(29)
$$\begin{aligned} R= & {} \frac{\sum _{i=1}^N {\left| {T_i^{\prime }\cap T_i}\right| }}{\sum _{i=1}^N {\left| {T_i}\right| }} \end{aligned}$$
(30)

Then, we can compute the F score:

$$\begin{aligned} F=\frac{2\times P\times R}{P+R} \end{aligned}$$
(31)
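A Python sketch of the three measures (Eqs. 26-31); it assumes non-empty true and predicted type sets, and the toy instances are illustrative:

```python
def evaluate(true_sets, pred_sets):
    """Return (strict-F, loose-macro-F, loose-micro-F)."""
    n = len(true_sets)
    f1 = lambda p, r: 2 * p * r / (p + r) if p + r else 0.0

    strict = sum(t == p for t, p in zip(true_sets, pred_sets)) / n

    macro_p = sum(len(t & p) / len(p) for t, p in zip(true_sets, pred_sets)) / n
    macro_r = sum(len(t & p) / len(t) for t, p in zip(true_sets, pred_sets)) / n

    inter = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    micro_p = inter / sum(len(p) for p in pred_sets)
    micro_r = inter / sum(len(t) for t in true_sets)

    return strict, f1(macro_p, macro_r), f1(micro_p, micro_r)

true = [{"/person", "/person/artist"}, {"/location"}]
pred = [{"/person"}, {"/location", "/organization"}]
print(evaluate(true, pred))   # (0.0, 0.75, 0.666...)
```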

4.6 Experiment results

The experimental results of our model on the FIGER dataset are compared with those of the baseline methods in Table 7 and Fig. 8.

Table 7 Results on FIGER
Fig. 8 Results on FIGER

Table 8 Results on OntoNotes
Fig. 9 Results on OntoNotes

From Table 7 and Fig. 8, we observe that the results on FIGER improve on those of the previous methods reviewed in Sect. 2. The results of our model on OntoNotes are shown in Table 8 and Fig. 9.

From Table 8 and Fig. 9, we can see that the results of our model on OntoNotes improve on those of the model proposed by Shimaoka et al. (2017). Taken together with the results on FIGER, we attribute the difference in the size of the improvement to differences between the datasets.

To sum up, experiment results verify the effectiveness of our model for fine-grained entity type classification.

5 Conclusions and future work

In this paper, we analyzed existing type classification methods and proposed a fine-grained type classification method that combines global information with sliding-window context. We added dynamic global information to the traditional type classification approach: the global information is acquired with an automatic summarization technique that removes redundant information and is then fed into the neural network to improve the accuracy of type prediction. In addition, unlike the fixed-window context methods of previous work, we proposed an adaptive window adjustment method that locates the context information by analyzing the sentence structure. Finally, our experiments demonstrated that the performance of our model improves on that of other models. The strict, loose macro, and loose micro scores of our model on OntoNotes reached 52.47, 71.42 and 65.35%, respectively, which is better overall than FIGER+PLE, K-WASABIE and Attentive + Manually crafted.

In the future, several directions are worth exploring. First, we hope to explore more entity types by distant supervision, where open text can be used for fine-grained type classification. Second, we would like to investigate whether our method can recognize multiple entities when they appear in the same sentence.