1 Introduction

With the widespread application of high-throughput techniques and the surge in gene and protein analysis, the biomedical literature is growing exponentially. Moreover, thanks to Open Access, collections of manuscripts, ranging from very general and highly distributed ones to very specific and localized ones, are publicly available. For instance, the PubMed Central literature database contains over 3.3 million references to full-text journal papers, covering a wide range of biomedical fields. Given this volume, it is very difficult for biologists to keep up with new developments, even in a highly specialized area such as gene regulation or protein structure prediction. Effective management of large amounts of information and accurate knowledge extraction from the literature have therefore become vital. Because manual annotation by biomedical experts is time-consuming and expensive, automatic text mining methods are urgently needed to help biologists and doctors organize and structure these materials.

As one of the fundamental biomedical text mining tasks, Named Entity Recognition (NER), which aims to identify chunks of text referring to specific entities of interest, plays a key role in disease-treatment relation extraction [1], gene function identification [2] and semantic relation extraction between concepts in a molecular biology ontology [3]. In a general domain such as newswire, the task of named entity recognition is to recognize the names of places, persons and organizations [4]. In the biomedical domain, however, biologists and doctors pay much more attention to entities such as genes, proteins, DNA and RNA. Recently, several attempts have been made to adapt existing general-domain named entity recognition systems to the biomedical area [5–8]. However, owing to the non-standard nomenclature in biomedical research, few of them achieved satisfactory performance, so biomedical named entity recognition (Bio-NER) continues to be a challenging task.

Comparisons of naming conventions between the biomedical and newswire domains have already been discussed in [9–14] and can be summarized as follows: (1) Descriptive naming makes it difficult to identify the boundaries of entity names. For example, “specific immunoglobulin E” and “immediate-early gene” each consist of multiple words. Zhou et al. found that nearly 18.6 % of biomedical entity names in the GENIA V3.0 corpus contained at least four words [9]. Figure 1 depicts the distribution. (2) Conjunctions and disjunctions. Two or more entity names may share the same head noun through a conjunction or disjunction; for example, “mouse and human U6 DNA” denotes two entity names, “mouse U6 DNA” and “human U6 DNA”. In GENIA V3.0, about 2.06 % of biomedical entity names take this form [9]. (3) Lack of strict naming conventions. Capitalization and hyphens are used rather casually; e.g. Cholesterol, 5-Cholesten-3beta-ol and (3beta)-cholest-5-en-3-ol all denote the same chemical substance. Such non-standardized names can result in low recall and poor dictionary coverage [10, 11]. (4) Abundant abbreviations. Many entities in the biomedical domain have abbreviated names. Abbreviations can be ambiguous, which makes it difficult to classify them against an existing dictionary. For example, ‘TCF’ may refer to ‘T-cell Factor’ or ‘Tissue Culture Fluid’ in different articles. Chang et al. showed that 42.8 % of MEDLINE abstracts contain at least one abbreviation and 23.7 % contain two or more [12]. Liu et al. showed that 81.2 % of the abbreviations in MEDLINE abstracts are ambiguous, with 16.6 senses per abbreviation on average [13]. (5) Cascaded construction. One biomedical entity name is often embedded in another. Zhou et al. pointed out that 16.57 % of biomedical entity names in GENIA V3.0 have such a cascaded construction [9].

Fig. 1 Distribution of the number of words in biomedical entity names (GENIA V3.0) [9]

In short, entity names in the biomedical domain are much more complex than those in general domains such as newswire. It is therefore crucial to explore more informative features and effective methods for extracting knowledge from the biomedical literature. In this paper, we first introduce the three fundamental models of Bio-NER: dictionary-based, rule-based and machine learning based models. We then introduce six effective Bio-NER tools and compare their programming languages, features, underlying models and post-processing techniques. Subsequently, we present the corpora used, the evaluation criteria and the results. Based on this analysis, we finally offer suggestions to biologists, doctors and computer scientists on the selection of Bio-NER tools.

2 Biomedical named entity recognition models

Owing to the complex naming conventions and the importance of the task in the biomedical domain, several Biomedical Named Entity Recognition (Bio-NER) systems have been developed to recognize entities in biomedical texts. The models used in these systems fall into three categories: dictionary-based, rule-based and machine learning based methods.

2.1 Dictionary-based methods

Dictionaries are large collections of names that serve as entries for a specific entity class. Matching entries exactly against text is simple and precise, but it leads to low recall. To mitigate this problem, one can either use inexact matching techniques or fuzzify the dictionary by automatically generating typical spelling variants for each entry. Given the rapidly growing biomedical literature, however, it is impossible to construct a dictionary that covers all categories of entities. Thus, it is impractical to rely on dictionary-based methods alone to achieve a high F-score. Dictionary-based methods can nevertheless be integrated with other Bio-NER techniques to improve the accuracy of a hybrid algorithm. For example, Tsuruoka and Tsujii annotated proteins in GENIA V3.01 with a combination of a dictionary and Naive Bayes, achieving an F-score of 66.6 % [15]. Yang et al. improved recognition performance through bio-entity name dictionary expansion, including pre-keyword and post-keyword expansion, POS expansion, merging of adjacent bio-entity names and the exploitation of contextual cues [16].
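
The difference between exact and relaxed matching can be illustrated with a minimal Python sketch. It is not taken from any of the cited systems: the function names (`normalize`, `build_index`, `match`), the normalization rule and the toy dictionary are assumptions made for illustration. Names are compared after lowercasing and removing hyphens, spaces and punctuation, so that a spelling variant such as "T cell factor" still hits the entry "T-cell factor".

```python
import re


def normalize(name):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_index(dictionary):
    """Map each normalized dictionary name to its entity class."""
    return {normalize(name): label for name, label in dictionary.items()}


def match(tokens, index, max_len=4):
    """Greedy longest-match lookup of token n-grams against the index.

    Returns (start, end, label) spans with an exclusive end offset.
    """
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tokens[i:i + n]
            # Skip spans that start or end with pure punctuation.
            if not normalize(span[0]) or not normalize(span[-1]):
                continue
            key = normalize(" ".join(span))
            if key in index:
                found.append((i, i + n, index[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found


# Toy dictionary and sentence (illustrative only).
index = build_index({"T-cell factor": "protein", "interleukin-2": "protein"})
tokens = "The T cell factor binds Interleukin 2 .".split()
print(match(tokens, index))  # [(1, 4, 'protein'), (5, 7, 'protein')]
```

Neither span would be found by exact string matching against the two dictionary entries. Such aggressive normalization raises recall, but it can also produce false matches, which is one reason why dictionaries are usually combined with other methods.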

2.2 Rule-based methods

Rule-based models employ a large number of rules to separate different classes. In early rule-based systems, handcrafted rules described the composition of named entities and their context. For example, Fukuda et al. employed surface clues (capital letters, symbols, digits) to extract candidates for protein names [10]. Although these rule-based methods seemed promising initially, they did not perform well on larger datasets. For example, when Proux et al. evaluated their system on a larger corpus of 25,000 MEDLINE abstracts by sampling, the precision fell to 70 % [17]. Moreover, such systems cannot identify named entities that have never been seen before, and extending them to new entity classes is costly.
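
As an illustration of how surface clues can flag candidate names, the following Python sketch marks tokens that contain a letter together with either a digit or an upper-case letter after the first character. The rule and the helper names (`is_candidate`, `protein_candidates`) are hypothetical and far simpler than the rules of Fukuda et al. [10].

```python
import re


def is_candidate(token):
    """Hypothetical surface-clue rule for a single token."""
    has_letter = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    # Upper-case letters after the first character, so that ordinary
    # sentence-initial words such as "The" are not flagged.
    inner_upper = any(c.isupper() for c in token[1:])
    return has_letter and (has_digit or inner_upper)


def protein_candidates(sentence):
    """Return the tokens of a sentence that satisfy the surface-clue rule."""
    tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9/-]*", sentence)
    return [t for t in tokens if is_candidate(t)]


print(protein_candidates("The p53 protein and NF-kappaB activate IL-2 expression."))
# ['p53', 'NF-kappaB', 'IL-2']
```

A rule of this kind also fires on tokens such as "DNA", which shows why handcrafted rules need many exceptions and tend to lose precision on larger corpora.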

2.3 Machine learning based methods

Machine learning based Bio-NER models integrate several complex processing steps and perform better than the other two kinds of solutions [14]. With machine learning algorithms, researchers do not have to compose complex rules manually. In addition, these algorithms can identify new named entities and classes that are not covered by standard dictionaries. In [11], Nobata et al. experimented with three identification and two classification methods to recognize ten entity classes, including protein, DNA, RNA and cell type, achieving an F-measure between 58.98 and 66.24 % on 100 annotated MEDLINE abstracts using decision trees. Most machine learning models are based on the Support Vector Machine (SVM), the Hidden Markov Model (HMM) or Conditional Random Fields (CRFs). Because SVMs are time-consuming, SVM-based tools are not compared in this paper, although they perform very well in classification and regression.

2.3.1 HMM based methods

The Hidden Markov Model (HMM) is a generative sequence model [18]. Let x denote the input token sequence and y the output tag sequence. Generative models find the best tag sequence by computing the joint probability p(x, y). In an HMM, the conditional probability p(y|x) can be expressed through its generative form p(x, y) according to Bayes' rule:

$$p\left( {y|x} \right) = \frac{{p\left( {x,y} \right) }}{p\left( x \right)}$$
(1)

Assuming that the current tag y_i depends only on the previous tag y_{i-1}, and that the current token x_i depends only on the current tag y_i, p(x, y) becomes:

$$p(x,y) = \mathop \prod \limits_{i = 1}^{n} p(y_{i} |y_{i - 1} ) \cdot \mathop \prod \limits_{i = 1}^{n} p(x_{i} |y_{i} )$$
(2)

where n is the number of tokens in x. Because the objective is to find the tag sequence y that maximizes p(y|x), and p(x) is the same for every candidate tag sequence, it suffices to compare the joint probabilities p(x, y).
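
The search for the tag sequence that maximizes p(x, y) can be sketched in Python as follows. The sketch uses transition probabilities p(y_i | y_{i-1}) and emission probabilities p(x_i | y_i); the start symbol `<s>` standing in for y_0, the two-tag set and all probability values are assumptions chosen for illustration. Decoding is done with the standard Viterbi dynamic program.

```python
import math


def log_joint(tokens, tags, trans, emit, start="<s>"):
    """log p(x, y): sum of log transition and log emission probabilities."""
    total, prev = 0.0, start
    for x, y in zip(tokens, tags):
        total += math.log(trans[prev][y]) + math.log(emit[y][x])
        prev = y
    return total


def viterbi(tokens, states, trans, emit, start="<s>"):
    """Return the tag sequence with the highest joint probability."""
    # best[s] = (log-probability of the best path ending in s, that path)
    best = {
        s: (math.log(trans[start][s]) + math.log(emit[s][tokens[0]]), [s])
        for s in states
    }
    for x in tokens[1:]:
        new = {}
        for s in states:
            score, path = max(
                (best[p][0] + math.log(trans[p][s]), best[p][1]) for p in states
            )
            new[s] = (score + math.log(emit[s][x]), path + [s])
        best = new
    return max(best.values())[1]


# Toy model: "B" marks an entity token, "O" any other token (assumed values).
states = ["O", "B"]
trans = {
    "<s>": {"O": 0.8, "B": 0.2},
    "O": {"O": 0.7, "B": 0.3},
    "B": {"O": 0.9, "B": 0.1},
}
emit = {
    "O": {"the": 0.6, "p53": 0.05, "gene": 0.35},
    "B": {"the": 0.05, "p53": 0.9, "gene": 0.05},
}
tokens = ["the", "p53", "gene"]
print(viterbi(tokens, states, trans, emit))  # ['O', 'B', 'O']
```

The sketch assumes that every token has a non-zero emission probability under every tag; a real system would need smoothing for unseen words.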

Estimating p(x_i|y_i) in Eq. (2) directly leads to a data sparseness problem: sufficient training data would be required for every possible value of x_i. In practice, the training data are rarely sufficient to estimate these probabilities accurately when decoding a new corpus. This problem is often addressed with the Naïve Bayes assumption, which decomposes p(x_i|y_i) as follows:

$$p(x_{i} |y_{i} ) = \mathop \prod \limits_{j} p(f_{ij} |y_{i} )$$
(3)

where f_{ij} is the value of the j-th feature of x_i.
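
A minimal Python sketch of this decomposition is given below. Each token is represented by a list of feature values, and the log emission probability is the sum of the per-feature log-probabilities. The feature names, the probability table and the `floor` value used for unseen features are assumptions for illustration, not part of the original model description.

```python
import math


def log_emission(features, tag, feature_probs, floor=1e-6):
    """log p(x_i | y_i) as a sum of log p(f_ij | y_i) over the features.

    `floor` is an assumed fallback probability for features never seen
    with this tag; it is a crude form of smoothing.
    """
    return sum(math.log(feature_probs[tag].get(f, floor)) for f in features)


# Hypothetical per-tag feature probabilities.
feature_probs = {
    "B": {"has_digit": 0.5, "init_cap": 0.4},
    "O": {"has_digit": 0.05, "init_cap": 0.1},
}
x_i = ["has_digit", "init_cap"]
print(log_emission(x_i, "B", feature_probs) > log_emission(x_i, "O", feature_probs))  # True
```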

Even with this decomposition, HMMs suffer from two further limitations. The first stems from the Naïve Bayes assumption itself: NER benefits from a richer representation of observations in terms of many overlapping features, such as capitalization, affixes, part-of-speech (POS) tags and surface word features. These features depend on each other, which violates the Naïve Bayes assumption. The second problem is that an HMM sets its parameters to maximize the likelihood of the observation sequence, whereas the task is to predict the state sequence. In other words, an HMM inappropriately uses a generative joint model to solve a conditional problem [18].

2.3.2 CRFs based methods

Named entity recognition can be considered a sequence segmentation problem in which each word is a token in a sequence to be assigned a label. Conditional Random Fields (CRFs) are undirected statistical graphical models, a special case of which is a linear chain that corresponds to a conditionally trained finite state machine. They are widely applied in many areas, including computer vision [19], shallow parsing [20] and biomedical named entity recognition [21]. Several well-known tools, such as NERSuite and Gimli, are based on CRFs. The mathematical model can be described as follows.

Let x denote a random variable over data sequences to be labeled, and y a random variable over the corresponding label sequences. In an undirected graph G = (V, E), each node v ∈ V corresponds to a random variable representing an element y_v of y. (y, x) is a conditional random field when each random variable y_v, conditioned on x, obeys the Markov property with respect to G, that is, p(y_v | x, y_w, w ≠ v) = p(y_v | x, y_w, w ~ v), where w ~ v means that w and v are neighbors in G. For sequence modeling, the most common graph structure is one in which the nodes corresponding to elements of y form a simple first-order chain, as illustrated in Fig. 2.

Fig. 2 Graphical structure of a chain-structured CRF

A conditional model p(y|x), the probability of a particular label sequence y given the observation sequence x, can be defined as a normalized product of potential functions. Each potential function has the form

$${\text{exp}}\left( {\mathop \sum \limits_{j} \lambda_{j} t_{j} (y_{i - 1} ,y_{i} ,x,i) + \mathop \sum \limits_{k} \mu_{k} s_{k} (y_{i} ,x,i)} \right)$$
(4)

where t_j(y_{i-1}, y_i, x, i) is a transition feature function of the entire observation sequence and the labels at positions i and i − 1 in the label sequence, s_k(y_i, x, i) is a state feature function of the label at position i and the observation sequence, and λ_j and μ_k are parameters to be estimated from training data.

A set of real-valued features g(x, i) of the observation can be defined as feature functions in order to describe some characteristics of the empirical distribution of the training data. Below is an example:

$$g\left( {x,i} \right) = \left\{ {\begin{array}{*{20}l} 1 & {\text{if the observation at position } i \text{ has a given property}} \\ 0 & {\text{otherwise}} \\ \end{array} } \right.$$
(5)

When the current state (in the case of a state function) or the previous and current states (in the case of a transition function) take on particular values, the feature function takes the value 1. Writing s(y_i, x, i) = s(y_{i-1}, y_i, x, i), both the state functions and the transition functions t(y_{i-1}, y_i, x, i) can be denoted uniformly by f_j(y_{i-1}, y_i, x, i), so that F_j(y, x) can be defined as:

$$F_{j} \left( {y,x} \right) = \mathop \sum \limits_{i = 1}^{n} f_{j} \left( {y_{i - 1} ,y_{i} ,x,i} \right)$$
(6)

With the function F_j(y, x), the probability of a label sequence y given the observation sequence x can be expressed as:

$$p\left( {y |x,\lambda } \right) = \frac{1}{Z(x)}\exp \left( {\mathop \sum \limits_{j} \lambda_{j} F_{j} \left( {y,x} \right)} \right)$$
(7)

where Z(x) is a normalization factor.
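
To make Eq. (7) concrete, the following Python sketch computes p(y|x) for a tiny linear-chain CRF. The two feature functions, their weights and the start symbol `<s>` are assumptions chosen for illustration. The normalization factor Z(x) is obtained here by brute-force enumeration of all label sequences, which is only feasible for toy inputs; practical implementations compute it with the forward algorithm.

```python
import itertools
import math


def score(tags, tokens, feats, weights, start="<s>"):
    """Weighted sum of all feature functions over all positions."""
    total, prev = 0.0, start
    for i, y in enumerate(tags):
        for weight, f in zip(weights, feats):
            total += weight * f(prev, y, tokens, i)
        prev = y
    return total


def crf_prob(tags, tokens, labels, feats, weights):
    """p(y | x): exponentiated score divided by Z(x) (brute force)."""
    z = sum(
        math.exp(score(cand, tokens, feats, weights))
        for cand in itertools.product(labels, repeat=len(tokens))
    )
    return math.exp(score(tags, tokens, feats, weights)) / z


# Hypothetical feature functions f_j(y_{i-1}, y_i, x, i).
feats = [
    # State feature: the current label is "B" and the token contains a digit.
    lambda prev, y, x, i: 1.0 if y == "B" and any(c.isdigit() for c in x[i]) else 0.0,
    # Transition feature: "B" directly follows "B".
    lambda prev, y, x, i: 1.0 if prev == "B" and y == "B" else 0.0,
]
weights = [2.0, -1.0]
labels = ("O", "B")
tokens = ["the", "p53"]
print(crf_prob(("O", "B"), tokens, labels, feats, weights))
```

Because the features may inspect the whole observation sequence x, overlapping clues such as digits, capitalization and neighboring words can be added without any independence assumption.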

The conditional nature of CRFs is their main advantage: it relaxes the independence assumptions that HMMs require in order to ensure tractable inference. Moreover, a CRF is a discriminatively trained model for labeling and segmenting sequences, and it can combine arbitrary, overlapping and agglomerative observation features from both the past and the future. CRFs benefit from efficient training and decoding based on dynamic programming, and parameter estimation is guaranteed to find the global optimum.

3 Comparisons on the Bio-NER tools

To understand the current state and future development of Bio-NER tools, we introduce six representative tools and compare their performance: ABNER [22], LingPipe [23], BANNER [24], NERSuite [25], Gimli [26] and GENIA Tagger [27]. Table 1 presents an overview of their characteristics, supported corpora, feature sets, mathematical models and post-processing techniques. Because the functions of the six tools overlap, we aim to find out which tool performs best overall and which is suitable for a specific entity type such as DNA or RNA.

Table 1 Overview of the six Bio-NER tools

3.1 ABNER

A Biomedical Named Entity Recognizer (ABNER) is an open source tool for analyzing molecular biology text that recognizes DNA, RNA, protein, cell line and cell type entities. It employs conditional random fields with a variety of orthographic and contextual features. It has a graphical interface and contains two modules for tagging entities (e.g. protein and cell line) trained on standard corpora.

The tool is written in Java and uses graphical window objects from the Swing library. The CRFs are trained with a quasi-Newton method, L-BFGS, to find the optimal feature weights. For tokenization, ABNER employs a deterministic finite-state scanner built with the JLex tool. It provides a Java application programming interface that allows users to incorporate ABNER into their own systems and to train models on new corpora [22].
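
The kind of orthographic and contextual features used by CRF-based taggers can be sketched in Python as follows. This is not ABNER's actual feature set or API (ABNER itself is written in Java); the feature names and the helper functions `word_shape` and `token_features` are assumptions for illustration.

```python
import re


def word_shape(word):
    """Collapse characters into classes, e.g. 'IL-2' -> 'AA-0'."""
    shape = re.sub(r"[A-Z]", "A", word)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"[0-9]", "0", shape)


def token_features(tokens, i):
    """Return a set of string-valued features for tokens[i]."""
    word = tokens[i]
    feats = {
        "word=" + word.lower(),
        "shape=" + word_shape(word),
        "prefix3=" + word[:3],
        "suffix3=" + word[-3:],
        # Contextual features: neighboring words, with boundary markers.
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"),
        "next=" + (tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"),
    }
    if word[:1].isupper():
        feats.add("init_cap")
    if any(c.isdigit() for c in word):
        feats.add("has_digit")
    if "-" in word:
        feats.add("has_hyphen")
    return feats


print(sorted(token_features(["the", "IL-2", "gene"], 1)))
```

Each feature string would correspond to one state feature function of the CRF, paired with a candidate label.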

3.2 GENIA tagger

Optimized for biomedical text such as MEDLINE abstracts, GENIA Tagger works well in the biomedical domain. It is a good option for extracting information from biomedical documents because it is trained on three corpora: the Wall Street Journal corpus, the GENIA corpus and the PennBioIE corpus [28]. The developers apply a bidirectional algorithm and achieve performance comparable to that of other machine learning models. The algorithm finds the highest-probability sequence and the corresponding decomposition structure in polynomial time among all possible decomposition structures. The tagging output of GENIA Tagger covers five entity types: protein, DNA, RNA, cell line and cell type.

3.3 LingPipe

LingPipe is a toolkit for processing text using computational linguistics. It was originally used to find the names of people, organizations or locations in news. Its confidence-based chunkers are first-order hidden Markov models with emission probabilities estimated by character language models. Using a generalized form of best-first search over the lattice produced by the forward–backward algorithm, these chunkers can iterate over an arbitrary number of chunks in confidence-ranked order.

LingPipe’s architecture is efficient, scalable, reusable and robust. It provides a Java API with source code and unit tests, can handle multi-lingual, multi-domain and multi-genre corpora, and can mine new data for new tasks [23]. LingPipe includes a model trained on the GENETAG corpus [29], which makes it capable of recognizing named entities such as proteins and genes in biomedical text.

3.4 BANNER

BANNER is implemented in Java and based on a CRF model. It is designed to maximize domain independence by employing neither brittle semantic features nor rule-based models [24]. BANNER’s processing can be divided into three steps, depicted in Fig. 3: raw sentences are tokenized, converted to features, and labeled. The Dragon toolkit [30] and Mallet [31] are used for part of the implementation. The stream of tokens is converted to features, each of which is a name-value pair. All of the information about a token is encapsulated in its set of features. The stream of features is then labeled so that each token receives the corresponding label. BANNER can extract gene and protein entities from molecular biology text efficiently.

Fig. 3 BANNER’s flowchart [24]

3.5 NERSuite

NERSuite is designed as a pipelined system to facilitate research experiments using various combinations of different NLP applications [25]. It is written in C++ and contains a tokenizer, a modified version of the GENIA tagger and a named entity recognizer, each of which is an independent module. Given a document file in the sentence-bag model (each sentence as a vector), NERSuite first splits each sentence into tokens and computes the position of each token. The modified GENIA tagger then performs POS tagging, lemmatization and chunking. Finally, with a pre-trained model [25] or a user-trained model, NERSuite can process biomedical text containing DNA, RNA, protein, cell line or cell type entities. Figure 4 shows the flowchart of NERSuite.

Fig. 4 NERSuite’s flowchart [25]

3.6 Gimli

Gimli provides a trained and optimized model for recognizing biomedical entities such as DNA, RNA, protein, cell line and cell type in scientific text. It is implemented in Java and can be used as a command line tool. It offers rich functionality, including training new models, customizing the feature set and adjusting parameters through a configuration file. Gimli takes advantage of various publicly available tools and resources. The CRF implementation is provided by MALLET. GDep is employed for tokenization and linguistic processing, i.e. lemmatization, POS tagging, chunking and dependency parsing [32]. In terms of lexical resources, it adopts BioThesaurus and BioLexicon as sources of biomedical domain terms [33]. Gimli has recently become a state-of-the-art solution for biomedical NER, contributing to faster and better research results. Figure 5 shows the flowchart of Gimli, presenting the workflow of the required steps, tools and external resources [26].

Fig. 5 Gimli’s flowchart [26]

4 Evaluation of the six tools

4.1 Datasets

To compare the Bio-NER tools in detail, we select GENETAG [29] and JNLPBA [34] to evaluate the six tools. These two benchmark datasets are the most widely used in biomedical named entity recognition [26, 29, 34, 35].

4.1.1 GENETAG

GENETAG is composed of 20,000 sentences extracted from MEDLINE abstracts and is not focused on any specific domain [26]. It contains annotations of proteins, DNAs and RNAs (grouped into a single semantic type), provided by experts in biochemistry, genetics and molecular biology. This corpus was used in the BioCreative II gene mention challenge [35], providing 15,000 sentences for training and 5000 sentences for testing.

4.1.2 JNLPBA

The JNLPBA corpus contains 2404 abstracts extracted from MEDLINE using MeSH terms such as “human”, “blood cell” and “transcription factor”. The manual annotation was based on five classes of the GENIA ontology, namely protein, DNA, RNA, cell line and cell type. It was used in BioNLP/NLPBA 2004 [34], providing 2000 abstracts for training and the remaining 404 abstracts for testing.

4.2 Evaluation measures

The evaluation was performed by comparing the output of the six tools in terms of precision (p), recall (r) and their harmonic mean, the F-measure. These measures are based on the numbers of true positives (TP), false positives (FP) and false negatives (FN) returned by the system:

$$p = \frac{TP}{TP + FP},\quad r = \frac{TP}{TP + FN},\quad F = \frac{2pr}{p + r}$$

TP denotes the number of correctly found named entity chunks; FP denotes the number of found named entity chunks that do not exist in the corpus; FN denotes the number of named entity chunks in the corpus that are not found by the Bio-NER tools.
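
These measures can be computed with a few lines of Python. The sketch below assumes exact-match scoring, in which a predicted chunk counts as correct only if its boundaries and type both agree with a gold chunk; the matching criterion, the `(start, end, type)` chunk representation and the function name `evaluate` are assumptions made for illustration.

```python
def evaluate(gold, predicted):
    """Return (precision, recall, F-measure) for two collections of chunks.

    Chunks are (start, end, type) tuples and are compared by exact match.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # chunks found and present in the corpus
    fp = len(predicted - gold)   # chunks found but absent from the corpus
    fn = len(gold - predicted)   # chunks in the corpus that were missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = [(0, 2, "protein"), (5, 6, "DNA"), (8, 9, "RNA")]
predicted = [(0, 2, "protein"), (5, 7, "DNA")]
p, r, f = evaluate(gold, predicted)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.333 0.4
```

In the example, the second predicted chunk has a wrong right boundary and therefore counts both as a false positive and, for the missed gold chunk, as a false negative.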

4.3 Performance analysis

GENETAG has more heterogeneous annotations than JNLPBA and is not focused on any specific biomedical domain. To find which system is more appropriate for wide application in biology and medicine, we first conduct our experiment on GENETAG. The results are presented in Table 2, with the best results in boldface. Among the compared systems, Dingare et al. combined a Maximum Entropy Markov Model with a limited-memory quasi-Newton maximizer to perform the named entity task. Moreover, they take advantage of Google web-querying, the TnT POS tagger, a gazetteer and other external resources to improve the overall performance of their system [36].

Table 2 Results obtained on GENETAG corpus

To sum up, Gimli performs better than the other five tools on the GENETAG corpus, achieving a precision of 90.22 % and an F-measure of 87.17 %, improvements of 1.56 and 0.74 percentage points, respectively, over the second best tool, BANNER. Compared with NERSuite, Gimli's F-measure is 1.72 percentage points higher. Although LingPipe ranks fifth among the six systems overall, it achieves the highest recall, 88.49 %. As Table 2 shows, LingPipe is built on an HMM, whereas the other systems, except Dingare’s tool, are built on CRFs. If the HMM method could be modified to improve LingPipe’s precision, it would become an effective method.

Compared with GENETAG, JNLPBA focuses on a more specific biomedical domain, so a model trained on the JNLPBA corpus may provide annotations optimized for research on human blood cell transcription factors. JNLPBA also splits entity types into different semantic groups. Because LingPipe and BANNER do not support the JNLPBA corpus, Table 3 compares the results of ABNER, NERSuite, Gimli and GENIA Tagger, with the best results in boldface. Instead of supervised machine learning, Zhang et al. tackle the problem with a stepwise unsupervised solution. Their approach is independent of hand-built rules and examples of annotated entities, which makes it easy to adapt their system to different semantic categories and text genres [37].

Table 3 Results obtained from JNLPBA corpus

Except for Zhang’s tool, the overall performance of the other five tools is similar. Over all five categories of biomedical named entities, Gimli achieves the highest F-measure, which is 0.86 and 1.16 % higher than those of the other two tools in the top three. For “Cell Type”, GENIA Tagger is 2.2 and 2.31 % higher than NERSuite and ABNER, respectively. The rest all perform poorly, with none higher than 60 %. Moreover, for “DNA” and “RNA” named entities, Gimli and Dingare’s tool perform best, respectively. However, the performance of all systems ranges from 55 to 70 %.

The difference between NERSuite and Gimli lies in the “Protein” and “Cell Line” categories, where Gimli achieves 1.94 and 2.53 % higher performance than NERSuite, respectively. Although Zhang’s tool does not perform well on the JNLPBA corpus, it exceeds the other tools in generalization, which makes it a good option for dealing with diverse groups of text.

Thanks to its complex solution, which applies linguistic and lexicon features and combines several CRF models, Gimli again outperforms the other five tools and achieves the highest overall performance. None of the six tools performs well in recognizing “Cell Line” names: none of them achieves a result over 60 %, and this weak category evidently lowers the overall performance.

4.4 Speed analysis

Besides the F-measure, we regard processing speed as a critical evaluation criterion, considering the rapid growth of recently published literature. We recorded the time each tool took to tag 5000 sentences on a machine with two processing cores @ 3.20 GHz and 8 GB of RAM running Linux. The speed of each tool is given in Table 4. In our experiment, ABNER is the fastest system; however, it performs worst on the GENETAG corpus. BANNER is not as fast as ABNER, but it ranks first when speed and F-measure are considered together, so it is a suitable option for biomedical named entity recognition. Gimli obtains the highest F-measure on both corpora, but it is the slowest. Clearly, the gain in F-measure comes at the cost of a more sophisticated software framework and longer processing time.

Table 4 Tagging speed for each tool

In recent years, given the unconstrained growth of publications and biomedical databases, researchers have been seeking efficient methods to process and extract this enormous amount of information. Even if a high F-measure is not required and ABNER is used for analysis, the tagging speed is still relatively slow when it comes to terabytes of data, let alone that of the other tools. To keep pace with the growth of the literature, Tang et al. proposed a CRF-based parallel biomedical named entity recognition algorithm employing the MapReduce framework, which reduces the model training time for large-scale training samples [38]. Li et al. also developed a parallel CRF algorithm called MRCRF (MapReduce CRF) containing two parallel sub-algorithms, MRLB (MapReduce L-BFGS) and MRVtb (MapReduce Viterbi). They improve performance significantly without a substantial loss of correctness [39].
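
The general idea behind parallel tagging can be sketched in Python: sentences are independent of each other at decoding time, so a corpus can be split into shards that are tagged concurrently and then merged. The sketch below is not the algorithm of [38] or [39] and does not use MapReduce; the function names, the shard count and the placeholder tagger are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def tag_shard(sentences, tagger):
    """Map step: tag every sentence in one shard."""
    return [tagger(s) for s in sentences]


def parallel_tag(sentences, tagger, n_shards=4):
    """Split the corpus into shards, tag them concurrently, merge in order."""
    size = max(1, -(-len(sentences) // n_shards))  # ceiling division
    shards = [sentences[k:k + size] for k in range(0, len(sentences), size)]
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        results = list(pool.map(lambda shard: tag_shard(shard, tagger), shards))
    # Reduce step: concatenate the shard outputs in their original order.
    return [tagged for shard in results for tagged in shard]


def toy_tagger(sentence):
    """Placeholder tagger that labels every token as 'O'."""
    return [(token, "O") for token in sentence.split()]


corpus = ["p53 binds DNA", "IL-2 is a cytokine", "no entity here"]
print(parallel_tag(corpus, toy_tagger, n_shards=2))
```

Threads are used here only to keep the sketch self-contained. For CPU-bound taggers, separate processes or a cluster framework would be needed to obtain a real speed-up.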

5 Conclusion

In this paper, we present a review of research on biomedical named entity recognition, focusing on the three fundamental models and six open source Bio-NER tools. Identifying and classifying the named entities in biomedical literature is clearly an extremely sophisticated task. With the help of machine learning algorithms, we can now achieve results much better than those of dictionary-based or rule-based methods. Gimli has become one of the state-of-the-art solutions for biomedical named entity recognition. For different applications, however, users still need to consider different tools. Our suggestions are as follows:

(a) For overall performance on common biomedical corpora, BANNER and Gimli achieve satisfactory results;

(b) To find “Cell Type” entities accurately, GENIA Tagger or NERSuite will perform well;

(c) For the discovery of “DNA” and “RNA” names, NERSuite and Gimli can be chosen;

(d) For “Protein” named entity recognition, Gimli is the candidate of choice;

(e) Dingare’s tool is suitable for “Cell Line” recognition;

(f) BANNER can perform the NER task accurately at a relatively fast speed;

(g) To cope with huge amounts of data, more parallel algorithms are needed.

It is not surprising that Gimli achieves the best results, since it is the most recently proposed tool (see Table 1) and integrates more modules than the other tools.

We also identify some issues on which future research is likely to concentrate. Most current tools are developed with a focus on two corpora, GENETAG and JNLPBA. These tools may be tuned or modified to achieve better results on the two datasets, yet they may perform badly in real applications. It would be much better to study generalization ability or to update the benchmark corpora continually. In addition, our experiments show that machine learning methods integrated with biomedical dictionaries (such as NERSuite and Gimli) may outperform those without. However, existing dictionaries may not fit well with the machine learning methods on real biomedical data. A successful Bio-NER tool demands newly compiled biomedical dictionaries covering different research areas, although compiling them is time-consuming and costly.