1 Introduction

Biomedical named entity recognition (NER) is a key component of biomedical text mining; it automatically recognizes and extracts biomedical entities (e.g., genes, proteins, chemicals, and diseases) from text. It also serves as a primary step for higher-level tasks such as relation extraction and knowledge base completion. NER in biomedical text is more complex and inherently more challenging than in other domains for the following reasons:

  • Long wordforms Biomedical NEs are, in general, very long and complex. For example: “myeloid lineage-associated cytokine receptors”

  • Nested wordforms Biomedical names have nested structures, such as:

    <RNA><DNA>CIITA</DNA> mRNA</RNA> Here, “CIITA” represents DNA and the entire string “CIITA mRNA” represents RNA.

  • Symbols and punctuation The presence of symbols and punctuation inside named entities (NEs) complicates their identification. For example, “5-lipoxygenase”, “CD28”, “INTERLEUKIN-2(IL-2)”.

  • Inconsistent naming convention Owing to the lack of a standard biomedical nomenclature system, a biological entity can have several name variations. For example, “IL-2” can appear in several wordforms such as “IL2”, “Interleukin 2”, or “interleukin-2”.

  • Ambiguity The same surface form, with identical orthographic properties, can belong to different classes. For example, “IL-2” can be a protein as well as a DNA depending upon the context.

With the availability of annotated data, significant advances have been observed in data-driven (supervised learning) techniques for biomedical named entity recognition (BNER). Generally, the success of supervised systems depends primarily on the volume of annotated data and the feature set utilized. In the past, different classification models have been developed on large sets of features after studying the properties of the data. Moreover, without prior knowledge, it is difficult to determine the usefulness of each feature. A large number of relevant, irrelevant, and redundant features in the dataset increases the search space, which eventually leads to the problem of “the curse of dimensionality” [18].

Feature selection is a useful preprocessing step that helps improve the performance of a classification system by selecting an optimal subset of features [42] and eliminating or reducing irrelevant and redundant features. In general, there are two standard approaches to feature selection, viz. filter and wrapper [42]. In filter-based methods, the assessment of a feature subset is carried out using statistical measures [38, 53] without involving any learning/classification algorithm. In contrast, wrapper-based methods require a learning algorithm a priori to explore the feature space [38].

The existence of complex interactions among features makes feature selection more challenging. Some features that are relevant to one class can be irrelevant to another. Therefore, an optimal feature subset should be a group of complementary features that span the diverse properties of the classes. To address this problem, a variety of feature selection techniques have been developed, such as sequential forward selection [29], sequential backward elimination [29], and recursive feature elimination (RFE) [22]. However, these approaches suffer from the problem of getting stuck in local optima. Some works in the literature [58, 59] show that evolutionary computation techniques can achieve notable performance owing to their global search ability.

Motivated by the strength of evolutionary algorithms, in this paper we propose a generic feature selection technique based on particle swarm optimization (PSO) [10] to assist BNER models in the supervised learning framework. PSO is a popular bionic algorithm in which a set of randomly generated solutions moves through the search space to obtain the optimal solution. The working principle of PSO is simple (it converges quickly), and it requires far fewer parameters than other optimization techniques such as evolutionary algorithms [9, 12].

The proposed approach operates in three steps. First, we define goodness measures using concepts from information theory, such as normalized mutual information [53], entropy [7], and mutual information [38], to quantify the quality of a feature subset. The formulation of the objective functions is based on the idea of exploring the statistical properties of the feature subset. In the second step, the search capability of PSO is exploited to select the best feature subset under the defined objective functions. Finally, a popular sequence learning technique, namely the conditional random field (CRF) [32], is used to build a model with the set of optimized features obtained through the PSO-based approach. The efficacy of the proposed approach is reported on four biomedical corpora, namely GENIA, GENETAG, AIMed, and BC-II. Experimental results show that we obtain reasonable accuracy with a pruned feature set. The models developed using this reduced feature set are less complex and hence can be used for developing real-time applications. Our evaluation shows that, compared to the conventional baseline, PSO-based feature selection yields improvements of 5.56, 2.73, 1.98, and 2.93 F score points on the GENIA, GENETAG, AIMed, and BC-II datasets, respectively. The analysis reveals that the objective function based on normalized mutual information is the most effective in identifying the optimal feature set.

We summarize below the main contributions of our work:

  • Development of an efficient feature selection technique based on the PSO framework to assist the BNER task.

  • Formulation of seven distinct information theoretic objective measures to exploit the statistical properties of a feature subset.

  • Evaluation of the proposed PSO-based feature selection technique on four benchmark corpora, namely GENIA, GENETAG, AIMed, and BC-II, for the BNER task and its comparison with state-of-the-art techniques.

The remainder of the paper is structured as follows: Sect. 2 sheds light on recent work on NER for biomedical text. In Sect. 3, we describe the proposed approach in detail. Section 4 studies the computational complexity of the system. Section 5 reports experimental results and analysis. In Sect. 6, we present a comparative analysis with state-of-the-art techniques. Error analysis is presented in Sect. 7. Finally, we conclude the paper in Sect. 8.

2 Related works

The potential applicability and importance of NER have drawn the interest of many researchers in recognizing different biomedical entities, such as gene- and protein-related names, from the available biomedical text, including several shared task challenges [26, 48]. In general, we can categorize the existing approaches into three classes:

2.1 Lexicon or dictionary-based NER approach

A dictionary-based approach is viewed as a simple and naive technique for extracting entity mentions from text. Existing studies show that it achieves a high degree of precision with very low recall. The system developed by Friedrich et al. [17] utilizes dictionary-based features that assign token classes based on the presence of a token in the dictionary. The low recall value is attributed to spelling mistakes and word- and character-level variations. The major problem with using dictionaries is that it is infeasible to maintain a huge list of entities, and furthermore biomedical resources grow continuously. The low recall and the other problems of dictionary-based methods have led to several enhancements of these approaches. Some methods have enhanced the traditional dictionary-based technique by exploiting inexact matching, where the extended dictionary is used further for exact matching against text [24]. Despite these efforts, relying on a dictionary reduces the scope of the system on any new entity.

2.2 Rule-based NER approach

Here, rules are defined to capture entities following certain patterns and contexts of named entities. Rule-based methods are shown to perform better than dictionary-based approaches. However, it is both tedious and difficult to frame every possible rule because they are handcrafted. AbGene [49] is one of the most successful rule-based NER systems for identifying gene and protein names from biomedical literature, with a precision of \(87.00\%\) and recall of \(67.00\%\) on 56,000 Medline abstracts. It used the Brill POS tagger trained on 7000 manually tagged biomedical sentences. Further postprocessing was performed in the form of hand-generated rules based on lexical-statistical properties. However, the system suffers from some limitations; for instance, it can miss single-word gene names that occur without a contextual gene term. Moreover, the authors noted that the heuristics for detecting invalid combinations of compound terms are not perfect. The EDGAR [41] system extracts cancer-relevant drug and gene information from the biomedical literature. It takes input from a parser that uses semantic and syntactic information from the Unified Medical Language System (UMLS) Metathesaurus [5] to extract factual assertions from the text. Further, a stochastic POS tagger was used to enhance the syntactic parser. This approach has limited portability and as such fails to give comparable performance on other domains. Another popular tool, MetaMap [2], developed by the National Library of Medicine (NLM), discovers UMLS Metathesaurus concepts reflected in the text with a precision of \(85.00\%\) and recall of \(78.00\%\). Despite this high precision, the system fails drastically on ambiguous cases. ProMiner [24] is also a rule-based system that extracts multiword names. It explores a preprocessed synonym dictionary to extract potential name occurrences from biomedical text and associates protein and gene database identifiers with the extracted matches. The system was evaluated on the BioCreAtIvE challenge dataset and obtained an F score of \(80.00\%\) on the mouse and fly datasets; on the yeast dataset, it achieved an F score of \(90.00\%\). Bedmar et al. [45] performed drug name recognition and classification in biomedical texts. They utilized information from the UMLS-MetaMap Transfer (MMTx) program and nomenclature rules approved by the World Health Organization (WHO) to extract and classify pharmaceutical substances. The system also suffers from the ambiguity issue. The analysis revealed that the system was unable to capture genes from different organisms present in one abstract. Moreover, the system was unable to disambiguate because of missing synonyms.

2.3 Supervised machine learning-based approach

With the availability of annotated data, major advances have been seen in data-driven methods for identifying biomedical named entities (NEs). The release of biomedical benchmark corpora such as GENIA (derived from MEDLINE) has led to rapid advancement in biomedical text mining. Toward this end, several supervised machine learning models such as hidden Markov models (HMM) [3], conditional random fields (CRF) [32], and support vector machines (SVM) [6] have been exploited. Several shared task challenges, such as BioNLP/NLPBA 2004, also used the GENIA dataset for entity extraction, where a total of 9 systems were submitted.

Some of the popular state-of-the-art systems on this dataset include [19, 36, 46], where SVM and CRF were used as the base classifiers. In general, CRF is a popular classifier for sequence labeling tasks such as named entity recognition [33, 61]. Zhou et al. [66] trained an HMM on a feature set comprising lexical and syntactic features (i.e., prefix, suffix) and a word normalization feature on the GENIA dataset. Further, they included SVM to address the data sparsity problem. Besides the lexical and syntactic features, they exploited alias and cascaded entities by utilizing a closed dictionary from the training corpus, an open dictionary from SwissProt, and the alias list from LocusLink. Their system reported an F score of \(72.55\%\). The model relies heavily on several postprocessing operations and domain-specific dictionary features, which are not generic enough to carry over to other biomedical corpora. The system proposed by Settles [46] was also evaluated on the BioNLP/NLPBA 2004 shared task (GENIA) dataset. It was trained with CRF on an orthographic feature set plus semantic domain knowledge in the form of lexicons, and achieved an F score of \(69.50\%\). This work concluded that the orthographic feature set is highly effective in capturing entities. However, the system fails to correctly predict RNA and cell line-type entities; most misclassifications were due to low-frequency terms in the corpus. The system proposed by [39] used an HMM-based model with only Part-of-Speech (PoS) information as the feature and achieved an F score of \(65.70\%\). The main drawback of this system is that it depends solely on the PoS feature, which cannot capture all the lexical and semantic variations of biomedical text. Furthermore, the HMM model suffers from the label bias problem. Kim et al. [27] proposed a two-phase system consisting of term boundary detection and semantic labeling. They used CRF and a maximum entropy classifier for boundary detection and semantic labeling, respectively. Moreover, they used a finite state method to define postprocessing rules that refine their framework. The system reported a \(71.19\%\) F score without domain-specific knowledge such as gazetteers or an abbreviation handling process. Finkel et al. [15] proposed an NER system on the GENIA dataset. They used maximum entropy (ME) as the base classifier and achieved an F score of \(70.06\%\). However, the system made use of a number of external resources, including gazetteers, web querying, the surrounding abstract, abbreviation handling, and frequency counts from the BNC corpus. Finkel et al. [16] further explored the GENETAG dataset for BNER and reported an F score of \(82.2\%\). McDonald et al. [35] employed an orthographic feature set along with gene and biological term lexicons. They achieved a precision of \(86.40\%\) and recall of \(78.70\%\), but the system suffers from the boundary detection problem. Kinoshita et al. [28] proposed a system that achieved an F score of \(80.90\%\) with dictionary-based preprocessing and an HMM-based PoS tagger. The SVM-based system [36] utilized a gene/protein name dictionary as domain knowledge and reported an F score of \(78.09\%\). These systems rely heavily on external knowledge sources and fail to show reasonable performance when tested on other biomedical domains. An ensemble method developed in [55] made use of HMM and SVM and reported an F score of \(82.58\%\).
This system also utilized several external resources in the form of gazetteers to provide evidential clues. Some feature selection-based BNER models are available in [11, 44]. Recently, the authors of [43] proposed a genetic algorithm (GA)-based classifier ensemble technique for identifying biomedical entities from text. They employed CRF and SVM as classifiers while varying the feature combinations; the resulting models were then combined using the GA-based classifier ensemble technique in an efficient way. They evaluated their model on the JNLPBA 2004 and GENETAG datasets and obtained F score values of \(75.97\%\) and \(95.90\%\), respectively. However, the system is highly complex in both time and space when compared to PSO-based feature selection.

Danger et al. [8] studied the protein interaction corpus to extract 12 protein-relevant entities. The system is divided into two sub-modules: a dictionary lookup that searches the text for entities that can be associated with a relatively stable set of interaction terms, and a second module that utilizes a CRF classifier to find entities that cannot be described through a dictionary. They also evaluated their system on the JNLPBA 2004 corpus, reporting an F score of \(76.13\%\). The system relies heavily on dictionary-based domain-specific features and is thus highly domain dependent. Moreover, the system was found to confuse RNA-protein and cell line-cell type entity pairs because of the boundary detection problem. The system also fails to capture an entity when it strongly overlaps with other entities.

In the study conducted by [64], a stepwise unsupervised solution was proposed to perform entity extraction. The authors utilized the UMLS Metathesaurus to collect seed terms for each target entity class by their semantic types, semantic groups, or specific concepts. They then detected boundaries by leveraging the concept of noun phrases. In the final step, all identified candidate entities were provided as input to a classifier to predict their semantic category. They evaluated their system on the i2b2 and GENIA datasets with micro F scores of \(53.1\%\) and \(39.5\%\), respectively. The major limitation of this system is the requirement of a large collection of test data in order to generate signatures or semantic groups.

PSO-based feature selection has also been popular in other domains like sentiment analysis [21], face recognition [40], spam detection [65], etc.

Neural network-based approach In recent years, neural network models have gained popularity for solving problems in several domains [14, 20, 62, 63]. Recently, some studies have been conducted to explore deep neural network methods for entity extraction from biomedical corpora and clinical text (electronic medical records) [30, 57]. These approaches reduce the need for manual feature engineering. The studies performed by Tang et al. [50, 51] and Yadav et al. [60] show that adding word representation features to the traditional feature set can help improve system performance, and such systems were observed to be more effective than machine learning-based BNER techniques.

3 Proposed approach

In this section, we present our proposed technique for feature selection, which utilizes information theoretic measures and PSO. We begin by formulating the feature selection problem. We then discuss the feature engineering phase and the proposed information theoretic objective functions. Finally, we present the details of our PSO-based feature selection algorithm.

3.1 Problem formulation

Given a set of features of size n, \({\mathscr {F}} = \{f_1, f_2, \ldots , f_n\}\), a classifier C, and a classification performance metric M, the feature selection problem states:

“Extract the optimal feature subset \({\mathscr {F}}^{'} \subseteq {\mathscr {F}}\) such that the classification performance metric M is maximized for classifier C.”
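For concreteness, a candidate solution in this setting can be viewed as a binary mask over \({\mathscr {F}}\), as in the minimal Python sketch below (identifiers are illustrative only, not part of the proposed system):

```python
from typing import Sequence

def select_features(features: Sequence[str], mask: Sequence[int]) -> list:
    """Return the feature subset F' encoded by a 0/1 mask over F."""
    return [f for f, keep in zip(features, mask) if keep == 1]

# Example: with F = {f1, ..., f5}, the mask 1-0-1-1-0 encodes F' = {f1, f3, f4}.
print(select_features(["f1", "f2", "f3", "f4", "f5"], [1, 0, 1, 1, 0]))
```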

3.2 Feature engineering for BNER

We design diverse sets of features by analyzing biomedical texts. Descriptions of these features are presented below:

  • Local contextual feature: Contextual information provides an important clue for identifying the class of the current word from its vicinity. For example, the words ‘receptor’, ‘factor’, and ‘protein’ occurring in the local context provide evidence for the protein class, while ‘gene’, ‘promoter’, and ‘motif’ are clues for classifying DNA.

    Context can be represented as \(w_{i-1}^{i+1} = w_{i-1} \ldots w_{i+1}\), where \(w_i\) denotes the focus word. Here, we capture the contextual information from \(w_{i-5}^{i+5}\) (i.e., the preceding 5 and succeeding 5 words).

  • Word affixes: Functional affixes indicate the syntactic function of a word and provide very important clues for terminology identification. We use word affixes of length four as features.

    For example, for the term ‘receptor’, the prefix is ‘rece’ and the suffix is ‘ptor’.

  • Part-of-Speech (PoS) information: Most biomedical NEs are noun phrases, so capturing Part-of-Speech (PoS) information (extracted using the tool in Footnote 1) provides important evidence for identifying NEs such as ‘IL-2’, ‘NF-Kappa’, and ‘v-Abl’. Here, we use the PoS information of the current token as the feature.

  • Chunk-type information: This feature helps in properly identifying the boundary of the entity. We use the chunk information of the current and/or surrounding token(s). We use GENIA tagger v2.0.2 to extract the chunk information.

  • Dynamic NE tag(s): This feature helps in capturing the intermediate tokens of an NE phrase by providing better evidence that the current token is an intermediate NE token. It represents the output tags \(t_{i-3}t_{i-2}t_{i-1}\) of the words \(w_{i-3}w_{i-2}w_{i-1}\) preceding \(w_i\) in the sequence \(w_1^n\). Here, we use a bi-gram template that takes the union of the present and preceding output classes.

  • Word normalization: We form two different variants of this feature. The first variant captures words with plural forms, alphanumeric characters, digits, hyphens, and verbs, which enables transforming a word to its root form. The second variant specifies the orthographic construction of the current token and is defined as the word shape feature. For each word, the normalized form is obtained by converting uppercase letters to ‘A’, lowercase letters to ‘a’, and digits to ‘0’. For example, if the token is ‘NF-kappa-10’, the normalized form is ‘AA-aaaaa-00’. This is further squeezed to produce the word form ‘A-a-0’ (a small sketch of this normalization appears after this feature list).

  • Word length: It is observed that NEs are generally longer than other words in the text. This is based on the assumption that short words carry less information; for example, ‘and’, ‘of’, and ‘for’ do not add any meaning during training. This binary feature is set to 1 when the length of the word is \(>4\); otherwise, the feature value is 0.

  • Infrequent word: Words that occur more frequently have a lower probability of occurring as NEs. We design a binary feature that is set to 1 when the frequency of occurrence of the current word is greater than 10; otherwise, the feature value is 0.

  • Head noun: The head noun is the noun phrase that describes the functional property of the NE and provides useful information for classification compared to the remaining NE words. We created a lexicon of the 912 head nouns occurring most frequently in the training dataset. This binary feature takes the value 1 or 0, depending on whether the word is present (1) or absent (0) in the lexicon. We list some examples of head nouns in Table 1.

  • Verb trigger: Certain action verbs (e.g., inhibit, binds, interact) help in recognizing the presence of NEs. We created a lexicon of such verbs from the training dataset and define a binary-valued feature that checks whether the current word appears in this list.

  • Informative NE information: Biomedical NEs usually have longer wordforms that often contain many common words which do not actually belong to the NEs, for example, nominals and function words that appear frequently in the training dataset but are usually not effective in capturing the NEs. Generally, biomedical NEs contain common symbols, punctuation, common words, and function words, and are often long. Considering this, we developed a list of multiword NEs. We eliminated digits and symbols from this list as they provide very little clue for identifying NEs.

    A weight \(NE_{wt}(t_{i})\) is defined to identify the important words in NEs and is calculated as follows:

    $$\begin{aligned} NE_{wt}(t_{i}) = \frac{\text {Frequency of}~t_{i}~\text {as part of NE}}{\text {Frequency of}~t_{i}~\text {in the training set}} \end{aligned}$$
    (1)

    Finally, the most effective words were selected on the basis of two parameters: the NE weight and the frequency of words. We choose the threshold values of these two parameters by performing experiments on the validation set. We adopt a strategy similar to that proposed by Yadav et al. [58] to define the 5 distinct classes.

  • Orthographic features: We define various orthographic features based on word construction. It is observed that special characters such as ‘,’, ‘-’, ‘.’, and ‘_’ are prominent in biomedical NEs. Therefore, these symbols provide an important clue for identifying NEs.

    We define several features such as InitialCapital, DigitAlpha, DigitOnly, Hyphen, CapsAndDigit, StopWord, AlphaDigit, AllCapitals, CapitalMixAlphabet, AlphaDigitAlpha, and LowMixAlphabet. We provide examples of the orthographic features in Table 2; a small sketch of a few of these features, together with the word shape and NE weight computations, follows this list.

Table 1 Examples of the head nouns
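The following Python sketch (our own illustration, not the authors' implementation; the flag set and regular expressions are assumptions) shows how the word shape feature, a few of the orthographic flags, and the NE weight of Eq. (1) could be computed:

```python
import re
from collections import Counter

def word_shape(token: str):
    """Normalized and squeezed shapes, e.g. 'NF-kappa-10' -> ('AA-aaaaa-00', 'A-a-0')."""
    full = re.sub(r"[A-Z]", "A", re.sub(r"[a-z]", "a", re.sub(r"[0-9]", "0", token)))
    squeezed = re.sub(r"(.)\1+", r"\1", full)        # collapse runs of identical shape symbols
    return full, squeezed

def orthographic_flags(token: str):
    """A few of the binary orthographic features listed above (illustrative definitions)."""
    return {
        "InitialCapital": int(token[:1].isupper()),
        "AllCapitals":    int(token.isalpha() and token.isupper()),
        "DigitOnly":      int(token.isdigit()),
        "AlphaDigit":     int(bool(re.fullmatch(r"[A-Za-z]+\d+", token))),
        "Hyphen":         int("-" in token),
        "CapsAndDigit":   int(any(c.isupper() for c in token) and any(c.isdigit() for c in token)),
    }

def ne_weight(token: str, ne_counts: Counter, corpus_counts: Counter) -> float:
    """Eq. (1): frequency of the token as part of an NE / frequency in the training set."""
    return ne_counts[token] / corpus_counts[token] if corpus_counts[token] else 0.0

print(word_shape("NF-kappa-10"))        # ('AA-aaaaa-00', 'A-a-0')
print(orthographic_flags("IL-2"))
print(ne_weight("receptor", Counter({"receptor": 40}), Counter({"receptor": 50})))   # 0.8
```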

3.3 Proposed objective functions

This section discusses the various objective functions that we propose based on information theoretic measures such as entropy, mutual information, correlation, and information gain. In total, we define seven objective functions, which prove to be very useful for feature selection. A Python sketch of two representative objective functions is given after the list below.

  • Objective function-1 We exploit a correlation coefficient-based informative measure to derive an objective function. It follows the property: “good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other” [23]. Correlation-based Feature Selection (CFS) evaluates a feature subset by assessing the predictive ability of each feature individually and also its degree of redundancy. Moreover, it introduces a ‘heuristic merit’ for a feature subset instead of treating each feature independently [56]. Considering this, we compute our first objective function as:

    $$\begin{aligned} f_1(.)= \frac{\sum _{i=1}^M \rho _{ic}}{\sum _{i=1}^M \sum _{j=i+1}^M \rho _{ij}} \end{aligned}$$
    (2)

    where \(\rho _{ic}\) is the correlation coefficient between feature i and the class label c, and \(\rho _{ij}\) is the correlation coefficient between features i and j.

  • Objective function-2 Motivated by the study of Hall [23], we define an objective function utilizing the concept of correlation between the features and the classes. Features having low correlation values with the class are irrelevant and hence can be ignored. We compute objective function 2 as follows:

    $$\begin{aligned} f_2(.)= \frac{k\overline{r_{cf}}}{\sqrt{k+k(k-1)\overline{r_{ff}}}} \end{aligned}$$
    (3)

    where \(f_2(.)\) denotes the heuristic merit value, k represents the number of features in the selected feature subset S, \(\overline{r_{cf}}\) represents the average correlation between the features and the class, and \(\overline{r_{ff}}\) denotes the average correlation between two features. In the above equation, the numerator indicates how well a subset of features predicts the class, while the denominator quantifies the amount of redundancy among the features.

    This function chooses the prominent set of features that provide enough clues to distinguish one class from another. It repeatedly draws a subset of features and, based on their neighbors, assigns a higher weight to a feature that helps discriminate it from the neighbors of a different class. This helps in determining a (near-)optimal feature subset.

  • Objective function-3 Let the actual feature set be represented by F. A feature is considered selected if its value in the solution vector is predicted as 1 by the feature selection approach; otherwise, the feature is treated as not selected. In this way, we categorize the features into two groups: the selected feature subset, denoted SF, and the non-selected feature subset, denoted NSF. These subsets satisfy the following properties:

    Table 2 Example of orthographic, morphologic, and prefix suffix features
    1. \( SF \cup NSF = F \)

    2. \( SF \cap NSF = \varPhi \)

    On the basis of these two feature subsets SF and NSF, we define three quantities [4], \(\varPhi _{1}, \varPhi _{2}, \varPhi _{3}\). Here, \(\varPhi _{1}\) represents the average redundancy of the selected features: it is the average normalized mutual information between all possible pairs of selected features. A lower value of \(\varPhi _{1}\) indicates that the selected features are more non-redundant with respect to each other. \(\varPhi _{1}\) is defined as follows:

    $$\begin{aligned} \varPhi _{1} = \sum _{f_i,f_j \in SF,f_i\ne f_j} \frac{2NMI(f_i,f_j)}{|SF|(|SF|-1)} \end{aligned}$$
    (4)

    where, \(NMI(f_i,f_j)\) is the normalized mutual information between two features, \(f_i\) and \(f_j\).

    Similarly, \(\varPhi _{2}\) represents the similarity between the non-selected features and the selected set: it is defined as the average normalized mutual information between each non-selected feature and its nearest selected feature. \(\varPhi _{2}\) is defined as follows:

    $$\begin{aligned} \varPhi _{2} = \sum _{f_i \in NSF, f_j\in SF, f_j=1NN(f_i)} \frac{NMI(f_i,f_j)}{|NSF|} \end{aligned}$$
    (5)

    Here, \(1NN(f_i)\) gives the first nearest neighbor, among the selected features, of a feature \(f_i \in NSF\). \(\varPhi _{3}\) represents the average standard deviation of the selected feature set. As our main motive is to make this objective function robust, we try to maximize \(\varPhi _{2}\) and \(\varPhi _{3}\) while minimizing \(\varPhi _{1}\). To satisfy this, we combine these quantities into a single objective function \(f_3(.)\), represented as follows:

    $$\begin{aligned} f_3(.) = \varPhi _{3}(\varPhi _{2}- \varPhi _{1}) \end{aligned}$$
    (6)

    The main aim is to maximize the value of \(f_3(.)\); a higher value of this objective function leads to a better feature subset.

  • Objective function-4 This objective function also utilizes the three quantities \(\varPhi _{1}\) (to be minimized), \(\varPhi _{2}\) (to be maximized), and \(\varPhi _{3}\) (to be maximized) and is represented as follows:

    $$\begin{aligned} f_4(.) = \varPhi _{3}\varPhi _{2}- \varPhi _{1} \end{aligned}$$
    (7)

    The values of \(\varPhi _{2}\) and \(\varPhi _{3}\) should be maximized and the value of \(\varPhi _{1}\) should be minimized in order to find out the optimal feature set.

  • Objective function-5 We experiment with one more objective function, which is a slight variation of the previous one. Here, we do not consider \(\varPhi _{3}\); \(\varPhi _{2}\) should be maximized and \(\varPhi _{1}\) minimized. The resulting objective function is represented as follows:

    $$\begin{aligned} f_5(.) = \varPhi _{2}- \varPhi _{1} \end{aligned}$$
    (8)
  • Objective function-6 We use one more variant of the above objective function, with the aim of increasing the similarity between the non-selected features and the selected set, \(\varPhi _{2}\), while decreasing the redundancy among the selected features, \(\varPhi _{1}\). The greater the value of this objective function, the better the feature subset obtained. This objective function is represented as follows:

    $$\begin{aligned} f_6(.) = \frac{\varPhi _{2}}{\varPhi _{1}} \end{aligned}$$
    (9)
  • Objective function-7 We also construct an objective function (the information gain ratio) by leveraging the concept of information gain. It is defined as the gain over the entropy. It improves on information gain by providing a normalized score for each feature's contribution to a classification decision. The gain ratio is used as a disparity measure, and a high gain ratio for a selected feature implies that the feature will be useful for classification. It is derived as follows:

    $$\begin{aligned} f_7(.)= \frac{\text {Information Gain}}{\text {Entropy}} \end{aligned}$$
    (10)
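To make these measures concrete, the sketch below (our own illustration, not the authors' code) implements two representative objective functions: the correlation-based merit \(f_2(.)\) of Eq. (3) and the NMI-based ratio \(f_6(.)\) of Eq. (9). The feature columns are assumed to be discretized, absolute correlations are used to keep the merit well-defined, and the nearest selected neighbor in \(\varPhi _{2}\) is taken to be the selected feature with the highest NMI, which is one plausible reading of \(1NN(.)\).

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score as nmi

def f2_cfs_merit(X, y, selected):
    """Eq. (3): k * mean|corr(f, class)| / sqrt(k + k(k-1) * mean|corr(f, f')|)."""
    k = len(selected)
    r_cf = np.mean([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in selected])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for a, i in enumerate(selected) for j in selected[a + 1:]])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

def f6_nmi_ratio(X, selected, non_selected):
    """Eq. (9): Phi_2 / Phi_1, redundancy of the dropped features over that of the kept ones."""
    phi1 = np.mean([nmi(X[:, i], X[:, j])                          # Eq. (4): mean pairwise NMI in SF
                    for a, i in enumerate(selected) for j in selected[a + 1:]])
    phi2 = np.mean([max(nmi(X[:, i], X[:, j]) for j in selected)   # Eq. (5): NMI with nearest selected feature
                    for i in non_selected])
    return phi2 / phi1 if phi1 > 0 else 0.0

# Toy usage with six discretized feature columns and a class column.
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 6))
y = rng.integers(0, 3, size=200)
print(f2_cfs_merit(X, y, [0, 2, 4]))
print(f6_nmi_ratio(X, [0, 2, 4], [1, 3, 5]))
```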

3.4 Overview of particle swarm optimization

Particle swarm optimization (PSO) [10] is a stochastic population-based optimization strategy inspired by the social behavior of birds searching for an optimal path. It is a meta-heuristic model in which a swarm (a set of particles) traverses the search space with some velocity to obtain the best set of solutions. Every particle specifies a candidate solution to the optimization problem.

The feature selection algorithm that we devise here is founded on the binary version of PSO [25]. In binary PSO, each element of a particle's position vector takes a binary value, i.e., 0 or 1. With successive generations, the particles update their positions and move toward the best solution in the search space. The overall process comprises four steps:

  1. Initialization of the population (swarm);

  2. Update of each particle's personal (self) best position and the global best position;

  3. Update of the velocity vector;

  4. Generation of new particles.

[Algorithm 1: pseudocode]

The different steps of the proposed approach are described in detail below.

3.4.1 Initialization of the population

Initially, each particle is encoded as a fixed-length binary-valued string \(\overrightarrow{P}(i) = (p(i,1), p(i,2),\ldots , p(i,n))\), where \(p(i,j) \in \{0,1\}\), \(i =1, 2,\ldots , N\) with N the number of particles, and \(j = 1, 2, \ldots , n\) with n the number of features (particle dimension). The values of the different bit positions \(p_{(i,j)}\) of \(\overrightarrow{P}(i)\) are initialized as presented in lines 6–11 of Algorithm 1. Every particle is associated with a velocity vector \(\overrightarrow{V}(i)= (v_{(i,1)}, v_{(i,2)}, \ldots , v_{(i,n)})\), whose components are initially set to random values in \((-1,1)\) (line 11 of Algorithm 1). A sketch of this initialization is given after the list below. In order to obtain the optimal solution, each particle also keeps track of two variables:

  1. The personal best position \(\overrightarrow{B}(i)\), which represents the best solution attained by the particle so far.

  2. The global best position \(\overrightarrow{G}\), which keeps track of the best solution of the entire swarm.

Initially, both \(\overrightarrow{B}(i)\) and \(\overrightarrow{G}\) are set to NULL.
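A minimal sketch of this initialization (array shapes and helper names are our own assumptions) is:

```python
import numpy as np

def init_swarm(num_particles, num_features, seed=0):
    """N binary position vectors p(i,j), velocities in (-1, 1), and undefined bests."""
    rng = np.random.default_rng(seed)
    positions = rng.integers(0, 2, size=(num_particles, num_features))     # p(i,j) in {0, 1}
    velocities = rng.uniform(-1.0, 1.0, size=(num_particles, num_features))
    personal_best = [None] * num_particles        # B(i), initially NULL
    global_best = None                            # G, initially NULL
    return positions, velocities, personal_best, global_best

positions, velocities, pbest, gbest = init_swarm(num_particles=20, num_features=30)
print(positions.shape, velocities.shape)          # (20, 30) (20, 30)
```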

3.4.2 Evaluation of goodness measure

Each PSO-selected feature subset is evaluated with the information theoretic objective measures (cf. Sect. 3.3). We denote this evaluation of the goodness measure by \(fitness(particle\_position, obj\_func, train\_data)\). The goodness of a particle depends on the objective function (\(obj\_func\)) and the training data (features), and is calculated in lines 6 and 9 of Algorithm 2.
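A minimal sketch of such a fitness wrapper, assuming the objective functions accept the training feature matrix together with the selected and non-selected index sets, is:

```python
import numpy as np

def fitness(particle_position, obj_func, train_features, train_labels=None):
    """Score one particle: mask the feature columns and apply an objective from Sect. 3.3."""
    pos = np.asarray(particle_position)
    selected = np.flatnonzero(pos == 1)
    non_selected = np.flatnonzero(pos == 0)
    if selected.size == 0:
        return -np.inf                            # an empty subset is never optimal
    return obj_func(train_features, train_labels, selected, non_selected)
```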

3.4.3 Updation of the global and personal best positions

The personal best position \(\overrightarrow{B}(i)\) of particle i is updated when the particle reaches a position \(\overrightarrow{P}(i)\) whose fitness value is greater than that of the current \(\overrightarrow{B}(i)\). Similarly, the global best position \(\overrightarrow{G}\) is updated whenever the following holds:

$$\begin{aligned} fitness(\overrightarrow{B}(i), f_{k}(.), F_{train})>fitness(\overrightarrow{G}, f_{k}(.), F_{train}) \end{aligned}$$

This is depicted in lines 6–11 of Algorithm 2; a small sketch follows the algorithm listings below.

[Algorithm 2: pseudocode]
[Algorithm 3: pseudocode]
[Algorithm 4: pseudocode]
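A small sketch of this bookkeeping (variable names are ours; the fitness scores are assumed to be precomputed with the wrapper of Sect. 3.4.2) is:

```python
def update_bests(positions, scores, pbest, pbest_scores, gbest, gbest_score):
    """Update B(i) and G given the current particle positions and their fitness scores."""
    for i, (pos, score) in enumerate(zip(positions, scores)):
        if pbest[i] is None or score > pbest_scores[i]:
            pbest[i], pbest_scores[i] = pos.copy(), score        # new personal best B(i)
        if gbest is None or score > gbest_score:
            gbest, gbest_score = pos.copy(), score               # new global best G
    return pbest, pbest_scores, gbest, gbest_score
```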

3.4.4 Updation of the velocity vector

In binary PSO, the particle's velocity plays a crucial role in guiding the particle through the search space toward a possible solution of the target problem by updating its position. Each particle updates its velocity according to line 16 of Algorithm 2. The inertia weight w (\(0< w < 1\)) is set by the user to control velocity explosion, and \(\phi _1\) and \(\phi _2\) denote the learning parameters fixed by the user. For updating the velocity, the standard binary PSO strategy is followed.
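For reference, the standard binary PSO velocity update of [25], which we assume line 16 of Algorithm 2 follows, is

$$\begin{aligned} v_{(i,j)} \leftarrow w\, v_{(i,j)} + \phi _1 r_1 \left( b_{(i,j)} - p_{(i,j)}\right) + \phi _2 r_2 \left( g_{j} - p_{(i,j)}\right) \end{aligned}$$

where \(r_1\) and \(r_2\) are random numbers drawn uniformly from (0, 1), \(b_{(i,j)}\) denotes the personal best position \(\overrightarrow{B}(i)\), and \(g_j\) the global best position \(\overrightarrow{G}\).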

3.4.5 Selection of the new particles

For each particle i and dimension j, the new particle position \(p_{(i,j)}\) can be either 0 or 1 according to the following equation:

$$\begin{aligned} p_{(i,j)} = \left\{ \begin{array}{l l} 1 &{} \quad if~(rand (0,1) < S(v_{(i,j)}) ) \\ 0 &{} \quad \text {otherwise} \end{array} \right. \end{aligned}$$

Here, S(.) denotes the sigmoid function. The selection of the new particles is depicted in lines 18–22 of Algorithm 2; a small sketch of this resampling is given below.
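A sketch of the sigmoid-based position resampling (assuming the standard binary PSO rule of [25]; velocities are assumed to have been updated already, e.g., via the equation in Sect. 3.4.4) is:

```python
import numpy as np

def resample_positions(velocities, rng=None):
    """Set p(i,j) = 1 if rand(0,1) < S(v(i,j)), else 0."""
    if rng is None:
        rng = np.random.default_rng()
    s = 1.0 / (1.0 + np.exp(-velocities))                         # sigmoid S(v)
    return (rng.random(velocities.shape) < s).astype(int)

# Toy usage: 20 particles over 30 features.
velocities = np.random.default_rng(0).uniform(-1, 1, size=(20, 30))
print(resample_positions(velocities).shape)                       # (20, 30)
```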

Algorithm 4 presents the overall technique of our proposed PSO-based feature selection. The optimal feature set is selected based on the final global best position \(\overrightarrow{G}\); this selection process is described in Algorithm 3.

4 Computational complexity analysis

The total computational cost of running the PSO-based framework with an individual objective function comprises three parts: the cost \(\mathcal {O}(n^2)\) of evaluating the objective function, the cost of running the PSO algorithm, and the cost of building the model with the CRF-based classifier. Let N be the number of particles, n the number of features, I the number of iterations, and \(F_{avg}\) the average number of bits selected for a given particle; S denotes the size of the training set and L the label set.

The total cost of the proposed framework is then \(\mathcal {O}((F_{avg} \cdot S^2 \cdot L^2)+(I \cdot N \cdot n^2))\). In our case, \(S \gg n\), i.e., the size of the training set S is much larger than the total number of features n; therefore, the dominant computational cost comes from the CRF algorithm. However, this cost is required by any classification model built using CRF. Moreover, the cost of building the model with the full feature set, \(\mathcal {O}(n \cdot S^2 \cdot L^2)\), is much higher than the cost of building the model with the reduced number of features obtained by the proposed feature selection technique, \(\mathcal {O}(F_{avg} \cdot S^2 \cdot L^2)\). Using the proposed filter-based feature selection, the model complexity is reduced significantly compared to a wrapper-based feature selection model. Thus, the gain in terms of time complexity is significant.

5 Datasets, experiments, and analysis

This section provides an overview of the datasets, evaluation scheme, experimental results, and a thorough analysis of the results. The task is formulated as a sequence labeling problem, and we use CRF [32] for the experiments. For our implementation, we use the \(C{++}\)-based \(CRF{++}\) package.Footnote 2 To properly denote the boundaries of NEs, we follow the BIO encoding scheme, where B, I, and O denote the beginning, intermediate, and outside of NE tokens, respectively.

5.1 Datasets

We perform our experiments on four biomedical benchmark datasets, namely the GENIA version 3.02 corpus, GENETAG, AIMed, and BC-II. The GENIA corpus was derived from MEDLINE using the MeSH terms ‘human’, ‘blood cells’, and ‘transcription factors’. The datasets used for training and testing have been extracted from the GENIA version 3.02 corpus.

The NE types to be identified from this dataset are ‘DNA’, ‘RNA’, ‘Cell line’, ‘Cell-type’, ‘Protein’.

The AIMed corpus contains protein-protein interaction information. The aim here is to identify and classify entities as NEs of type ‘Protein’. This dataset was developed from the Database of Interacting Proteins (DIP) and contains a total of 197 abstracts.

The other dataset we explored is the GENETAG dataset, derived from ‘MedTag’. It consists of correct and incorrect gene and protein names. For the evaluation of our approach on GENETAG, we consider the dataset available at the URL in Footnote 3. Genes described in the dataset are annotated with the ‘NEWGENE’ tag, and overlapping genes are annotated with another tag, namely ‘NEWGENE1’.

The dataset consists of a total of 20,000 sentences with gene/protein mentions, of which 7500, 2500, and 5000 sentences are used as the training, validation, and test sets, respectively.

We also evaluate our proposed approach on the widely adopted BioCreative II Gene Mention dataset. The dataset consists of MEDLINE abstracts that were manually annotated for gene mentions. It comprises a total of 15,000 sentences, of which 7500 are used as the training set, 2500 as the validation set, and 5000 for testing.

5.2 Evaluation scheme

We evaluate our systems on precision, recall, and F score which are defined as follows:

$$\begin{aligned} precision= & {} \frac{|\{\text {Ground truth NE chunks}\} \cap \{\text {System predicted NE chunks}\}|}{|\{\text {System predicted NE chunks}\}|} \end{aligned}$$
(11)
$$\begin{aligned} recall= & {} \frac{|\{\text {Ground truth NE chunks}\} \cap \{\text {System predicted NE chunks}\}|}{|\{\text {Ground truth NE chunks}\}|} \end{aligned}$$
(12)

F score metric value is calculated on the precision and recall values as follows:

$$\begin{aligned} F_{\beta } = \frac{(1 + \beta ^{2})(recall * precision)}{\beta ^2 * precision + recall} \end{aligned}$$
(13)

Here \(\beta = 1\). We use the script referenced in Footnote 4 for the evaluation on the GENIA, GENETAG, and AIMed datasets. This script is an updated form of the CoNLL-2003 shared task [54] evaluation script and generates three different F measures according to exact, right, and left boundary matching. For evaluating the BC-II dataset, we used the script released by the BioCreative shared task.Footnote 5
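For illustration, the sketch below computes these chunk-level metrics on toy spans; the reported numbers are produced by the official scripts, not by this code.

```python
def prf(gold_chunks, pred_chunks, beta=1.0):
    """Chunk-level precision, recall, and F_beta over (start, end, type) spans, per Eqs. (11)-(13)."""
    tp = len(gold_chunks & pred_chunks)
    precision = tp / len(pred_chunks) if pred_chunks else 0.0
    recall = tp / len(gold_chunks) if gold_chunks else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

gold = {(0, 2, "PROTEIN"), (7, 9, "DNA")}
pred = {(0, 2, "PROTEIN"), (7, 8, "DNA")}      # one exact match, one boundary error
print(prf(gold, pred))                         # (0.5, 0.5, 0.5)
```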

5.3 Experimental results

We define the following two baseline systems against which to compare our proposed feature selection approach.

  • Baseline 1 This baseline model is developed by training CRF with the full feature set discussed in Sect. 3.2.

  • Baseline 2 This baseline model is built by manually varying the contextual feature combination.Footnote 6

The results of these two baselines are reported in Table 3. For the rest of our experiments, we performed threefold cross-validation on the training data to set the values of the PSO parameters. We present in Table 4 the values of the different parameters used in the experiments.

Table 3 Results of baseline systems on all the biomedical datasets

Thereafter, we evaluate our proposed approach on these four datasets.

Table 4 Results of the proposed approach with different PSO parameter settings on validation dataset
Table 5 Results of the proposed approach on the defined objective functions for each dataset
Table 6 Optimal feature selected through the proposed PSO-based framework on all the datasets

5.3.1 Analysis of results

We carried out a deep analysis of the results, both in terms of performance (F score) (cf. Table 5) and the number of features selected, for all the datasets (cf. Table 6). The obtained results (Table 5) show that our proposed approach significantly outperforms the baseline systems. The objective functions \(f_{3}(.)\), \(f_{4}(.)\), \(f_{5}(.)\), and \(f_{6}(.)\), which are based on normalized mutual information, achieve the best F score values on the GENIA, AIMed, GENETAG, and BC-II datasets, respectively, compared to the objective functions based on correlation (\(f_{1}(.)\), \(f_{2}(.)\)) and information gain (\(f_{7}(.)\)). Moreover, in terms of the number of features selected, objective function \(f_{2}(.)\) outperforms the other objective functions by selecting 19, 5, 13, and 15 features from the GENIA, AIMed, GENETAG, and BC-II datasets, respectively. From the evaluation results, we observe the following:

  • The feature subset having a high NMI with respect to the target output is likely to reduce the uncertainty on the values taken by the output.

  • NMI-based objective functions are able to detect nonlinear relationships between variables, whereas the correlation-derived objective functions are restricted to linear dependencies.

  • NMI-based objective functions are found to be more relevant for jointly redundant or relevant features, which render univariate criteria useless.

  • The objective functions formulated on correlation and information gain allow overestimation of the relevance of some features.

  • NMI-derived objective measures correspond to the KL divergence between the joint density and the product of the individual densities. Therefore, NMI can measure non-monotonic and other more complicated relationships, in contrast to correlation-based measures.

From the obtained results and the above-mentioned observations, the NMI-derived objective function \(f_{6}(.)\) can be selected among the 7 proposed objective functions. Although the use of \(f_{6}(.)\) does not lead to the highest performance for all the datasets, the results obtained with \(f_{6}(.)\) as the objective function are consistent and very close to the optimal solution. Note that \(f_{5}(.)\) and \(f_{6}(.)\) are two standard ways of combining the two quantities \(\varPhi _1\) and \(\varPhi _2\). In order to select a good feature subset, \(\varPhi _1\) should be minimized and \(\varPhi _2\) should be maximized; our ultimate goal is to select the optimal feature set by maximizing the given objective functions. The objective function \(f_{5}(.)\) can take a value of zero (\(\varPhi _2 \approx \varPhi _1\)) or have a negligible effect (\(\varPhi _2 \gg \varPhi _1\)), but this is not a problem for the objective function \(f_{6}(.)\). For these reasons, we obtain consistent performance with \(f_{6}(.)\) across the various datasets.

5.4 Statistical significance test

We carried out statistical significance tests on the results obtained with each objective function in the PSO-based framework. For each dataset and each objective function, the experiments were executed for 20 independent runs, and we used the t statistic to examine the obtained results. Table 7 shows the t test results, where the p value is the probability of the improvement occurring by chance. For all the datasets, the obtained p value is less than 0.05. This verifies that the improvement is statistically significant and that it is unlikely that the gain in classification accuracy happened by chance. A sketch of such a check is given below.
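The sketch below uses synthetic placeholder scores; the exact test variant beyond the t statistic is not specified in the text, so a paired t test over the per-run F scores is assumed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
baseline_runs = 72.0 + rng.normal(scale=0.3, size=20)   # synthetic placeholder scores, 20 runs
proposed_runs = 75.5 + rng.normal(scale=0.3, size=20)   # synthetic placeholder scores, 20 runs

t_stat, p_value = stats.ttest_rel(proposed_runs, baseline_runs)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}, significant at 0.05: {p_value < 0.05}")
```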

Table 7 p Value for different objective functions on each datasets

6 Comparative analysis with state-of-the-art techniques

In this section, we present comparisons of our proposed NER system with existing state-of-the-art techniques. We compare our system with both domain-dependent and domain-independent techniques. For our experiments, we have considered only PoS and chunk information as domain-dependent features. As evident from Table 8, our system outperforms almost all existing state-of-the-art techniques. However, our system was unable to beat Danger et al. [8], who adopted supervised machine learning with handcrafted features covering both domain-dependent and domain-independent information. In their first phase, the system was trained with CRF on a prominent domain-independent feature set comprising word shape, brief word shape, affixes, PoS, and chunk information; the F score obtained with this phase was reported as \(71.29\%\). In the second phase, they incorporated domain-dependent features (a cell line lexicon) together with other handcrafted features such as DNA sequence, head noun, distance to the head noun, and Roman and Greek symbols, leading to an F score improvement of 5.16 points. In contrast, our proposed system uses only a few PSO-selected features that are domain independent in nature, and it does not rely on any gazetteer. The success of the Danger et al. system is attributed to the use of the cell lexicon, whose removal leads to an F score of \(71.29\%\).

Table 8 Comparisons with state-of-the-art techniques: GENIA dataset

We further compare our system with existing techniques on the GENETAG, AIMed, and BC-II datasets, as reported in Tables 9, 10, and 11, respectively. The proposed NER system utilizing PSO-based feature selection outperforms the other state-of-the-art systems on all the datasets except AIMed. On the AIMed dataset, our system lags behind the system of [13] by 3.13 F score points. This is because of a different parameter setting (population size): [13] used a genetic algorithm-based framework with a population size of 200, whereas a much smaller population size (20) is used in our proposed framework. We also performed the experiment with the same population size of 20 for the GA and achieved comparable performance (90.35 F score).

Table 9 Comparisons with state-of-the-art techniques: GENETAG dataset
Table 10 Comparisons with state-of-the-art techniques: AIMed dataset
Table 11 Comparisons with state-of-the-art techniques: BC-II

7 Error analysis

Here, we analyze the results to get an idea of the possible errors. We made the following observations:

  • Boundary detection problem We observed that our system suffers from the boundary detection problem on the GENIA and GENETAG datasets. The classifier is largely confused between the classes ‘I-PROTEIN’ and ‘B-PROTEIN’, where 164 instances were wrongly classified. The misclassification of NEs into the “Other-than-NE” class amounts to 1305 instances. For the GENETAG dataset, we found that the majority of the wrongly predicted classes were labeled ‘I-NEWGENE’; in total, 487 instances were misclassified as ‘I-NEWGENE’.

    The main cause of the boundary detection problem is the descriptive naming convention, especially for the entity type “Protein”. One of the misclassified examples is “T cell activation-specific enhancer”, where the boundary was not detected properly. It is hard even for a biologist to decide whether descriptive words like “normal” or “activation” should be part of the entity.

    NE disambiguation is often a cause of improper identification of NEs. For example, the NE “T3 binding sites” is a protein term but was wrongly labeled as “DNA”, which is not acceptable. The system also had difficulty identifying NEs containing parentheses.

  • Short words This error was most predominant in AIMed. Short words were mainly misclassified as parts of NEs. The probable reason is that many short words appear as parts of NEs in the training data, and our model fails to identify the context. It was also observed that our system has difficulty with instances containing lowercase letters or symbols, which are therefore tagged as “Other-than-NE”.

  • Acronyms These words either refer to non-gene entity acronyms or to some value. For example, ‘HAT’ is the abbreviation for the entity name “hepatic artery thrombosis”, but in context it was referring to “histone acetyl transferase”, a non-entity name. Errors were also encountered due to false negative cases, where gene names in the test set were unseen; the classifier lacks the knowledge and sufficient contextual clues to handle them.

8 Conclusion

In this paper, we have proposed a novel filter-based feature selection method using information theoretic concepts for solving the NER task on multiple biomedical corpora. In particular, we have used PSO as the optimization technique to determine the best-fitting feature set for the problem. The main focus of this paper was to compare several information theory-based objective functions within the PSO-based feature selection framework. To this end, we have defined seven goodness measures that were highly effective in identifying features relevant to solving NER problems in the biomedical domain. We have exploited the concepts of normalized mutual information, gain ratio, and correlation to design our objective functions.

We have evaluated the proposed technique on multiple biomedical corpora, using CRF as the base classifier. Experimental results show that our models achieve good performance on all the datasets without using heavy domain-specific resources and/or tools. The results obtained by the proposed method are as good as the state of the art, and its most appealing characteristic is that we are able to reduce the complexity of the model significantly by minimizing the number of features used. Comparisons among the several objective functions reveal that the information gain-based metric is very helpful in determining the best subset of features.

Correlation as the objective function was also observed to help improve the accuracy. In future work, we would like to build a feature selection technique that optimizes all the objective functions simultaneously using the well-known concept of multiobjective optimization.