Multiscale Laplacian graph kernel combined with lexico-syntactic patterns for biomedical event extraction from literature

Abdulkadhar, Sabenabanu; Bhasuran, Balu; Natarajan, Jeyakumar

doi:10.1007/s10115-020-01514-8

Regular Paper
Published: 24 October 2020

Volume 63, pages 143–173, (2021)

Knowledge and Information Systems

Abstract

Bio-event extraction is an extensive research area in biomedical text mining that focuses on elaborating relationships between biomolecules and can reveal various aspects of their nature. Bio-event extraction plays a vital role in biomedical literature mining applications such as biological network construction, pathway curation, and drug repurposing. Extracting biological events automatically is a difficult task because of the uncertainty and variety of natural language, including negations and speculations, which leaves further room for the development of feasible methodologies. This paper presents a hybrid approach that integrates an ensemble-learning framework, combining a Multiscale Laplacian Graph kernel and a feature-based linear kernel, with a pattern-matching engine to identify biomedical events and their arguments. The graph-based kernel not only captures the topological relationships between the individual event nodes but also identifies the associations among the subgraphs for complex events. In addition, lexico-syntactic patterns were used to automatically discover the semantic role of each word in the sentence. For performance evaluation, we used the gold standard corpora, namely BioNLP-ST (2009, 2011, and 2013) and GENIA-MK. Experimental results show that our approach achieved better performance than other state-of-the-art systems.

1 Introduction

Advances in both biological and computational methods have acted as a catalyst for a large number of publications, especially in the biomedical domain [1]. Life science research outputs are widely disseminated as scientific articles, which can act as a source for knowledge discovery [2]. Recently, biomedical text mining applications have been developed using this literature, with a focus on biological and clinical areas such as screening of clinical trials, pharmacogenetics, reaction detection and repurposing of drugs [3].

Initial efforts on text mining in the biomedical domain focused mainly on fundamental tasks such as categorizing bio-entities (genes, proteins, diseases, and drugs) and extracting binary relationships between the entities (protein–protein interactions, gene–disease associations and disease–disease associations) [4]. Extracting relations from biomedical literature is a significant task in the area of semantic mining of text [5]. Recent relation extraction strategies have been applied to various biomedical problems such as protein–protein interactions (PPIs) [6], gene–disease associations [7], chemical-induced disease (CID) [8] and chemical–disease relations (CDR) [9]. Biomedical relation classification tasks focusing on PPI and drug–drug interaction [10] show the importance and applications of relation extraction from the literature.

Following the success of the relation extraction task, the next focus is on extracting related biomolecular events from the text. In general, a bio-event is a textual event specialized for the biomedical domain: a dynamic bio-relation involving one or more participants. These participants can be bio-entities or bio-events, and each is usually assigned a semantic role such as theme or cause [11, 12]. Bio-event extraction can help us to understand certain biological processes and supports applications such as pathway reconstruction [13], semantic search [14], association mining for knowledge discovery, and bioprocess extraction [15]. Automatically extracting events from biomedical text is a challenging task because of the uncertainty and variety of natural language, including negations and speculations, which occur in biological text and can lead to misunderstanding and incorrect interpretation [11, 12].

The bio-event extraction process consists of two common steps: trigger detection and argument detection. Trigger detection comprises the detection of event triggers and their types, as specified by the selected ontology [11]. Argument detection, also known as edge detection or event theme construction, is the process of detecting the arguments of the events. The arguments can be named entities (genes, proteins, diseases) or events represented by trigger words [11, 12, 16]. Consider the following example.

Example:

PMCID: 1310901

Original Sentence:

Down-regulation of interferon regulatory factor 4 gene expression in leukemic cells.

Tagged Sentence:

<trigger>Down-regulation </trigger> of <theme>interferon regulatory factor 4 </theme><trigger>gene expression </trigger> in leukemic cells.

Here the trigger words ‘Down-regulation’ and ‘expression’ denote two events, regulation and gene expression, and the gene ‘interferon regulatory factor 4’ is the theme, representing the argument in the sentence.
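The inline tagging used in the example can be read back into a simple structure. The following is a minimal sketch, assuming only the `<trigger>` and `<theme>` tags shown above; the function name `extract_annotations` and the regular expression are illustrative and are not part of the system described in this paper.

```python
import re

# Matches <trigger>...</trigger> or <theme>...</theme>; \1 forces the closing
# tag to agree with the opening one.
TAG_PATTERN = re.compile(r"<(trigger|theme)>(.*?)</\1>")


def extract_annotations(tagged_sentence):
    """Collect the text spans of each tag type, in order of appearance."""
    found = {"trigger": [], "theme": []}
    for tag, text in TAG_PATTERN.findall(tagged_sentence):
        found[tag].append(text.strip())
    return found


tagged = (
    "<trigger>Down-regulation </trigger> of <theme>interferon regulatory "
    "factor 4 </theme><trigger>gene expression </trigger> in leukemic cells."
)
annotations = extract_annotations(tagged)
print(annotations)
```

For the tagged sentence above, this yields two trigger spans and one theme span.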

There has been wider acceptance of the notion that biomolecular events can play a crucial role in the molecular mechanisms of diseases and can be linked with interactions in pathways and networks [12, 16]. For this and other reasons, the notable shared task community challenges BioNLP-ST (Biomedical Natural Language Processing Shared Task) in 2009 [16], 2011 [17], 2013 [18] and 2016 [19] were organized with a specific focus on biomolecular event extraction from the literature. The core problem in these tasks was the extraction of biomolecular events from standard datasets, which are based on the GENIA corpus [20]. The GENIA corpus was later enriched with domain-specific meta-knowledge and named the GENIA-MK (Meta-knowledge) corpus [21]. The GENIA-MK corpus contains human-curated annotations of 9,372 sentences from 1,000 abstracts, in which 36,858 typed, complex and nested events are represented [21]. Recently, Zerva et al. [22] proposed a hybrid approach combining a random forest with generic rule patterns, which uses the dependencies between trigger words and cues of uncertain events, and achieved an F-score of 88% on the GENIA-MK corpus.

1.1 Background

Different text-mining approaches have been developed using techniques such as rule-based [23], dictionary-based [24], machine learning [25], and hybrid approaches [26]. In particular, Support Vector Machine algorithms combined with rule-based or dictionary-based approaches are widely used for extracting biomolecular events [27]. Despite the several existing approaches, the challenge is still open and leaves room for improvement. For example, pattern matching and dictionary-based approaches achieved only moderate results on complex event types such as regulation, negative regulation, and positive regulation [11]. Machine learning based studies [25] employed different strategies such as kernel-based learning [28, 29], deep learning [30,31,32], graph-based learning [33,34,35,36,37,38,39,40,41] and hybrid approaches [26] to extract biomedical events efficiently.

Recently, enriched graph-based features have played an important role in extracting events from text and have produced the best systems for the classification of biological events [42]. The advantage of graph-based approaches for event extraction is that they use structural properties of the sentence such as semantic and syntactic features, path features, and similarity features, as briefly explained in the review [42]. Earlier, various graph-based approaches such as subgraph mining [43,44,45], random walk [46], shortest paths [47], subgraph matching [39,40,41, 48] and hybrid methods [49] were introduced to extract biomedical events from the literature.

Subgraph mining is the process of extracting the important concepts from the graph [43,44,45]. A random walk describes a path consisting of random steps from one node (bio-entity) to another node (bio-event) in the graph [46]. The shortest path is the shortest optimized path between two nodes (entity and event) [47]. Graph matching techniques are used to determine whether one text can be inferred from another by using the dependency parses of the two texts [39]. Subgraph matching techniques are used to extract the maximum common subgraph between two graphs [39,40,41]. On the other hand, kernel-based approaches integrated with graphs have produced efficient results in relation extraction tasks [50, 51]. A graph kernel has been generated using dependency parsing techniques, in which each graph contains the dependency structure and the linear order of the words [52]. In this study, we employed a special graph kernel named the Multiscale Laplacian Graph (MLG) kernel [53], integrated with a linear feature-based kernel, to extract biological events from text. The MLG kernel is used to compare the structure of graphs at multiple different scales. The motivation behind employing MLG is that it not only captures the topological relationships between the individual event nodes but also identifies the topological relationships between the subgraphs [53]. The following section briefly describes state-of-the-art approaches for the task of biomedical event extraction.

1.2 Related work

Bjorne et al. [33] used n-gram features, shortest-path syntactic dependencies between event arguments and rule-based graph pruning to extract events, and attained an F-score of 51.95% on the BioNLP-ST-2009 task dataset. The disadvantage of this approach is its low trigger detection performance on the test set. In BioNLP-ST-2013, Bjorne and Salakoski [34] presented an automated event extraction system named TEES 2.1. It is a machine learning based tool for extracting text-bound graphs from natural language articles; it represents both binary relations and events in a unified graph format, where named entities and triggers are nodes and relations and event arguments are edges, and it reported an F-score of 50.74%. The lack of learned rules caused defects in the argument detection phase. For example, consider an event with multiple optional arguments, such as Cell differentiation from the CG task, with 0–1 AtLoc arguments and 0–1 Theme arguments. While such an event can in principle exist without any arguments at all, it is often the case that at least one of the optional arguments must be present. Hakala et al. [35] used graph-represented features, including paths connecting nested events and the occurrence of pairs of entities (such as genes and proteins) in general subgraphs mined from external PubMed and PMC abstracts, and reported the best F-score of 50.97% in BioNLP-ST-2013. The main limitation of this system is that it improves only precision, not recall.

In BioNLP-ST-2011, Riedel et al. [36] extracted event arguments by scoring candidate subgraphs to rank event pairs and achieved an F-score of 57.46%. In this system, they employed stacking and the UMass model (a trained model that consists of trigger labels, event arguments and protein pairs) to extract the events. Stacking led to better performance in this system, but the combination of stacking with the UMass model caused slight variation in performance on the test sets. McClosky [54] converted the annotated event structures in the training data to an event dependency graph that takes entities (event arguments) as vertices and edges, and attained an F-score of 50% in BioNLP-ST-2011. Riedel and McCallum [55] implemented a stacking procedure and combined their approach with that of McClosky [54]; they extracted event arguments by scoring candidate subgraphs to rank event arguments and achieved an F-score of 56.05% on the BioNLP-ST-2011 dataset. The limitation of this approach is that it is harder to extract events from full text.

Liu et al. [39, 40] implemented Exact Subgraph Matching and Approximate Subgraph Matching (ESM/ASM) approaches to extract events from the literature efficiently. In their method, they applied ESM/ASM from sentence graphs to event graphs, applied a distance metric to every vertex of the subgraphs, and attained an F-score of 51.12% on the BioNLP-ST-2011 dataset. The lack of post-processing rules and inconsistencies in the gold annotation caused additional false positives and false negatives in this system. Liu et al. [41] further improved their ESM/ASM based approach with a distributional similarity model (DSM) and optimized graph features, and attained an F-score of 55.09% in BioNLP-ST-2013. The limitation of this approach is low recall due to ‘Site’ entity recognition.

Apart from the above graph-based approaches, different classification approaches have recently been deployed to extract biomedical events efficiently [30, 56,57,58]. Some of the notable works are discussed here. Munkhdalai et al. [56] proposed a semi-supervised learning method named self-training in significance space (STSS) to solve the imbalanced data problem and attained an F-score of 54.30% in BioNLP-ST-2011. The system performance is lower in terms of F-measure because of the computational requirements. Wang et al. [30] presented a multiple distributed representation method, which combines dependency-based context formed by word embeddings with task-based features from biomedical text and feeds them to deep learning models, and achieved F-scores of 59.94%, 55.20%, and 50.12% on the BioNLP-ST-2009, 2011, and 2013 datasets, respectively; this method still needs manually designed features, which limits its power of generalization. Li et al. [57] used an optimization method named dual decomposition along with rich dependency parse based features and unsupervised word features, and extracted events with F-scores of 56.09% and 53.19% in BioNLP-ST-2009 and 2013. Recently, Wang et al. [58] implemented a Bidirectional Long Short Term Memory (Bidirectional-LSTM) approach for event extraction on the Multi-Level Event Extraction (MLEE) corpus. Furthermore, to generalize their approach they used the BioNLP-ST-2009, 2011, and 2013 corpora and achieved F-scores of more than 60% on the development sets.

Biomolecular event applications are of increasing importance, and current trends in biomedical relation extraction tasks favor ensemble learning methods and graph-based approaches [33, 42]. Motivated by this, our work integrates a Multiscale Laplacian Graph (MLG) kernel with a feature kernel as an ensemble model for the event extraction task. The challenge of the current study was the extraction of complex events using subgraph mining, thereby gaining a deeper understanding of the biomolecular events. Kondor and Pan [53] first introduced MLG, which is used to compare the structure of graphs simultaneously at multiple different scales. The objective of employing MLG in our event extraction is that it not only captures the topological relationships between the individual event nodes but also identifies the associations among the subgraphs for complex events.

The rest of the paper is organized as follows. Section 2 details the proposed materials and methods with a complete overview of the MLG model used in this study. Section 3 presents the results and discussion, followed by conclusions and future perspectives in Sect. 4.

2 Methods

The event extraction system presented in this study has three subtasks, namely (i) text pre-processing, (ii) event identification and (iii) argument detection. In text pre-processing, we applied general steps such as text preparation and cleaning, recognition of gene and protein mentions, and dependency parsing of event sentences. In the event identification phase, we used two kernels: a baseline feature-based kernel, which uses token-based features, sentence-based features, parsing features and domain-specific features, and the Multiscale Laplacian Graph (MLG) kernel, which uses the multilevel topological relationships between the event nodes as features. The feature-based kernel and the MLG kernel were combined using ensemble SVM for event identification. Finally, in the argument detection phase, we used lexico-syntactic patterns to detect the arguments of the events. The overall schematic architecture of our event extraction pipeline is depicted in Fig. 1, and each subtask is described in detail in the following subsections.

In our methodology, we considered the nine most crucial events from BioNLP-ST [16,17,18], which are commonly used in existing studies. The nine types of events are merged into three main classes. The first five (Gene Expression, Transcription, Protein catabolism, Phosphorylation, Localization) had only one argument (theme: protein) and these events are called simple events. The second class of binding events involved more than one argument (two themes: proteins). Finally, the regulated events (Regulation, Positive regulation, Negative regulation) had two arguments: a theme and cause (event or protein).
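The grouping of the nine event types into three classes can be written down as a small lookup table. The sketch below only restates the classes and argument roles listed in the paragraph above; the names `EVENT_CLASSES`, `ARGUMENT_ROLES` and `classify_event` are hypothetical and chosen for illustration.

```python
# Event types grouped into the three classes named in the text.
EVENT_CLASSES = {
    "simple": [
        "Gene expression",
        "Transcription",
        "Protein catabolism",
        "Phosphorylation",
        "Localization",
    ],
    "binding": ["Binding"],
    "regulation": ["Regulation", "Positive regulation", "Negative regulation"],
}

# Argument roles per class as described in the text: simple events take one
# theme, binding events take two themes, regulation events take a theme and
# a cause.
ARGUMENT_ROLES = {
    "simple": ("Theme",),
    "binding": ("Theme", "Theme"),
    "regulation": ("Theme", "Cause"),
}


def classify_event(event_type):
    """Return the class name for an event type (case-insensitive lookup)."""
    key = event_type.strip().lower()
    for class_name, event_types in EVENT_CLASSES.items():
        if key in (t.lower() for t in event_types):
            return class_name
    raise KeyError(f"unknown event type: {event_type!r}")


print(classify_event("Phosphorylation"), ARGUMENT_ROLES[classify_event("Phosphorylation")])
print(classify_event("Negative regulation"), ARGUMENT_ROLES[classify_event("Negative regulation")])
```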

2.1 Text pre-processing

2.1.1 Text preparation and cleaning

To prepare the corpus for event extraction, the following pre-processing steps were carried out: tokenization, sentence segmentation, POS tagging, lemmatization, and chunking. OpenNLP [59] was used for sentence splitting, tokenization, POS tagging, and chunking. Lemmatization was done with BioLemmatizer [60].

2.1.2 Dependency parsing

We applied dependency parsing to obtain the grammatical relationships between pairs of words from a graph representation of the dependency relations in a sentence. The advantage of dependency parsing is that it identifies the grammatical relationship between two words and yields a syntactic representation of a given sentence. A dependency relation is formalized as a direct grammatical relationship between two words (a headword and a dependent word), and a sentence is represented as a graph of dependency relations [61]. Dependency-related features play an important role in extracting biomedical events. Here, we used two dependency parsers: the Stanford Dependency Parser (SDP) [62] to compute the universal dependencies and the GENIA Dependency Parser (GDep) [63] to generate the dependency graph of the sentence. Figure 2 depicts the dependency parse for a simple sentence. Here we can see that binary relations between common nouns such as transcription, gene and activity and adjectives and prepositions such as binding, in and c-jun were identified. The given sentence states that Leukotriene B4 stimulates the transcription of the genes c-fos and c-jun and AP-1 binding activity in human monocytes. The dependency parser identified transcription, gene, Leukotriene and activity as NN (noun, singular), AP-1 as CD (cardinal number), and monocytes as NNS (noun, plural). The dependency parser also identified the grammatical relations within the sentence using amod (adjectival modifier), dobj (direct object), pobj (object of preposition), conj (conjunction), and prep (preposition).

2.1.3 Named entity recognition (NER)

The next step in our approach is the recognition of gene/protein mentions in the event sentences. Named entities play an important role in extracting events with high accuracy, since they fill the theme and cause roles. NER is the process of detecting entities such as genes, proteins, diseases, species, RNA, cells and cell lines in text [64, 65]. BCC-NER [66], our in-house hybrid named entity tagger, was used to detect gene and protein names automatically.

2.2 Event identification

Next, for event identification, we used an ensemble machine learning based classification approach with two kernels, namely feature-based kernel and MLG kernel. The feature-based kernel uses token-based features, sentence-based features, parsing features, and domain-specific features. The Multiscale Laplacian Graph Kernel (MLG) [53] uses the multilevel topological relationships between the event nodes as features. Both the feature-based kernel and the MLG kernel were combined using ensemble SVM [67] for event identification.

2.2.1 Feature-based kernel

In the baseline feature-based linear kernel, we used a total of 15 features broadly classified into four categories, namely token-based, sentence-based, parsing and domain-specific features, which were employed successfully in previous bio-event extraction tasks [68,69,70]. All 15 features are grouped by category and listed in Table 1. The detailed feature representations used to generate the feature-based kernel model are explained in Supplementary file S3.

Table 1 Category wise features used in feature-based kernel

2.2.2 Multiscale Laplacian Graph (MLG) kernel

Recently, graph-based approaches for relation extraction have received increased attention for their ability to capture both syntactic and semantic structures, thereby enabling a deeper understanding of complex sentences such as those describing bio-events and achieving state-of-the-art performance [41]. To improve the performance of the bio-event extraction task, we employed the MLG kernel [53] along with the baseline feature-based kernel in our approach. The MLG kernel [53] is briefly introduced below; it is constructed from two graph kernels, namely (i) the Laplacian Graph kernel (LG) and (ii) the Feature space Laplacian Graph kernel (FLG). The implementation of the MLG kernel is available at https://github.com/horacepan/MLGkernel.

Laplacian Graph (LG) Kernel: Consider graph $ G $ as the weighted undirected graph with vertex set $ V = \left\{ {v_{1} ,v_{2} \ldots v_{n} } \right\} $ and the edge set $ E $. The graph Laplacian [75] is a positive semi-definite matrix and it can be represented using adjacency matrix $ A $ and weighted degree matrix $ D $. The Laplacian matrix of the graph can be expressed using the notation $ L = D - A $.
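As a small worked example of $ L = D - A $ (an illustration only, not taken from the corpora used in this study), consider a path graph on three vertices with unit edge weights, $ v_{1} - v_{2} - v_{3} $:

$$ A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} ,\quad D = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix} ,\quad L = D - A = \begin{pmatrix} 1 & - 1 & 0 \\ - 1 & 2 & - 1 \\ 0 & - 1 & 1 \end{pmatrix} $$

Every row of $ L $ sums to zero, so the unregularized Laplacian is singular. Any expression that uses $ L^{ - 1} $ therefore presumably relies on a regularized Laplacian or a pseudo-inverse.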

The LG kernel of two graphs ($ G_{1} ,G_{2} $) can be defined by the following equation.

$$ k_{\text{LG}} \left( {G_{1} ,G_{2} } \right) = \frac{{\left| {\left( {\frac{1}{2}S_{1}^{ - 1} + \frac{1}{2}S_{2}^{ - 1} } \right)^{ - 1} } \right|^{1/2} }}{{\left| {S_{1} } \right|^{1/4} \left| {S_{2} } \right|^{1/4} }} $$

(1)

where $ S_{1} = L_{1}^{ - 1} + \lambda I $ and $ S_{2} = L_{2}^{ - 1} + \lambda I $.

Here $ L_{1}^{ - 1} $ and $ L_{2}^{ - 1} $ are the inverses of the graph Laplacians of $ G_{1} $ and $ G_{2} $, and $ I $ is the identity matrix scaled by the parameter λ; these quantities are used to obtain the similarity between the graphs $ G_{1} $ and $ G_{2} $.
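Equation 1 can be evaluated directly for small graphs. The sketch below is a pure-Python illustration under stated assumptions: both graphs have the same number of vertices, and a small multiple `eta` of the identity is added to each Laplacian so that it is invertible (the unregularized Laplacian is singular). The function names and the default values of `eta` and `lam` are placeholders, not the implementation used in this study.

```python
def laplacian(adj):
    """Graph Laplacian L = D - A for a symmetric weighted adjacency matrix."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0.0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def add_identity(m, scale):
    """Return m + scale * I."""
    return [[x + (scale if i == j else 0.0) for j, x in enumerate(row)]
            for i, row in enumerate(m)]


def inverse_and_det(m):
    """Gauss-Jordan elimination with partial pivoting: (inverse, determinant)."""
    n = len(m)
    a = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        p = a[col][col]
        det *= p
        a[col] = [x / p for x in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0.0:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a], det


def lg_kernel(adj1, adj2, eta=0.1, lam=0.01):
    """Evaluate Eq. 1 with S = (L + eta*I)^{-1} + lam*I (eta is an assumption)."""
    def s_matrix(adj):
        l_inv, _ = inverse_and_det(add_identity(laplacian(adj), eta))
        return add_identity(l_inv, lam)

    s1_inv, det_s1 = inverse_and_det(s_matrix(adj1))
    s2_inv, det_s2 = inverse_and_det(s_matrix(adj2))
    mix = [[0.5 * (x + y) for x, y in zip(r1, r2)]
           for r1, r2 in zip(s1_inv, s2_inv)]
    _, det_mix = inverse_and_det(mix)
    # |M^{-1}| = 1 / |M|, so the numerator of Eq. 1 is (1 / det_mix) ** 0.5.
    return (1.0 / det_mix) ** 0.5 / (det_s1 ** 0.25 * det_s2 ** 0.25)


path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(lg_kernel(path, path), lg_kernel(path, triangle))
```

With this form, a graph compared with itself gives a value of 1 (up to rounding), and two different graphs give a value strictly between 0 and 1.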

Feature Space Laplacian Graph kernel (FLG): The FLG kernel is used to compare the structure of subgraphs at a single scale. FLG combines the information attached to the vertices with the graph Laplacian. The key idea of the FLG kernel is to transform the vertex space variables $ a_{1} ,a_{2} \ldots a_{n} $ into feature space variables $ b_{1} ,b_{2} \ldots b_{n} $, where $ b_{i} = \sum\nolimits_{j} t_{i,j} \left( {a_{j} } \right) $. If each $ t_{i,j} $ depends on $ j $ only through local and reordering-invariant properties of vertex $ v_{j} $, the resulting kernel is permutation invariant. The vertex space variables are the input variables from which the feature space representation of the graph is derived. Consider $ G_{1} ,G_{2} $ as two graphs with regularized Laplacians $ L_{1} $ and $ L_{2} $, let λ ≥ 0 be a parameter, and let $ \left( {\Phi_{1} , \ldots ,\Phi_{m} } \right) $ be a collection of $ m $ local vertex features, which define the feature mapping matrices in the FLG. The FLG kernel is defined as follows.

$$ k_{\text{FLG}} \left( {G_{1} ,G_{2} } \right) = \frac{{\left| {\left( {\frac{1}{2}S_{1}^{ - 1} + \frac{1}{2}S_{2}^{ - 1} } \right)^{ - 1} } \right|^{1/2} }}{{\left| {S_{1} } \right|^{1/4} \left| {S_{2} } \right|^{1/4} }} $$

(2)

where $ S_{1} = U_{1} L_{1}^{ - 1} U_{1}^{T} + \lambda I $ and $ S_{2} = U_{2} L_{2}^{ - 1} U_{2}^{T} + \lambda I $.

Here $ U_{1} $ and $ U_{2} $ are the feature mapping matrices with transposes $ U_{1}^{T} $ and $ U_{2}^{T} $, $ L_{1} $ and $ L_{2} $ are the Laplacian matrices, and $ I $ is the identity matrix scaled by the parameter λ. The major limitation of the FLG kernel is that it cannot consider graph structure at multiple different scales, which paved the way for the MLG kernel. The FLG kernel acts as the key component of the MLG kernel and is applied recursively to construct it.

Multiscale Laplacian Graph (MLG) Kernel: The MLG kernel for a graph (G) can be computed as follows:

(i) The graph (G) is divided into a large number of smaller subgraphs, and the FLG kernel is computed between any two subgraphs for the similarity calculation at a single scale.
(ii) A new kernel (FLG) is calculated between the vertices by placing the extracted subgraphs at a random vertex of the graph G.
(iii) Finally, a new FLG kernel is computed between the larger subgraphs of the graph (G) based on step (ii), and this process is repeated L times (once per scale).

The MLG kernel is thus constructed as follows:

Consider G as a graph with vertex set V, and let $ k $ be a positive semi-definite kernel on the vertex set V. For each vertex $ v \in V $ we have a nested sequence of L neighborhoods:

$$ v \in N_{1 } \left( v \right) \subseteq N_{2} \left( v \right) \subseteq \cdots \subseteq N_{L} \left( v \right) \subseteq V $$

(3)
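The nested neighborhoods of Eq. 3 can be illustrated with a breadth-first search. The sketch below assumes that $ N_{l}(v) $ consists of all vertices within $ l $ hops of $ v $; this hop-based definition is an assumption made for illustration, since the equation only requires the neighborhoods to be nested. The function name `nested_neighborhoods` is hypothetical.

```python
from collections import deque


def nested_neighborhoods(adj_list, v, levels):
    """Return [N_1(v), ..., N_L(v)], where N_l(v) holds vertices within l hops of v."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == levels:
            continue  # do not expand beyond the largest radius
        for w in adj_list[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return [{u for u, d in dist.items() if d <= l} for l in range(1, levels + 1)]


# Toy path graph 0 - 1 - 2 - 3 - 4 given as an adjacency list.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
neighborhoods = nested_neighborhoods(path_graph, 0, 3)
print(neighborhoods)
```

Each set in the returned list contains the previous one, which is the nesting property stated in Eq. 3.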

Consider $ G_{l} \left( v \right) $ as the subgraph corresponding to each $ N_{l} \left( v \right) $. From the above equation, the Multiscale Laplacian subgraph (MLS) kernels $ k_{1} , \ldots ,k_{L} :V \times V \to {\mathbb{R}} $ can be defined by calculating multiple FLG kernels on the vertex set V.

$$ k_{1 } \left( {v,v^{\prime}} \right) = k_{\text{FLG}}^{k} \left( {G_{1} \left( v \right),G_{1} \left( {v^{\prime}} \right)} \right) $$

(4)

Here $ k_{1 } $ is the FLG kernel ($ k_{\text{FLG}}^{k} $) generated from the base kernel $ k $. The base kernel $ k $ is used to boost the FLG kernel to a multi-scale kernel.

$$ k_{l } \left( {v,v^{\prime}} \right) = k_{\text{FLG}}^{{k_{l - 1} }} \left( {G_{l} \left( v \right),G_{l} \left( {v^{\prime}} \right)} \right) $$

(5)

where $ l = 2,3, \ldots ,L $, and $ k_{l } $ is generated from the kernel $ k_{l - 1} $.

Let $ {\mathcal{G}} $ be a collection of graphs such that all their vertices are members of an abstract vertex space V endowed with a symmetric positive semi-definite kernel $ k:V \times V \to {\mathbb{R}} $. Assume that the MLS kernels $ k_{1 } , \ldots ,k_{L} $ are defined as in Eqs. 4 and 5, both for pairs of subgraphs within the same graph and across pairs of different graphs. The MLG kernel between two graphs $ G_{1} ,G_{2} \in {\mathcal{G}} $ can then be defined as follows:

$$ k_{\text{MLG}} \left( {G_{1} ,G_{2} } \right) = k_{\text{FLG}}^{{k_{L} }} \left( {G_{1} ,G_{2} } \right) $$

(6)

In this study, to implement the MLG kernel, we generated Universal Dependencies (UD) along with the adjacency matrix of the bio-event sentences.

Universal dependencies: We applied the Stanford parser to generate the UD of the sentences [62]. The grammatical relations of UD are arranged in a hierarchy, rooted in the most generic relation, dependent. In this study, we applied UD to all event sentences to extract the typed relations across the sentence, especially those involving trigger words and entities.

Adjacency matrix: The generated UD of biomedical event sentences was used to create an adjacency matrix, to represent the association between words. An example UD generated and corresponding adjacency matrix for a sample sentence (PMCID: 1310901) is shown in Fig. 3a, b, respectively.
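The step from dependency relations to an adjacency matrix can be sketched as follows. The tokens and edges below are a hypothetical toy input written by hand (including the abbreviation IRF4 and the relation labels); they are not the parser output shown in Fig. 3. The function name `adjacency_from_dependencies` is likewise illustrative.

```python
def adjacency_from_dependencies(num_tokens, edges):
    """Build a symmetric 0/1 adjacency matrix from (head, dependent, label) triples.

    Heads and dependents are token indices; the relation label is ignored here
    because only the presence of a link is recorded.
    """
    adj = [[0] * num_tokens for _ in range(num_tokens)]
    for head, dependent, _label in edges:
        adj[head][dependent] = 1
        adj[dependent][head] = 1  # undirected graph, as assumed by the Laplacian
    return adj


# Hypothetical toy input (not actual parser output).
tokens = ["Down-regulation", "of", "IRF4", "expression"]
edges = [
    (0, 3, "nmod"),      # Down-regulation -> expression
    (3, 1, "case"),      # expression -> of
    (3, 2, "compound"),  # expression -> IRF4
]
adjacency = adjacency_from_dependencies(len(tokens), edges)
for token, row in zip(tokens, adjacency):
    print(f"{token:16s}", row)
```

Using token indices rather than token strings keeps repeated words in a sentence distinct.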

Subgraph mining

In the MLG kernel, the subgraph mining process is essential for scaling the event sentences at multiple levels. The aim of this graph kernel is to find the local structures that are critical at specific positions of the graph and to find global properties that roughly summarize the graph. To do so, the MLG kernel is defined as a graph kernel that considers structure at multiple scales by recursively comparing graphs through their subgraphs. The underlying procedure is that two graphs are compared by their subgraphs; in the next iteration, two subgraphs are compared by smaller subgraphs, and so on. The MLG kernel uses node features to capture the global structure and feature vectors induced by similarity scores to compare structures at multiple scales. The recursive approach compares the same subgraph pairs multiple times by calculating the similarity scores on smaller neighborhoods. In this study, we created the graph using Universal Dependencies (UD) along with the adjacency matrix. The subgraph mining was carried out using the following procedure: (i) assign the node degree to the entire graph-structured event sentence; (ii) construct the subgraphs from the large graph; (iii) design a larger subgraph for the event sentence; (iv) apply the low-rank approximation approach to all subgraphs and to each larger subgraph.

2.2.3 Ensemble classification

In the biomedical domain, ensemble classification plays a vital role in improving overall performance for tasks such as article classification [76, 77] and relation extraction [6, 78]. SVM with an ensemble learning approach efficiently learns multiple training models with low time complexity. In EnsembleSVM [67], a bootstrapping strategy is employed to repeatedly learn training models and to aggregate them into a single prediction model. In this study, we employed EnsembleSVM [67] to generate the ensemble models for the feature-based linear kernel and the MLG kernel and to merge them into a single classification model that categorizes the events efficiently. Using EnsembleSVM, we created models on bootstrap subsamples and trained ensembles of SVM models for the feature-based kernel and the MLG kernel, respectively. Figure 4 depicts the ensemble classification pipeline of our approach in detail.

The ensemble classification of our approach is described in Eq. 7:

$$ E_{k} = F_{k} + G_{k} $$

(7)

Here, $ E_{k} $, $ F_{k} $ and $ G_{k} $ are the kernel models in our classification problem: $ E_{k} $ is the ensemble model, $ F_{k} $ the feature-based kernel model and $ G_{k} $ the MLG (graph) kernel model. Using the validation set, we tuned various parameters with the grid search method during model generation. For the features, the character n-gram length was set to 3 and the prefix/suffix features were set to two characters. For the MLG kernel model, the parameters were finally set to radius 3, levels 4, eta 0.1, gamma 0.01 and threads 32. The tree parameter grow was set to 1 to grow by leaf radius, which allows the subgraphs to double in size at each level. We kept all these parameters at their default values during model development.
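One way to read Eq. 7 is as an element-wise sum of two precomputed kernel (Gram) matrices over the same set of training instances. This reading is an assumption made for illustration; the merging performed by EnsembleSVM in this study may differ. The function name `combine_kernels` and the numeric values are hypothetical.

```python
def combine_kernels(feature_kernel, graph_kernel):
    """Element-wise sum E = F + G of two Gram matrices of identical shape."""
    if len(feature_kernel) != len(graph_kernel) or any(
        len(row_f) != len(row_g) for row_f, row_g in zip(feature_kernel, graph_kernel)
    ):
        raise ValueError("kernel matrices must have the same shape")
    return [
        [f + g for f, g in zip(row_f, row_g)]
        for row_f, row_g in zip(feature_kernel, graph_kernel)
    ]


# Made-up 2 x 2 Gram matrices for two training sentences.
F = [[1.0, 0.25], [0.25, 1.0]]  # feature-based kernel values
G = [[1.0, 0.5], [0.5, 1.0]]    # graph kernel values
E = combine_kernels(F, G)
print(E)
```

The sum of two positive semi-definite kernel matrices is again positive semi-definite, so a combined matrix of this form remains a valid kernel for an SVM.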

2.3 Argument detection

After the identification of events and triggers, the next step is to extract arguments, which describe the events. To extract arguments for the events from the text efficiently and accurately, we used the lexico-syntactic pattern-based approach with semantic role labeling [79] which is briefly introduced below.

2.3.1 Lexico-syntactic pattern and semantic role-based rules engine

Lexico-syntactic patterns [80] are generalized linguistic structures for extracting related concepts and the relationships between concepts from text. Here the trigger words and propositions (synonym, subject, and verb) are the concepts and relationships used to detect the event arguments. Lexico-syntactic patterns were used to structure the ontology of the words. Motivated by the work of Hung et al. [79], we employed lexico-syntactic patterns to identify the {THEME, CAUSE} of the events. In the current study, arguments of the events were identified by combining lexico-syntactic patterns and semantic matching in three steps, namely contextual patterns, semantic role labeling, and event-specific argument structure. The list of bio-event cues and the trigger word list were used to match the arguments using pattern matching and role labeling. In the event-specific argument structure phase, post-processing rules were incorporated, such as emphasizing event certainty and co-reference mentions. Our lexico-syntactic pattern-based rule engine is depicted in detail in Fig. 5. Each step of the rule engine is briefly explained below with an example.

Contextual patterns

Contextual patterns (CP) utilize domain-specific information such as a trigger word list and tagged entities to annotate possible event arguments. The contextual patterns were employed with two components: subclass and complex. Subclass patterns were used to detect and annotate the trigger words and tagged entities using the pattern keywords (VP list, VP, NP) from the dependency-parsed sentences. These patterns were also used to detect the prepositions (to, belong, with, without, etc.) between the trigger words and proteins. Tagged entities are represented in the sentence as ‘protein 1’, ‘protein 2’, etc. For example: interact with Protein 1 and Protein 2. Complex patterns were employed to identify the verb keywords that indicate multiple arguments of the same event. For example: protein 1 interacts with protein 2, which catalyzes protein 3 and causes protein 4 downregulation. This sentence contains multiple events, which are represented by the cue words ‘interacts with’, ‘catalyzes’ and ‘causes’. A full example of contextual pattern identification (PMID: 9973520) is shown below.

Rule 1:

Scenario: Identifying the subject and verb in a given sentence.

Original sentence: Cross-linking of CD44 on rheumatoid synovial cells up-regulates VCAM-1

After applying CP: NP → Cross-linking; VP → up-regulates; etc.

Contextual patterns identify and annotate the possible trigger words and entities by utilizing the trigger word list in the sentence, which will be processed further by applying semantic role labeling techniques described below.
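The subclass and complex patterns described above can be sketched as a single regular expression. The example below is a minimal illustration in Python rather than the Java Regex module used in this work; the trigger list, the preposition list, the `protein N` entity format and the function name `match_contextual_patterns` are assumptions made for the sketch, not the actual lists or rules of the system.

```python
import re

# Hypothetical trigger and preposition lists; the real lists are not reproduced here.
TRIGGERS = ["up-regulates", "interacts", "catalyzes", "causes"]
PREPOSITIONS = ["with", "to", "of", "without"]

# Tagged entities are assumed to appear as "protein 1", "protein 2", ...
ENTITY = r"protein\s?\d+"
TRIGGER = "|".join(re.escape(t) for t in TRIGGERS)
PREP = "|".join(PREPOSITIONS)

# Trigger word, optional preposition, then a tagged entity.
PATTERN = re.compile(
    rf"\b(?P<trigger>{TRIGGER})\s+(?:(?P<prep>{PREP})\s+)?(?P<entity>{ENTITY})",
    re.IGNORECASE,
)

def match_contextual_patterns(sentence):
    """Return (trigger, preposition, entity) triples found in the sentence."""
    return [
        (m.group("trigger"), m.group("prep"), m.group("entity"))
        for m in PATTERN.finditer(sentence)
    ]

sentence = ("protein1 interacts with protein 2 which catalyzes protein 3 "
            "and causes protein 4 downregulation")
for triple in match_contextual_patterns(sentence):
    print(triple)
```

Each returned triple corresponds to one candidate event in a multi-event sentence; the preposition slot is `None` when the trigger is followed directly by the entity.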

Semantic role labeling

Semantic role labeling (SRL) is a process in natural language processing that determines the relationship between the verb and the syntactic structure of a sentence [79]. In our approach, semantic role labeling was used to search for and determine the association between protein entities and trigger words in a sentence. It involves detecting the semantic arguments associated with the predicate or verb of a sentence and classifying them into their specific roles. Here we denote the sentence by A, the verb by VP, a modal verb or preposition by M, the participants in the sentence by P, and the subject of the sentence by S. From these steps, we derived a semantic role labeling approach for our event argument construction process.

For example, in the sentence (PMID: 9973520) “Cross-linking of CD44 on rheumatoid synovial cells up-regulates VCAM-1”, the trigger word was “up-regulates”, the arguments were CD44, VCAM-1 and rheumatoid synovial cells, and they participated in the event positive regulation. This was identified by applying a set of rules; the procedure is given in detail in Table 2.

Table 2 Rules used for detecting the arguments of the events


Event-specific argument structure

After extracting event arguments using SRL based patterns, we incorporated two post-processing rules as event-specific argument structures to raise the performance of our event argument detection approach. Event-specific argument structure was used to differentiate simple and complex events that specify the arguments directly or indirectly in the tagged sentences.

(i) Searching for a connective pronoun such as “it” in the sentence that refers to an entity name (protein), as in the example below.

Examples of the generated rules (here D0 and D1 are dependencies, and ARG1 and ARG2 are arguments of the particular word in the sentence):

it_D0_Arg1_Arg2

both_D0_D1_Arg1_Arg2

that_D0_D1_Arg2_Arg1

Example:

PMID: 10209041

Expression of GrpL is restricted to hematopoietic tissues and <Keyword> it </Keyword> is distinguished from Grb2 by having a proline-rich region.

In the above example the pronoun ‘it’ denotes the protein GrpL, and it participated in the event ‘Expression’.

(ii) Searching for specific keywords such as ‘certainly’, ‘highly’ and ‘confirm’ that are co-mentioned with trigger words such as ‘activation’ or ‘up-regulation’, so that an event-specific meaningful sentence can be identified rather than a generalized one. This rule is also used to identify, among multiple trigger words in the same sentence, the specific trigger word that describes the event accurately.

Examples of the generated rules:

highly_D0_Arg1, probably_D0_D1_Arg1, certainly_D0_D1_Arg2, confirm_D0_Arg1, etc.

<Keyword>confirm</Keyword> <D0>that</D0> <ARG1>binding</ARG1> <D1>of</D1> endogenous <ARG2>NFkappaB</ARG2> and <ARG3>AP1</ARG3>.

Example:

PMID: PM9190901

We <Keyword> confirm </Keyword> that binding of endogenous NFkappaB and AP1 is induced following PMA/ionomycin treatment of T cells.

In the above example, the keyword ‘confirm’ described the certainty of the event ‘binding’.
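The two post-processing rules can be illustrated with a short sketch. It is written in Python for brevity and is not the rule engine used in this work: resolving a pronoun to the nearest preceding protein mention, the five-token window around a certainty cue, the small trigger set, and the function names `resolve_pronoun` and `certainty_marked_triggers` are all simplifying assumptions.

```python
import re

CERTAINTY_CUES = {"certainly", "highly", "confirm", "probably"}  # cues named in the text
TRIGGERS = {"binding", "activation", "up-regulation"}  # illustrative subset only

def resolve_pronoun(sentence, proteins, pronoun="it"):
    """Rule (i): map each pronoun occurrence to the closest protein mentioned before it."""
    mentions = sorted(
        (m.start(), p)
        for p in proteins
        for m in re.finditer(rf"\b{re.escape(p)}\b", sentence)
    )
    antecedents = []
    for m in re.finditer(rf"\b{re.escape(pronoun)}\b", sentence):
        preceding = [p for start, p in mentions if start < m.start()]
        antecedents.append(preceding[-1] if preceding else None)
    return antecedents

def certainty_marked_triggers(sentence, window=5):
    """Rule (ii): pair each certainty cue with trigger words within `window` tokens."""
    tokens = [t.strip(".,;()").lower() for t in sentence.split()]
    hits = []
    for i, tok in enumerate(tokens):
        if tok not in CERTAINTY_CUES:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        hits.extend((tok, tokens[j]) for j in range(lo, hi) if tokens[j] in TRIGGERS)
    return hits

s1 = ("Expression of GrpL is restricted to hematopoietic tissues and it "
      "is distinguished from Grb2 by having a proline-rich region.")
s2 = ("We confirm that binding of endogenous NFkappaB and AP1 is induced "
      "following PMA/ionomycin treatment of T cells.")
print(resolve_pronoun(s1, ["GrpL", "Grb2"]))   # ['GrpL']
print(certainty_marked_triggers(s2))           # [('confirm', 'binding')]
```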

Analysis of the training data was used to build the lexico-syntactic pattern-based rule engine that detects the participating themes in the events. We developed a pattern matching module using Java Regex [22,23,24], coupled with the above process, to detect the arguments in the event classes.

3 Results and discussion

3.1 Dataset

For the first time, BioNLP-ST-2009 [16] introduced three tasks based on the GENIA corpus [20] for the detection of core events, recognition of event arguments and negation/speculation detection. In BioNLP-ST-2011 [17], the tasks were expanded with resources to capture more text and event types. In BioNLP-ST-2011, the GENIA Event extraction (GE) task has been kept and augmented with three focused event tasks, namely (i) epigenetic and post-translational modification (EPI), (ii) bacteria biotope (BB) and bacteria interaction (BI) and (iii) infectious diseases (ID) [17]. Application domains were further expanded in BioNLP-ST-2013 [18] while keeping the GE and BB; the additional tasks were cancer genetics (CG), gene regulation ontology (GRO), and pathway curation (PC).

To assess the performance of our approach, we employed four corpora: three from the BioNLP shared tasks (BioNLP-09 [16], BioNLP-11 [17], BioNLP-13 [18]) and one other standard corpus, GENIA-MK (Meta-knowledge) [21], which is currently available and widely used for event extraction tasks. All four corpora were used to train and test the models of our approach. The statistics of the three BioNLP-ST corpora and the GENIA-MK corpus are presented in Table 3.

Table 3 Corpus statistics (Abs—Abstract, Full—Full text articles)


3.2 Evaluation metrics

Our event extraction system was evaluated using the standard metrics precision (P), recall (R), and F-score (F). The shared task online evaluation server was used to evaluate the BioNLP-ST (2009, 2011, 2013) results. The results reported for our system are based on the approximate span matching and approximate string matching evaluation measures. For the GENIA-MK corpus, we used 10-fold cross validation: the corpus was divided into 10 subsets and, in every run, 90% of the data was used as the training set and the remaining 10% as the test set.
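For reference, the metrics and the cross-validation split can be written out as follows. This is a minimal sketch, assuming the F-score is the balanced harmonic mean of precision and recall and that folds are contiguous blocks of documents; the function names and the example counts are illustrative and do not come from the reported experiments.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and balanced F-score from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds of near-equal size."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size

# Made-up counts, used only to show the calculation.
p, r, f = precision_recall_f(tp=8, fp=2, fn=2)
print(round(p, 2), round(r, 2), round(f, 2))

for train, test in k_fold_splits(100):
    print(len(train), len(test))  # 90 training items and 10 test items per run
```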

3.3 Evaluation results

We trained and tested our approach on the BioNLP-ST 2009, 2011 and 2013 corpora and the GENIA-MK corpus with the feature-based linear kernel, the MLG kernel, and the ensemble kernel. The following experiments were carried out to assess the performance of our approach.

First, we applied the ensemble feature-based approach to the BioNLP-ST-2009 corpus (Table 4). As Table 4 shows, the feature-based approach results in high precision and low recall. Next, we applied the ensemble MLG kernel-based approach to the corpus; it results in high precision and high recall and moderately increases the F-score. Finally, we combined both ensemble kernels, which takes the benefits of both the feature-based and MLG kernel-based output models, and attained a competitive F-score. Likewise, we applied the above methods to the BioNLP-ST-2011 and BioNLP-ST-2013 corpora.

Table 4 Results on BioNLP-ST Corpora


We applied the same approach to the GENIA-MK corpus (Table 5). Experimental results show that our approach attained better results on GENIA-MK than on the BioNLP-ST corpora. Figure 6 depicts the Receiver Operating Characteristic (ROC) curves of the three kernels for all four corpora.

Table 5 Results on GENIA-MK Corpus


To classify the events individually, every event type needs a variety of features that reflect its diverse context and linguistic characteristics. For example, compared to events such as gene expression, transcription and localization, the regulation events need more token-based, concept-based and syntactic information. By implementing a feature-based approach in our study, we modeled the higher complexity associated with their phrasal and linguistic contexts and consequently prepared our model to identify the individual events. Next, the feature-based approach was coupled with the MLG kernel, which takes advantage of both feature-based and graph-kernel-based approaches, and generated state-of-the-art performance in the extraction of individual classes of events. Table 6 shows the results for individual classes of events in the four corpora obtained with our ensemble approach.

Table 6 Results for individual classes of events on BioNLP-ST 2009, 2011, 2013 and GENIA-MK


Next, we compare our approach with other state-of-the-art approaches developed on the BioNLP-ST 2009, 2011, 2013 and GENIA-MK corpora. Comparisons show that our proposed approach performs better than other state-of-the-art approaches. Tables 7 and 8 show the comparisons.

Table 7 Comparative analysis on the BioNLP-ST corpora (BioNLP-ST-09, BioNLP-ST-11, and BioNLP-ST-13) based on F-score (%)


Table 8 Comparative analysis on the GENIA-MK Corpus in terms of F-score


3.4 Discussion

In our methodology, we implemented a feature-based linear kernel, the MLG kernel, and lexico-syntactic pattern-based approaches to extract biomedical events. Some interesting findings from our approach are discussed below. The baseline feature-based linear kernel successfully captured grammatical, syntactical, morphological, orthographical, and sentence-level global information. Morphological and orthographical features were used to describe the structure of the words in a given sentence. Parsing features were employed to discover the grammatical and syntactical expressions of the event sentences. Packing these features and methodologies into the feature-based linear kernel gave a strong baseline for extracting events from the biomedical literature.

The MLG kernel was used to compare the structure of graphs at multiple scales. Mining subgraphs is an important phase in the MLG kernel because each generated subgraph is compared through its constituent sub-subgraphs. The MLG kernel first accepts a universal dependency structure, in which a direct dependency path connects the trigger words and named entities. The MLG kernel combines the baseline graph Laplacian kernel with feature representations originating from nested neighborhoods. Finally, the MLG kernel considers both overall and local graph structures to learn similarities at multiple levels. Considering all this, we believe that by employing the MLG kernel our system was able to capture not only the topological relationships between the individual event nodes but also the topological relationships between the subgraphs.

Example:

PMCID 1310901

Sentence: Down regulation of interferon regulatory factor 4 gene expression in leukemic cells

In the above example, words in the sentences were converted to universal dependencies and then to the adjacency matrix. The MLG kernel first assigns the node degrees based on UD and adjacency matrix to the graph generated for the sentence. In our case, in this example, words like expression and factor were assigned with high node degree.
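The conversion from dependencies to an adjacency matrix and node degrees can be sketched as below. The edge list is hypothetical: it was chosen by hand for illustration and is not the output of the dependency parser used in this work, and the function names are assumptions.

```python
tokens = ["Down", "regulation", "of", "interferon", "regulatory", "factor",
          "4", "gene", "expression", "in", "leukemic", "cells"]

# Hypothetical (head, dependent) pairs for illustration; not actual parser output.
edges = [
    ("regulation", "Down"), ("regulation", "expression"),
    ("expression", "of"), ("expression", "factor"), ("expression", "gene"),
    ("expression", "cells"),
    ("factor", "interferon"), ("factor", "regulatory"), ("factor", "4"),
    ("cells", "in"), ("cells", "leukemic"),
]

def adjacency_matrix(tokens, edges):
    """Build a symmetric 0/1 adjacency matrix from (head, dependent) pairs."""
    index = {tok: i for i, tok in enumerate(tokens)}
    adj = [[0] * len(tokens) for _ in tokens]
    for head, dep in edges:
        i, j = index[head], index[dep]
        adj[i][j] = adj[j][i] = 1  # the dependency graph is treated as undirected
    return adj

def node_degrees(tokens, adj):
    """Degree of each token, i.e. the row sum of the adjacency matrix."""
    return {tok: sum(row) for tok, row in zip(tokens, adj)}

adj = adjacency_matrix(tokens, edges)
degrees = node_degrees(tokens, adj)
for tok in sorted(degrees, key=degrees.get, reverse=True)[:3]:
    print(tok, degrees[tok])
```

With these assumed edges, "expression" and "factor" receive the highest degrees, in line with the observation above.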

In general, the graph structure is captured at multiple scales in MLG. This is achieved by increasing the depth of the neighborhood vertices in the graph. In addition, MLG focuses on capturing the neighborhood similarity among the vertices and uses this similarity score to induce the feature vectors. The current study exploits the above technique in which the biomolecular event sentence is searched at multiple scales for finding the relations between events and the target proteins using the graph generated from the corresponding adjacency matrix of the sentence. An interesting connection to be noted is that the cue words like gene and expression, regulation and factor, leukemia and cells were connected in the graph. In the following steps, a subgraph mining from the sentence graph followed by the building of larger subgraphs was performed. As a result of this step, words like interferon, regulatory, factor, gene, expression are added into a single subgraph. So, we strongly believe that our subgraph mining based MLG kernel played an important role in capturing the key information about the biomedical event sentences.
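The two ingredients mentioned here, the graph Laplacian and neighborhoods of increasing depth, can be illustrated on a toy graph. This is only a sketch of those building blocks under the standard definition L = D - A; it does not implement the MLG kernel itself, and the four-vertex path graph and the function names are assumptions made for the example.

```python
from collections import deque

def laplacian(adj):
    """Combinatorial graph Laplacian L = D - A for a 0/1 adjacency matrix."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]

def neighbourhood(adj, centre, radius):
    """Vertices within `radius` hops of `centre`, found by breadth-first search."""
    depth = {centre: 0}
    queue = deque([centre])
    while queue:
        u = queue.popleft()
        if depth[u] == radius:
            continue  # do not expand beyond the requested radius
        for v, connected in enumerate(adj[u]):
            if connected and v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return sorted(depth)

# Toy path graph 0 - 1 - 2 - 3 (not derived from any corpus sentence).
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

print(laplacian(adj))
for radius in (1, 2, 3):
    # Growing the radius yields nested neighbourhoods around vertex 0.
    print(radius, neighbourhood(adj, 0, radius))
```

Increasing the radius gives the nested subgraphs around a vertex, which is the sense in which the graph is examined at multiple scales.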

The association among the subgraphs for complex event extraction using MLG kernel is represented in Fig. 7 for a sample sentence from PMID 1335418. From the sentence, the MLG kernel first detects the small subgraphs in level one as entity names and event trigger words (For example, cAMP and accumulation). In level two, the kernel identifies the relationship between the trigger word and the corresponding proteins by accumulating multiple subgraphs (activation, cells, jkat, protein, kinase, cells). Finally, in level three the larger subgraphs were mined, thereby identifying the complex event. The repeated subgraph mining process was done until the low-rank approximation was observed to improve the classification accuracy.

Example:

PMID: 1335418

We have earlier found that in Jurkat cells activation of protein kinase C (PKC) enhances the cyclic adenosine monophosphate (cAMP) accumulation induced by adenosine receptor stimulation or activation of Gs.

Next, in argument detection, we employed lexico-syntactic based semantic role labeling and contextual pattern-based rules to extract the event arguments efficiently. Lexico-syntactic patterns were used to detect domain-specific ontology-based concepts and relationships effectively. In the event extraction task, lexico-syntactic patterns combined with semantic role labeling require significantly less time than plain lexico-syntactic patterns. Examples are given in detail in Sect. 2.3.1. A few interesting advantages of using lexico-syntactic patterns for event argument detection are illustrated in the following examples.

Example 1:

PMID: 10330189

In response to activation of the Wnt signaling pathway, beta-catenin accumulates in the nucleus, where <Keyword> it </Keyword> cooperates with LEF/TCF (for lymphoid enhancer factor and T-cell factor) transcription factors to activate gene expression.

In the above example 1 the pronoun “it” denotes the protein beta-catenin, and it participated in the events “gene expression” and “transcription”.

Example 2:

PMID: 10087185

Induction of NFkappaB is a <Keyword>highly</Keyword> regulated process requiring Phosphorylation.

In the above example 2, the keyword “highly” denoted the certainty of the event “Phosphorylation”.

The event-specific argument structure based syntactic rules were applied after contextual patterns and semantic role labeling to detect arguments. The event-specific argument structures-based rules acted as post-processing and improved the performance of the argument detection phase.

Even though our system performs well, it has some limitations. The major source of errors in the argument detection phase concerns events containing multiple arguments. Events with more than three arguments were difficult to extract. For example, in the sentence “Leukotriene B4 stimulates c-fos and c-jun gene transcription and AP-1 binding activity in human monocytes” (PMID: 1313226), the regulation event has several arguments and, at the same time, takes other events as arguments.

4 Conclusions and future enhancements

In this paper, we deployed a hybrid system that combines ensemble feature-based and graph-based kernels with lexico-syntactic patterns to extract biomedical events from the literature. Our Multiscale Laplacian Graph (MLG) kernel-based approach can detect the topological relationships between event nodes at multiple scales and identifies the associations among the subgraphs for complex events. To the best of our knowledge, we are the first to introduce the MLG kernel for the event extraction task. Since features play a crucial role in supervised machine learning, especially in event extraction, we used a wide variety of features, represented broadly as token-based, sentence-based, parsing and domain-specific features, to generate a feature-based kernel. Finally, we combined both ensemble kernels to generate a robust event classifier. In addition, in the argument detection phase we employed lexico-syntactic based semantic role labeling and a contextual pattern-based rule engine to extract the event arguments. We incorporated contextual patterns, semantic role labeling, and event-specific argument structure to detect the domain-specific ontology-based concepts and relationships effectively. In the future, we plan to employ automatic feature extraction approaches, advanced universal dependencies and different coefficient pairs for kernel ensembling to extract events from the literature, and we will apply this system to various biological relation extraction tasks such as Chemical Induced Disease (CID), Disease-Drug Interactions (DDIs) and Protein–Protein Interactions (PPIs).

Data availability

The link of source code and generated models is available at: http://biominingbu.org/bioevent_extraction/.

References

Gonzalez GH, Tahsin T, Goodale BC, Greene AC, Greene CS (2015) Recent advances and emerging applications in text and data mining for biomedical discovery. Brief Bioinform 17(1):33–42
Cohen AM, Hersh WR (2005) A survey of current work in biomedical text mining. Brief Bioinform 6:57–71
Jesús Naveja J, Dueñas-González A, Medina-Franco JL (2016) Drug repurposing for epigenetic targets guided by computational methods. In: Medina-Franco José L (ed) Epi-informatics discovery and development of small molecule epigenetic drugs and probes. Academic Press, Cambridge, pp 327–357
Henry S, McInnes BT (2017) Literature based discovery: models, methods, and trends. J Biomed Inform 74:20–32
Bundschus M, Dejori M, Stetter M, Tresp V, Kriegel HP (2008) Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 9(1):207
Murugesan G, Abdulkadhar S, Natarajan J (2017) Distributed smoothed tree kernel for protein–protein interaction extraction from the biomedical literature. PLoS ONE 12(11):e0187379
Bhasuran B, Natarajan J (2018) Automatic extraction of gene–disease associations from literature using joint ensemble learning. PLoS ONE 13(7):e0200699
Panyam NC, Verspoor K, Cohn T, Ramamohanarao K (2018) Exploiting graph kernels for high performance biomedical relation extraction. J Biomed Semantics 9(1):7
Zhou H, Ning S, Yang Y, Liu Z, Lang C, Lin Y (2018) Chemical-induced disease relation extraction with dependency information and prior knowledge. J Biomed Inform 84:171–178
Rios A, Kavuluru R, Lu Z (2018) Generalizing biomedical relation classification with neural adversarial domain adaptation. Bioinformatics 26(1):9
Vanegas JA, Matos S, Gonzalez F, Oliveira JL (2015) An overview of biomolecular event extraction from scientific documents. Comput Math Methods Med 2015:571381
Ananiadou S, Pyysalo S, Tsujii JI, Kell DB (2010) Event extraction for systems biology by text mining the literature. Trends Biotechnol 28(7):381–390
Patumcharoenpol P, Doungpan N, Meechai A, Shen B, Chan JH, Vongsangnak W (2016) An integrated text-mining framework for metabolic interaction network reconstruction. PeerJ 4:e1811
Nawaz R, Thompson P, Ananiadou S (2013) Negated bio-events: analysis and identification. BMC Bioinformatics 14(1):14
Wang X, McKendrick I, Barrett I, Dix I, French T, Tsujii JI, Ananiadou S (2011) Automatic extraction of angiogenesis bioprocess from text. Bioinformatics 27(19):2730–2737
Kim JD, Ohta T, Pyysalo S, Kano Y, Tsujii J (2009) Overview of BioNLP’09 shared task on event extraction. In: Proceedings of BioNLP’09 shared task workshop, pp 1–9
Kim JD, Wang Y, Takagi T, Yonezawa A (2011) Overview of Genia event task in BioNLP shared task 2011. In: Proceedings of BioNLP shared task 2011 workshop, pp 7–15
Nedellec C, Bossy R, Kim JD, Kim JJ, Ohta T, Pyysalo S, Zweigenbaum P (2013) Overview of BioNLP shared task 2013. In: Proceedings of BioNLP shared task 2013 workshop, pp 1–7
Delėger L, Bossy R, Chaix E, Ba M, Ferrė A, Bessieres P, Nėdellec C (2016) Overview of the bacteria biotope task at bionlp shared task 2016. In: Proceedings of the 4th BioNLP shared task workshop 2016, pp 12–22
Kim JD, Ohta T, Tateisi Y, Tsujii JI (2003) GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics 19:180–182
Thompson P, Nawaz R, McNaught J, Ananiadou S (2011) Enriching a biomedical event corpus with meta-knowledge annotation. BMC Bioinformatics 12(1):393
Zerva C, Batista-Navarro R, Day P, Ananiadou S (2017) Using uncertainty to link and rank evidence from biomedical literature for model curation. Bioinformatics 33(23):3784–3792
Le Minh Q, Truong SN, Bao QH (2011) A pattern approach for biomedical event annotation. In: Proceedings of the BioNLP shared task 2011 workshop, pp 149–150
Kilicoglu H, Bergler S (2009) Syntactic dependency-based heuristics for biological event extraction. In: Proceedings of the workshop on current trends in biomedical natural language processing: shared task, pp 119–127
Liu X, Bordes A, Grandvalet Y (2013) Biomedical event extraction by multi-class classification of pairs of text entities. In: BioNLP shared task 2013 workshop, pp 45–49
Zhou D, He Y (2011) Biomedical events extraction using the hidden vector state model. Artif Intell Med 53(3):205–213
Li C, Liakata M, Rebholz-Schuhmann D (2013) Biological network extraction from scientific literature: state of the art and challenges. Brief Bioinform 15(5):856–877
Zhou D, Zhong D, He Y (2014) Event trigger identification for biomedical events extraction using domain knowledge. Bioinformatics 30(11):1587–1594
Lamurias A, Rodrigues MJ, Clarke LA, Couto FM (2016) Extraction of regulatory events using kernel-based classifiers and distant supervision. In: Proceedings of the 4th BioNLP shared task workshop, pp 88–92
Wang A, Wang J, Lin H, Zhang J, Yang Z, Xu K (2017) A multiple distributed representation method based on neural network for biomedical event extraction. BMC Med Inform Decis Mak 17(3):171
He X, Li L, Liu Y, Yu X, Meng J (2017) A two-stage biomedical event triggers detection method integrating feature selection and word embeddings. In: IEEE/ACM transactions on computational biology and bioinformatics
Jiang N, Rong W, Nie Y, Shen YK, Xiong Z (2017) Biological event trigger identification with noise contrastive estimation. IEEE/ACM Trans Comput Biol Bioinform 15:1549–1559
Bjorne J, Heimonen J, Ginter F, Airola A, Pahikkala T, Salakoski T (2009) Extracting complex biological events with rich graph-based feature sets. In: Proceedings of the workshop on current trends in biomedical natural language processing: shared task, pp 10–18
Bjorne J, Salakoski T (2013) TEES 2.1: automated annotation scheme learning in the BioNLP 2013 shared task. In: Proceedings of the BioNLP shared task 2013 workshop, pp 16–25
Hakala K, Van Landeghem S, Salakoski T, Van de Peer Y, Ginter F (2013) EVEX in ST’13: Application of a large-scale text mining resource to event extraction and network construction. In: Proceedings of the BioNLP shared task 2013 workshop, pp 26–34
Riedel S, McClosky D, Surdeanu M, McCallum A, Manning CD (2011) Model combination for event extraction in BioNLP 2011. In: Proceedings of the BioNLP shared task 2011 workshop, pp 51–55
Lever J, Jones SJ (2016) VERSE: event and relation extraction in the BioNLP 2016 shared task. In: Proceedings of the 4th BioNLP shared task workshop, pp 42–49
Bjorne J, Salakoski T (2015) TEES 2.2: biomedical event extraction for diverse corpora. BMC Bioinform 16(16):4
Liu H, Komandur R, Verspoor K (2011) From graphs to events: a subgraph matching approach for information extraction from biomedical text. In: Proceedings of the BioNLP shared task 2011 workshop, pp 164–172
Liu H, Hunter L, Kešelj V, Verspoor K (2013) Approximate subgraph matching-based literature mining for biomedical events and relations. PLoS ONE 8(4):e60954
Liu H, Verspoor K, Comeau DC, MacKinlay AD, Wilbur WJ (2015) Optimizing graph-based patterns to extract biomedical events from the literature. BMC Bioinform 16(16):S2
Luo Y, Uzuner Ö, Szolovits P (2016) Bridging semantics and syntax with graph algorithms—state-of-the-art of extracting biomedical relations. Brief Bioinform 18(1):160–178
Luo Y, Sohani AR, Hochberg EP, Szolovits P (2014) Automatic lymphoma classification with sentence subgraph mining from pathology reports. J Am Med Inform Assoc 21(5):824–832
Luo Y, Xin Y, Hochberg E, Joshi R, Uzuner O, Szolovits P (2015) Subgraph augmented non-negative tensor factorization (SANTF) for modeling clinical narrative text. J Am Med Inform Assoc 22(5):1009–1019
Luo Y, Uzuner O (2014) Semi-supervised learning to identify UMLS semantic relations. In: AMIA summits on translational science proceedings, p 67
Zhang Y, Lin H, Yang Z, Wang J, Li Y (2013) Biomolecular event trigger detection using neighborhood hash features. J Theor Biol 7(318):22–28
Roberts K, Rink B, Harabagiu S (2010) Extraction of medical concepts, assertions, and relations from discharge summaries for the fourth i2b2/VA shared task. In: Proceedings of the 2010 i2b2/VA workshop on challenges in natural language processing for clinical data, i2b2 2010, Boston, MA, USA
Bùi QC (2012) Relation extraction methods for biomedical literature
Quirk C, Choudhury P, Gamon M, Vanderwende L (2011) Msr-nlp entry in bionlp shared task 2011. In: Proceedings of the BioNLP shared task 2011 workshop, pp 155–163
Dongliang X, Jingchang P, Bailing W (2017) Multiple kernels learning-based biological entity relationship extraction method. J Biomed Semant 8(1):38
Nikolentzos G, Siglidis G, Vazirgiannis M (2019) Graph Kernels: A Survey. arXiv preprint arXiv:1904.12218
Panyam NC, Verspoor K, Cohn T, Ramamohanarao K (2018) Exploiting graph kernels for high performance biomedical relation extraction. J Biomed Semant 9(1):7
Kondor R, Pan H (2016) The multiscale Laplacian graph kernel. In: Advances in neural information processing systems, pp 2990–2998
McClosky D, Surdeanu M, Manning CD (2011) Event extraction as dependency parsing. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Vol 1, pp 1626–1635
Riedel S, McCallum A (2011) Robust biomedical event extraction with dual decomposition and minimal domain adaptation. In: Proceedings of the BioNLP shared task 2011 workshop 2011, pp 46–50
Munkhdalai T, Namsrai OE, Ryu KH (2015) Self-training in significance space of support vectors for imbalanced biomedical event data. BMC Bioinform 16(7):S6
Li L, Liu S, Qin M, Wang Y, Huang D (2016) Extracting biomedical event with dual decomposition integrating word embeddings. IEEE/ACM Trans Comput Biol Bioinform 13(4):669–677
Wang Y, Wang J, Lin H, Tang X, Zhang S, Li L (2018) Bidirectional long short-term memory with CRF for detecting biomedical event trigger in FastText semantic space. BMC Bioinform 19(20):507
Baldridge J (2005) The OpenNLP project. https://opennlp.apache.org/index.html. Accessed March 2015
Liu H, Christiansen T, Baumgartner WA, Verspoor K (2012) BioLemmatizer: a lemmatization tool for morphological processing of biomedical text. J Biomed Semant 3(1):3
Pado S, Lapata M (2007) Dependency-based construction of semantic space models. Comput Linguist 33(2):161–199
De Marneffe MC, MacCartney B, Manning CD (2006) Generating typed dependency parses from phrase structure parses. In: Proceedings of LREC, pp 449–454
Sagae K, Tsujii JI (2007) Dependency parsing and domain adaptation with LR models and parser ensembles. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLPCoNLL)
Bhasuran B, Murugesan G, Abdulkadhar S, Natarajan J (2016) Stacked ensemble combined with fuzzy matching for biomedical named entity recognition of diseases. J Biomed Inform 31(64):1–9
Lee S, Kim D, Lee K, Choi J, Kim S, Jeon M, Lim S, Choi D, Kim S, Tan AC, Kang J (2016) BEST: next-generation biomedical entity search tool for knowledge discovery from biomedical literature. PLoS ONE 11(10):e0164680
Murugesan G, Abdulkadhar S, Bhasuran B, Natarajan J (2017) BCC-NER: bidirectional, contextual clues named entity tagger for gene/protein mention recognition. EURASIP J Bioinform Syst Biol 2017(1):7
Claesen M, De Smet F, Suykens JA, De Moor B (2014) EnsembleSVM: a library for ensemble learning using support vector machines. J Mach Learn Res 15(1):141–145
Bjorne J, Salakoski T (2011) Generalizing biomedical event extraction. In: Proceedings of the BioNLP shared task 2011 workshop, pp 183–191
Li Q, Ji H, Huang L (2013) Joint event extraction via structured prediction with global features. In: ACL, vol 1, pp 73–82
Campos D, Bui QC, Matos S, Oliveira JL (2014) TrigNER: automatically optimized biomedical event trigger recognition on scientific documents. Source Code Biol Med 9(1):1
Campos D, Matos S, Oliveira JL (2013) Gimli: open source and high-performance biomedical name recognition. BMC Bioinform 14(1):54
Boutet E, Lieberherr D, Tognolli M, Schneider M, Bairoch A (2007) Uniprotkb/swiss-prot. In: Edwards D (ed) Plant bioinformatics. Humana Press, Totowa, pp 89–112
Dunning T (2012) Finding structure in text, genome and other symbolic sequences. arXiv preprint arXiv:1207.1847
Naughton M, Stokes N, Carthy J (2008) Investigating statistical techniques for sentence-level event classification. In: Proceedings of the 22nd international conference on computational linguistics, vol 1. Association for Computational Linguistics, pp 617–624
Kondor R, Jebara T (2003) A kernel between sets of vectors. In: Proceedings of the 20th international conference on machine learning (ICML-03), pp 361–368
Chen Y, Hou P, Manderick B (2014) An ensemble self-training protein interaction article classifier. Bio-Med Mater Eng 24(1):1323–1332
Abdulkadhar S, Murugesan G, Natarajan J (2017) Classifying protein–protein interaction articles from biomedical literature using many relevant features and context-free grammar. J King Saud Univ Comput Inf Sci 32:553–560
Li L, Guo R, Jiang Z, Huang D (2015) An approach to improve kernel-based protein–protein interaction extraction by learning from large-scale network data. Methods 15(83):44–50
Hung SH, Lin CH, Hong JS (2010) Web mining for event-based commonsense knowledge using lexico-syntactic pattern matching and semantic role labeling. Expert Syst Appl 37(1):341–347
Hearst MA (1992) Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th conference on computational linguistics, vol 2. Association for Computational Linguistics, pp 539–545
Miwa M, Sætre R, Kim JD, Tsujii JI (2010) Event extraction with complex event classification using rich features. J Bioinform Comput Biol 8(01):131–146
Bui QC, Campos D, Van Mulligen E, Kors J (2013) A fast rule-based approach for biomedical event extraction. In: Proceedings of the BioNLP shared task 2013 workshop, pp 104–108
Björne J, Salakoski T (2018) Biomedical event extraction using convolutional neural networks and dependency parsing. In: Proceedings of the BioNLP 2018 workshop, pp 98–108
Miwa M, Thompson P, McNaught J, Kell DB, Ananiadou S (2012) Extracting semantically enriched events from biomedical literature. BMC Bioinform 13(1):108

Acknowledgements

No separate funding was received for this research work.

Author information

Authors and Affiliations

Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, TamilNadu, 641046, India
Sabenabanu Abdulkadhar & Jeyakumar Natarajan
DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, TamilNadu, 641046, India
Balu Bhasuran & Jeyakumar Natarajan

Authors

Sabenabanu Abdulkadhar
Balu Bhasuran
Jeyakumar Natarajan

Corresponding author

Correspondence to Jeyakumar Natarajan.

Ethics declarations

Ethical approval

The datasets used in this work all come from the BioNLP shared tasks and GENIA, which are freely available for research with suitable citations. The implementations of the kernel approaches and natural language processing methods used in this work are all available as open-source software with suitable citations.

Conflict of interest

The authors declare that there are no conflicts of interest in this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

S1: Pseudo code for proposed approach (PDF 68 kb)

S2: Pseudo code for MLG Kernel (PDF 105 kb)

S3: Feature Vector Generation (PDF 124 kb)

About this article

Cite this article

Abdulkadhar, S., Bhasuran, B. & Natarajan, J. Multiscale Laplacian graph kernel combined with lexico-syntactic patterns for biomedical event extraction from literature. Knowl Inf Syst 63, 143–173 (2021). https://doi.org/10.1007/s10115-020-01514-8

Received: 18 May 2019
Revised: 22 September 2020
Accepted: 27 September 2020
Published: 24 October 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s10115-020-01514-8
