1 Introduction

Advances in both biological and computational methods act as the catalyst for a large number of publications, especially in the biomedical domain [1]. Life science research outputs are widely disseminated as scientific articles, which can act as a source for knowledge discovery [2]. Recently, biomedical text mining applications have been developed using this literature, with a focus on biological and clinical areas such as screening of clinical trials, pharmacogenetics, reaction detection and drug repurposing [3].

Initial efforts on text mining in the biomedical domain focused mainly on fundamental tasks such as categorizing bio-entities (genes, proteins, diseases, and drugs) and extracting binary relationships (protein–protein interaction, gene–disease associations and disease–disease associations) between the entities [4]. Extracting relations from biomedical literature is a significant task in semantic text mining [5]. Recent relation extraction strategies have been applied to various biomedical problems such as protein–protein interactions (PPIs) [6], gene–disease associations [7], chemical-induced disease (CID) [8] and chemical–disease relation (CDR) [9]. The biomedical relation classification task focusing on PPI and drug–drug interaction [10] shows the importance and applications of relation extraction from the literature.

Following the success of the relation extraction task, the next focus is on extracting related biomolecular events from the text. In general, a bio-event is a textual event specialized for the biomedical domain: a dynamic bio-relation involving one or more participants. These participants can be bio-entities or bio-events, and each is usually assigned a semantic role such as theme or cause [11, 12]. Bio-event extraction can support tasks such as pathway reconstruction [13], semantic search [14], association mining for knowledge discovery, and bioprocess extraction [15]. Automatically extracting events from biomedical text is challenging because of the uncertainty and variety of language phenomena, such as negation and speculation, which occur in biological text and can lead to misunderstanding and incorrect interpretation [11, 12].

The bio-event extraction process consists of two common steps: trigger detection and argument detection. Trigger detection comprises the identification of event triggers and their types, as specified by the selected ontology [11]. Argument detection, also known as edge detection or event theme construction, is the process of detecting the arguments of the events. The arguments can be named entities (genes, proteins, diseases) or other events represented by trigger words [11, 12, 16]. Consider the following example.

Example:

PMCID: 1310901

Original Sentence:

Down-regulation of interferon regulatory factor 4 gene expression in leukemic cells.

Tagged Sentence:

<trigger>Down-regulation </trigger> of <theme>interferon regulatory factor 4 </theme><trigger>gene expression </trigger> in leukemic cells.

Here the trigger words ‘Down-regulation’ and ‘gene expression’ denote two events, regulation and gene expression, and the gene ‘interferon regulatory factor 4’ is the theme, i.e. the argument of the events in the sentence.
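As an illustration of how such an inline-tagged sentence can be turned into structured annotations, the following minimal Python sketch collects the trigger and theme spans. The function name `parse_tagged_sentence` and the output layout are assumptions made for this example only; they are not part of the described system.

```python
import re

# Matches <trigger>...</trigger> and <theme>...</theme> spans; \1 forces the
# closing tag to agree with the opening tag.
TAG_PATTERN = re.compile(r"<(trigger|theme)>(.*?)</\1>")


def parse_tagged_sentence(tagged: str) -> dict:
    """Collect trigger and theme spans from an inline-tagged sentence."""
    spans = {"trigger": [], "theme": []}
    for tag, text in TAG_PATTERN.findall(tagged):
        spans[tag].append(text.strip())  # drop the padding inside the tags
    return spans


tagged = (
    "<trigger>Down-regulation </trigger> of "
    "<theme>interferon regulatory factor 4 </theme>"
    "<trigger>gene expression </trigger> in leukemic cells."
)
annotations = parse_tagged_sentence(tagged)
print(annotations)
```

For the example sentence, this yields two trigger spans and one theme span, which mirrors the two events and the single argument discussed above.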

There has been wider acceptance of the notion that biomolecular events can play a crucial role in the molecular mechanisms of diseases and can be linked with interactions in pathways and networks [12, 16]. For this and other reasons, the BioNLP-ST (Biomedical Natural Language Processing Shared Task) community challenges in 2009 [16], 2011 [17], 2013 [18] and 2016 [19] were organized with a specific focus on biomolecular event extraction from the literature. The core problem in these tasks was the extraction of biomolecular events from standard datasets based on the GENIA corpus [20]. The GENIA corpus was later enriched with domain-specific meta-knowledge and named the GENIA-MK (Meta-knowledge) corpus [21]. The GENIA-MK corpus contains human-curated annotations of 9,372 sentences from 1,000 abstracts, in which 36,858 typed, complex and nested events are represented [21]. Recently, Zerva et al. [22] proposed a hybrid approach combining a random forest with generic rule patterns, which uses dependencies between trigger words and uncertainty cues and achieved an F-score of 88% on the GENIA-MK corpus.

1.1 Background

Different text-mining approaches have been developed using rule-based [23], dictionary-based [24], machine learning [25], and hybrid [26] techniques. In particular, Support Vector Machine (SVM) algorithms combined with rule-based or dictionary-based approaches are widely used for extracting biomolecular events [27]. Despite several existing approaches, the challenge remains open and leaves room for improvement. For example, pattern matching and dictionary-based approaches achieved only moderate results on complex event types such as regulation, negative regulation, and positive regulation [11]. Machine learning based studies [25] employed different strategies such as kernel-based learning [28, 29], deep learning [30,31,32], graph-based learning [33,34,35,36,37,38,39,40,41] and hybrid approaches [26] to extract biomedical events efficiently.

Recently, enriched graph-based features have played an important role in extracting events from text and underlie the best-performing systems for the classification of biological events [42]. The advantage of graph-based approaches for event extraction is that they exploit structural properties of the sentence such as semantic and syntactic features, path features, and similarity features, as explained in the review [42]. Earlier, various graph-based approaches such as subgraph mining [43,44,45], random walk [46], shortest paths [47], subgraph matching [39,40,41, 48] and hybrid methods [49] were introduced to extract biomedical events from the literature.

Subgraph mining is the process of extracting the important concepts from the graph [43,44,45]. A random walk is a path consisting of random steps from one node (bio-entity) to another node (bio-event) in the graph [46]. The shortest path is the optimal (shortest) path between two nodes (entity and event) [47]. Graph matching techniques are used to determine whether one text can be inferred from another by using the dependency parses of the two texts [39]. Subgraph matching techniques are used to extract the maximum common subgraph between two graphs [39,40,41]. In addition, kernel-based approaches integrated with graphs have produced efficient results in relation extraction tasks [50, 51]. A graph kernel was generated using dependency parsing techniques in which each graph contains the dependency structure and the linear order of the words [52]. In this study, we employed a special graph kernel named the Multiscale Laplacian Graph (MLG) kernel [53], integrated with a linear feature-based kernel, to extract biological events from text. The MLG kernel compares the structure of graphs at multiple scales. The motivation for employing MLG is that it captures not only the topological relationships between individual event nodes but also the topological relationships between subgraphs [53]. The following section briefly describes state-of-the-art approaches to biomedical event extraction.

1.2 Related work

Bjorne et al. [33] used n-gram features, shortest-path syntactic dependencies between event arguments and rule-based graph pruning to extract events, and attained an F-score of 51.95% on the BioNLP-ST-2009 task dataset. The disadvantage of this approach is its low trigger detection performance on the test set. In BioNLP-ST 2013, Bjorne and Salakoski [34] presented an automated event extraction system named TEES 2.1, a machine learning based tool for extracting text-bound graphs from natural language articles. It represents both binary relations and events in a unified graph format, where named entities and triggers are nodes and relations and event arguments are edges, and it reported an F-score of 50.74%. The lack of learned rules caused defects in the argument detection phase. For example, consider an event with multiple optional arguments, such as Cell differentiation from the CG task with 0–1 AtLoc arguments and 0–1 Theme arguments. While such an event can in principle exist without any arguments at all, it is often the case that at least one of the optional arguments must be present. Hakala et al. [35] used graph-represented features, including paths connecting nested events and the co-occurrence of pairs of entities (such as genes and proteins) in general subgraphs mined from external PubMed and PMC abstracts, and reported the best F-score of 50.97% in BioNLP-ST-2013. The main limitation of this system is that it improves only precision, not recall.

In BioNLP-ST-2011, Riedel et al. [36] extracted event arguments by scoring candidate subgraphs to rank event pairs and achieved an F-score of 57.46%. Their system employed stacking and the UMass model (a trained model consisting of trigger labels, event arguments and protein pairs) to extract the events. Stacking led to better performance, but combining stacking with the UMass model caused slight variation in performance on the test sets. McClosky [54] converted the annotated event structures in the training data to event dependency graphs that take entities (event arguments) as vertices and edges, and attained an F-score of 50% in BioNLP-ST-2011. Riedel and McCallum [55] implemented a stacking procedure and combined their approach with that of McClosky [54]; they extracted event arguments by scoring candidate subgraphs to rank event arguments and achieved an F-score of 56.05% on the BioNLP-ST-2011 dataset. The limitation of this approach is that full-text events are harder to extract.

Liu et al. [39, 40] implemented Exact Subgraph Matching and Approximate Subgraph Matching (ESM/ASM) approaches to extract events from the literature efficiently. Their method applied ESM/ASM from sentence graphs to event graphs, applied a distance metric to every vertex of the subgraphs, and attained an F-score of 51.12% on the BioNLP-ST-2011 dataset. The lack of post-processing rules and inconsistencies in the gold annotation caused additional false positives and false negatives in this system. Liu et al. [41] further improved their ESM/ASM-based approach with a distributional similarity model (DSM) and optimized graph features, and attained an F-score of 55.09% in BioNLP-ST-2013. The limitation of this approach is low recall due to ‘Site’ entity recognition.

Apart from the above graph-based approaches, different classification approaches have recently been deployed to extract biomedical events efficiently [30, 56,57,58]. Some of the notable works are discussed here. Munkhdalai et al. [56] proposed a semi-supervised learning method named self-training in significance space (STSS) to address the imbalanced data problem and attained an F-score of 54.30% in BioNLP-ST-2011. The system's F-measure is lower because of its computational requirements. Wang et al. [30] presented a multiple distributed representation method, which combines dependent context formed by word embeddings with task-based features from biomedical text and feeds them to deep learning models; it achieved F-scores of 59.94%, 55.20%, and 50.12% on the BioNLP-ST-2009, 2011 and 2013 datasets, respectively. This method still needs manually designed features, which limits its power of generalization. Li et al. [57] used an optimization method named dual decomposition along with rich dependency-parse-based features and unsupervised word features, and extracted events with F-scores of 56.09% and 53.19% in BioNLP-ST-2009 and 2013, respectively. Recently, Wang et al. [58] implemented a Bidirectional Long Short-Term Memory (Bidirectional-LSTM) approach for event extraction on the Multi-Level Event Extraction (MLEE) corpus. Furthermore, to generalize their approach they used the BioNLP-ST-2009, 2011 and 2013 corpora and achieved F-scores of more than 60% on the development sets.

Biomolecular event applications are of increasing importance, and current trends in biomedical relation extraction favor ensemble learning methods and graph-based approaches [33, 42]. This motivated us to integrate a Multiscale Laplacian Graph (MLG) kernel with a feature kernel as an ensemble model for the event extraction task. The challenge of the current study was the extraction of complex events using subgraph mining, thereby gaining a deeper understanding of biomolecular events. Kondor and Pan [53] first introduced MLG to compare the structure of graphs simultaneously at multiple scales. The objective of employing MLG in our event extraction is that it captures not only the topological relationships between individual event nodes but also the associations among subgraphs for complex events.

The rest of the paper is organized as follows. Section 2 details the materials and methods, with a complete overview of the MLG model used in this study. Section 3 presents the results and discussion, followed by conclusions and future perspectives in Sect. 4.

2 Methods

The event extraction system presented in this study has three subtasks, namely (i) text preprocessing, (ii) event identification and (iii) argument detection. In text preprocessing, we applied general steps such as text preparation and cleaning, recognition of gene and protein mentions, and dependency parsing of event sentences. In the event identification phase, we used two kernels: a baseline feature-based kernel, which uses token-based, sentence-based, parsing and domain-specific features, and the Multiscale Laplacian Graph (MLG) kernel, which uses the multilevel topological relationships between the event nodes as features. The feature-based kernel and the MLG kernel were combined using an ensemble SVM for event identification. Finally, in the argument detection phase, we used lexico-syntactic patterns to detect the arguments of the events. The overall schematic architecture of our event extraction pipeline is depicted in Fig. 1, and each subtask is described in detail in the following subsections.

Fig. 1 Overall schematic architecture of the proposed event extraction system

In our methodology, we considered the nine most crucial event types from BioNLP-ST [16,17,18], which are commonly used in existing studies. The nine event types are grouped into three main classes. The first five (Gene Expression, Transcription, Protein catabolism, Phosphorylation, Localization) have only one argument (theme: protein) and are called simple events. The second class, binding events, involves more than one argument (two themes: proteins). Finally, the regulation events (Regulation, Positive regulation, Negative regulation) have two arguments: a theme and a cause (event or protein).
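The grouping of the nine event types into three classes can be written down as a small lookup table. The sketch below is illustrative only: the names `EventClass`, `EVENT_CLASS`, `EXPECTED_ROLES` and `expected_roles` are assumptions of this example, and the role tuples simply restate the argument structure described above.

```python
from enum import Enum


class EventClass(Enum):
    SIMPLE = "simple"          # one theme (protein)
    BINDING = "binding"        # two themes (proteins)
    REGULATION = "regulation"  # theme and cause (event or protein)


# The nine event types considered in this study, keyed by event type name.
EVENT_CLASS = {
    "Gene expression": EventClass.SIMPLE,
    "Transcription": EventClass.SIMPLE,
    "Protein catabolism": EventClass.SIMPLE,
    "Phosphorylation": EventClass.SIMPLE,
    "Localization": EventClass.SIMPLE,
    "Binding": EventClass.BINDING,
    "Regulation": EventClass.REGULATION,
    "Positive regulation": EventClass.REGULATION,
    "Negative regulation": EventClass.REGULATION,
}

# Argument roles expected for each class (illustrative restatement of the text).
EXPECTED_ROLES = {
    EventClass.SIMPLE: ("Theme",),
    EventClass.BINDING: ("Theme", "Theme"),
    EventClass.REGULATION: ("Theme", "Cause"),
}


def expected_roles(event_type: str) -> tuple:
    """Return the argument roles expected for a given event type."""
    return EXPECTED_ROLES[EVENT_CLASS[event_type]]


print(expected_roles("Positive regulation"))
```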

2.1 Text pre-processing

2.1.1 Text preparation and cleaning

To prepare the corpus for event extraction, the following preprocessing steps were carried out: tokenization, sentence segmentation, POS tagging, lemmatization, and chunking. OpenNLP [59] was used for sentence splitting, tokenization, POS tagging, and chunking. Lemmatization was performed with BioLemmatizer [60].

2.1.2 Dependency parsing

We applied dependency parsing to obtain the grammatical relationships between pairs of words from a graph representation of the dependency relations in a sentence. The advantage of dependency parsing is that it exposes the grammatical relationships between words and yields a syntactic representation of a given sentence. A dependency relation is formalized as a direct grammatical relationship between two words (a headword and a dependent word), and a sentence is represented as a graph of dependency relations [61]. Dependency-related features play an important role in extracting biomedical events. Here, we used two dependency parsers: the Stanford Dependency Parser (SDP) [62] to compute the universal dependencies and the GENIA Dependency Parser (GDep) [63] to generate the dependency graph of the sentence. Figure 2 depicts the dependency parse for a simple sentence. Here we can see that binary relations between common nouns such as transcription, gene, activity with adjectives and prepositions like binding, in and c-jun were identified. The given sentence states that Leukotriene B4 stimulates the transcription of genes c-fos and c-jun and activity AP-1 binding in human monocytes. The dependency parser identified transcription, gene, Leukotriene, activity as NN (noun, singular), AP-1 as CD (cardinal number), and monocytes as NNS (noun, plural). The dependency parser also identified the grammatical relations within the sentence using amod (adjectival modifier), dobj (direct object), pobj (object of preposition), conj (conjunction), and prep (preposition).

Fig. 2 Dependency parsing for a simple sentence

2.1.3 Named entity recognition (NER)

The next step in our approach is the recognition of gene/protein mentions in the event sentences. Named entities play an important role in extracting events with high accuracy, since they fill the theme and cause roles. NER is the process of detecting entities such as genes, proteins, diseases, species, RNA, cells and cell lines in text [64, 65]. BCC-NER [66], our in-house hybrid named entity tagger, was used to detect gene and protein names automatically.

2.2 Event identification

Next, for event identification, we used an ensemble machine learning based classification approach with two kernels, namely feature-based kernel and MLG kernel. The feature-based kernel uses token-based features, sentence-based features, parsing features, and domain-specific features. The Multiscale Laplacian Graph Kernel (MLG) [53] uses the multilevel topological relationships between the event nodes as features. Both the feature-based kernel and the MLG kernel were combined using ensemble SVM [67] for event identification.

2.2.1 Feature-based kernel

In the baseline feature-based linear kernel, we used a total of 15 features broadly classified into four feature categories, namely token-based, sentence-based, parsing and domain-specific features, which were employed successfully in previous bio-event extraction tasks [68,69,70]. All 15 features are grouped by category in Table 1. The detailed feature representations used to generate the feature-based kernel model are explained in Supplementary file S3.
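To make the notion of a token-based feature concrete, the following minimal Python sketch extracts character n-grams and prefix/suffix features for a single token, using the settings reported in this study (character n-gram length 3 and two-character affixes). The function name `token_features` and the dictionary layout are assumptions of this example; the full feature set is the one listed in Table 1 and Supplementary file S3, which this sketch does not reproduce.

```python
def token_features(token: str, n: int = 3, affix_len: int = 2) -> dict:
    """Character n-gram and affix features for one token (illustrative subset)."""
    lowered = token.lower()
    # Sliding window of width n over the lower-cased token.
    ngrams = [lowered[i:i + n] for i in range(len(lowered) - n + 1)]
    return {
        "char_ngrams": ngrams,
        "prefix": lowered[:affix_len],
        "suffix": lowered[-affix_len:],
    }


features = token_features("Down-regulation")
print(features["prefix"], features["suffix"], features["char_ngrams"][:3])
```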

Table 1 Category-wise features used in the feature-based kernel

2.2.2 Multiscale Laplacian Graph (MLG) kernel

Recently, graph-based approaches to relation extraction have received increased attention for their ability to capture both syntactic and semantic structures, thereby enabling a deeper understanding of complex sentences such as bio-events and achieving state-of-the-art performance [41]. To improve the performance of the bio-event extraction task, we employed the MLG kernel [53] along with the baseline feature-based kernel. The MLG kernel [53] is briefly introduced below; it is constructed from two graph kernels, namely (i) the Laplacian Graph (LG) kernel and (ii) the Feature space Laplacian Graph (FLG) kernel. The implementation of the MLG kernel is available at https://github.com/horacepan/MLGkernel.

Laplacian Graph (LG) Kernel: Consider a weighted undirected graph \( G \) with vertex set \( V = \left\{ {v_{1} ,v_{2} , \ldots ,v_{n} } \right\} \) and edge set \( E \). The graph Laplacian [75] is a positive semi-definite matrix that can be expressed using the adjacency matrix \( A \) and the weighted degree matrix \( D \) as \( L = D - A \).
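As a small worked illustration of \( L = D - A \), the following pure-Python sketch builds the Laplacian of a three-vertex path graph from its adjacency matrix. The function name `graph_laplacian` and the list-of-lists matrix representation are choices made for this example only.

```python
def graph_laplacian(adjacency):
    """Return L = D - A for a symmetric (weighted) adjacency matrix."""
    n = len(adjacency)
    laplacian = []
    for i in range(n):
        degree = sum(adjacency[i])                  # weighted degree of vertex i
        row = [-adjacency[i][j] for j in range(n)]  # off-diagonal entries: -A
        row[i] += degree                            # diagonal entry: D_ii - A_ii
        laplacian.append(row)
    return laplacian


# Path graph v1 - v2 - v3 with unit edge weights.
A = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
L = graph_laplacian(A)
for row in L:
    print(row)
```

Each row of the resulting matrix sums to zero, which is why the Laplacian is singular and positive semi-definite rather than positive definite.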

The LG kernel of two graphs (\( G_{1} ,G_{2} \)) can be defined by the following equation.

$$ k_{\text{LG}} \left( {G_{1} ,G_{2} } \right) = \frac{{\left| {\left( {\frac{1}{2}S_{1}^{ - 1} + \frac{1}{2}S_{2}^{ - 1} } \right)^{ - 1} } \right|^{1/2} }}{{\left| {S_{1} } \right|^{1/4} \left| {S_{2} } \right|^{1/4} }} $$
(1)

where \( S_{1} = L_{1}^{ - 1} + \lambda I \) and \( S_{2} = L_{2}^{ - 1} + \lambda I \).

Here \( L_{1}^{ - 1} \) and \( L_{2}^{ - 1} \) are the inverses of the graph Laplacians, I is the identity matrix and λ is a scalar parameter; together they are used to obtain the similarity between the graphs \( G_{1} ,G_{2} \).
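The LG kernel of Eq. 1 can be evaluated numerically for small graphs. The sketch below is a pure-Python illustration under stated assumptions, not the implementation used in this study: because \( L = D - A \) is singular, the sketch inverts a regularized Laplacian \( L + \eta I \) (the regularization, the default values of `lam` and `eta`, and all function names are assumptions of this example), and it uses a plain Gauss-Jordan routine that is only suitable for small matrices.

```python
def regularized_laplacian(adjacency, eta):
    """L = D - A + eta*I; eta > 0 makes the otherwise singular Laplacian invertible."""
    n = len(adjacency)
    lap = [[-float(adjacency[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        lap[i][i] += sum(adjacency[i]) + eta
    return lap


def inverse_and_det(matrix):
    """Gauss-Jordan elimination with partial pivoting; returns (inverse, determinant)."""
    n = len(matrix)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        if pivot != col:
            aug[col], aug[pivot] = aug[pivot], aug[col]
            det = -det                      # a row swap flips the sign
        scale = aug[col][col]
        det *= scale                        # determinant = signed product of pivots
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug], det


def lg_kernel(adj1, adj2, lam=0.01, eta=0.1):
    """Eq. 1 for two graphs with the same number of vertices."""
    def s_matrix(adjacency):
        lap_inv, _ = inverse_and_det(regularized_laplacian(adjacency, eta))
        return [[x + (lam if i == j else 0.0) for j, x in enumerate(row)]
                for i, row in enumerate(lap_inv)]

    s1, s2 = s_matrix(adj1), s_matrix(adj2)
    s1_inv, det_s1 = inverse_and_det(s1)
    s2_inv, det_s2 = inverse_and_det(s2)
    mix = [[0.5 * a + 0.5 * b for a, b in zip(r1, r2)]
           for r1, r2 in zip(s1_inv, s2_inv)]
    _, det_mix = inverse_and_det(mix)
    # |mix^{-1}| = 1 / |mix|, so the numerator of Eq. 1 is (1 / det_mix) ** 0.5.
    return (1.0 / det_mix) ** 0.5 / (det_s1 ** 0.25 * det_s2 ** 0.25)


path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
k_same = lg_kernel(path, path)
k_diff = lg_kernel(path, triangle)
print(k_same, k_diff)
```

When both graphs are identical, \( S_{1} = S_{2} \) and the numerator and denominator of Eq. 1 cancel, so the kernel value is 1; for different graphs the value falls below 1.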

Feature Space Laplacian Graph kernel (FLG): The FLG kernel compares the structure of subgraphs at a single scale. FLG combines the information attached to the vertices with the graph Laplacian. It transforms the vertex space variables \( a_{1} ,a_{2} , \ldots ,a_{n} \) into feature space variables \( b_{1} ,b_{2} , \ldots ,b_{n} \), where \( b_{i} = \mathop \sum \nolimits_{j} t_{i,j} (a_{j} ) \) and each \( t_{i,j} \) depends on j only through local, reordering-invariant properties of vertex \( v_{j} \), so that the resulting kernel is permutation invariant. Vertex space variables are the input variables, attached to the graph vertices, that are transformed into the feature space. Consider \( G_{1} ,G_{2} \) as two graphs with regularized Laplacians \( L_{1} \) and \( L_{2} \), let λ ≥ 0 be a parameter, and let (Φ1,…,Φm) be a collection of m local vertex features, which define the feature mapping matrices in the FLG. The FLG kernel is defined as follows.

$$ k_{\text{FLG}} \left( {G_{1} ,G_{2} } \right) = \frac{{\left| {\left( {\frac{1}{2}S_{1}^{ - 1} + \frac{1}{2}S_{2}^{ - 1} } \right)^{ - 1} } \right|^{1/2} }}{{\left| {S_{1} } \right|^{1/4} \left| {S_{2} } \right|^{1/4} }} $$
(2)

where \( S_{1} = U_{1} L_{1}^{ - 1} U_{1}^{T} + \lambda I \) and \( S_{2} = U_{2} L_{2}^{ - 1} U_{2}^{T} + \lambda I \).

Here \( U_{1} \) and \( U_{2} \) are the feature mapping matrices with transposes \( U_{1}^{T} \) and \( U_{2}^{T} \), \( L_{1} \) and \( L_{2} \) are the Laplacian matrices, I is the identity matrix and λ is a scalar parameter. The major limitation of the FLG kernel is that it cannot consider graph structure at multiple scales, which paved the way for the MLG kernel. The FLG kernel is the key component of the MLG kernel and is applied recursively to construct it.

Multiscale Laplacian Graph (MLG) Kernel: The MLG kernel for a graph (G) can be computed as follows:

(i) The graph (G) is divided into a large number of smaller subgraphs, and the FLG kernel is computed between any two subgraphs to calculate their similarity at a single scale.

(ii) A new FLG kernel is calculated between the vertices by placing the extracted subgraphs at a random vertex of the graph G.

(iii) Finally, a new FLG kernel is computed between the larger subgraphs of the graph (G) based on step (ii), and this process is repeated L times (once per scale).

The MLG kernel is thus constructed as follows:

Consider a graph G with vertex set V, and let k be a positive semi-definite kernel on V. For each vertex \( v \in V \) we have a nested sequence of L neighborhoods.

$$ v \in N_{1 } \left( v \right) \subseteq N_{2} \left( v \right) \subseteq \cdots \subseteq N_{L} \left( v \right) \subseteq V $$
(3)
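The nested sequence in Eq. 3 can be illustrated with a breadth-first search. In the sketch below, \( N_{l}(v) \) is taken to be the set of vertices within l hops of v; this hop-based definition, the function name `nested_neighborhoods` and the adjacency-list format are assumptions of this example, since other ways of growing the neighborhoods are possible.

```python
from collections import deque


def nested_neighborhoods(graph, v, levels):
    """Return [N_1(v), ..., N_L(v)], where N_l(v) holds the vertices within l hops of v."""
    distance = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if distance[u] == levels:   # no need to expand beyond the largest scale
            continue
        for w in graph[u]:
            if w not in distance:
                distance[w] = distance[u] + 1
                queue.append(w)
    return [{u for u, d in distance.items() if d <= l}
            for l in range(1, levels + 1)]


# Path graph 0 - 1 - 2 - 3 - 4 given as an adjacency list.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
neighborhoods = nested_neighborhoods(graph, 2, levels=2)
print(neighborhoods)
```

Each neighborhood contains the previous one, so the induced subgraphs \( G_{l}(v) \) grow monotonically with the level l.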

Let \( G_{l} (v) \) be the subgraph corresponding to each \( N_{l} (v) \). From the above, the Multiscale Laplacian Subgraph (MLS) kernels \( k_{1} , \ldots ,k_{L} :V \times V \to R \) are defined by computing FLG kernels recursively over the vertex set V.

$$ k_{1 } \left( {v,v^{\prime}} \right) = k_{\text{FLG}}^{k} \left( {G_{1} \left( v \right),G_{1} \left( {v^{\prime}} \right)} \right) $$
(4)

\( k_{1 } \) is the FLG kernel (\( k_{\text{FLG}}^{k} \)) generated from the base kernel \( k \). Here, the base kernel \( k \) is used to lift the FLG kernel to a multiscale kernel.

$$ k_{l } \left( {v,v^{\prime}} \right) = k_{\text{FLG}}^{{k_{l - 1} }} \left( {G_{l} \left( v \right),G_{l} \left( {v^{\prime}} \right)} \right) $$
(5)

where \( l = 2, 3, \ldots ,L \), and \( k_{l } \) is generated from the kernel \( k_{l - 1} \).

Let \( {\mathcal{G}} \) be a collection of graphs such that all their vertices are members of an abstract vertex space V endowed with a symmetric positive semi-definite kernel k : \( V \times V \to R \). Assume that the MLS kernels \( k_{1 } \),…,\( k_{ L} \) are defined as in Eqs. 4 and 5, both for pairs of subgraphs within the same graph and across pairs of different graphs. The MLG kernel between two graphs can then be defined as follows:

$$ k\left( {G_{1} ,G_{2} } \right) = k_{\text{FLG}}^{{k_{L} }} \left( {G_{1} ,G_{2} } \right) $$
(6)

In this study to implement the MLG kernel, we generated Universal Dependencies (UD) along with the adjacency matrix of the bio-event sentences.

Universal dependencies: We applied the Stanford parser to generate the UD of the sentences [62]. The grammatical relations of UD are arranged in a hierarchy rooted in the most generic relation, dependent. In this study, we applied UD to all event sentences to extract the typed relations across the sentence, especially those involving trigger words and entities.

Adjacency matrix: The generated UD of the biomedical event sentences was used to create an adjacency matrix representing the associations between words. An example of the generated UD and the corresponding adjacency matrix for a sample sentence (PMCID: 1310901) are shown in Fig. 3a and b, respectively.
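The step from dependency relations to an adjacency matrix can be sketched as follows. The dependency edges in this example are hand-written for illustration and are not the actual parser output for this sentence; the function name `adjacency_from_dependencies` and the use of token indices for edges are likewise assumptions of this example.

```python
def adjacency_from_dependencies(num_tokens, edges):
    """Build a symmetric 0/1 adjacency matrix from (head_index, dependent_index) pairs."""
    matrix = [[0] * num_tokens for _ in range(num_tokens)]
    for head, dependent in edges:
        matrix[head][dependent] = 1
        matrix[dependent][head] = 1   # treat the dependency graph as undirected
    return matrix


tokens = ["Down-regulation", "of", "interferon", "regulatory", "factor", "4",
          "gene", "expression", "in", "leukemic", "cells"]

# Hypothetical dependency edges (head, dependent), given as token indices.
edges = [(0, 7), (7, 1), (7, 6), (6, 4), (4, 2), (4, 3), (4, 5),
         (0, 10), (10, 8), (10, 9)]

matrix = adjacency_from_dependencies(len(tokens), edges)
for token, row in zip(tokens, matrix):
    print(f"{token:16s}", row)
```

Indexing edges by token position rather than by token string keeps the construction well defined when a word occurs more than once in a sentence.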

Fig. 3 Universal dependencies and adjacency matrix for the example sentence: (a) universal dependencies, (b) adjacency matrix

Subgraph mining

In the MLG kernel, subgraph mining is essential for scaling the event sentences at multiple levels. The aim of this graph kernel is to find the local structures that are critical at specific positions of the graph and a global property that roughly summarizes the graph. To do so, the MLG kernel is defined as a graph kernel that considers structure at multiple scales by comparing graphs through their subgraphs recursively. In the underlying procedure, two graphs are compared by their subgraphs; in the next iteration, two subgraphs are compared by smaller subgraphs, and so on. The MLG kernel uses node features to capture the global structure and feature vectors induced by similarity scores to compare structures at multiple scales. The recursive approach compares the same subgraph pairs multiple times by calculating similarity scores on smaller neighborhoods. In this study, we created the graph using Universal Dependencies (UD) along with the adjacency matrix. Subgraph mining was carried out using the following procedure: (i) assign the node degree to the entire graph-structured event sentence; (ii) construct subgraphs from the large graph; (iii) construct larger subgraphs for the event sentence; (iv) apply the low-rank approximation approach to all subgraphs and to each larger subgraph.

2.2.3 Ensemble classification

In the biomedical domain, ensemble classification plays a vital role in improving overall performance for tasks such as article classification [76, 77] and relation extraction [6, 78]. SVM with an ensemble learning approach efficiently learns multiple training models with low time complexity. In EnsembleSVM [67], a bootstrapping strategy is employed to repeatedly learn training models and to aggregate the multiple training instances into a single prediction model. In this study, we employed EnsembleSVM [67] to generate ensemble models for the feature-based linear kernel and the MLG kernel and merged them into a single classification model to categorize the events efficiently. Using EnsembleSVM, we created models on bootstrap subsamples and trained ensembles of SVM models for the feature-based and MLG kernels, respectively. Figure 4 depicts the ensemble classification pipeline of our approach in detail.

Fig. 4 An example of an ensemble classification pipeline of the two kernels

The ensemble classification of our approach is described in Eq. 7:

$$ E_{k} = F_{k} + G_{k} $$
(7)

Here, \( E_{k} \), \( F_{k} \) and \( G_{k} \) are the kernel models in our classification problem. Using the validation set, we tuned various parameters with the grid search method during model generation. For the features, the character n-gram length was set to 3 and the prefix/suffix feature to two characters. For the MLG kernel model, the parameters were finally set as follows: radius 3, levels 4, eta 0.1, gamma 0.01 and threads 32. The tree value parameter grow was set to 1 to grow by leaf radius, which allows the subgraphs to double in size at each level. We kept all these parameters at their default values during model development.
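One common reading of Eq. 7 is an element-wise sum of two kernel (Gram) matrices computed over the same set of instances, which again yields a valid kernel. The sketch below illustrates that reading only; it is an assumption of this example that \( F_{k} \) denotes the feature-based linear kernel and \( G_{k} \) the graph kernel, the graph-kernel values shown are placeholders, and the actual model merging in this study was performed with EnsembleSVM.

```python
def linear_kernel(vectors):
    """Gram matrix of pairwise dot products between feature vectors."""
    return [[sum(a * b for a, b in zip(x, y)) for y in vectors] for x in vectors]


def combine_kernels(f_kernel, g_kernel):
    """Element-wise sum of two Gram matrices over the same instances (E = F + G)."""
    if len(f_kernel) != len(g_kernel) or any(
        len(rf) != len(rg) for rf, rg in zip(f_kernel, g_kernel)
    ):
        raise ValueError("kernel matrices must have the same shape")
    return [[f + g for f, g in zip(rf, rg)] for rf, rg in zip(f_kernel, g_kernel)]


# Two toy instances with three features each (illustrative values).
feature_vectors = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
F = linear_kernel(feature_vectors)

# Placeholder graph-kernel values for the same two instances (not real MLG output).
G = [[1.0, 0.4], [0.4, 1.0]]

E = combine_kernels(F, G)
print(E)
```

A combined matrix of this kind could be passed to any SVM implementation that accepts precomputed kernels.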

2.3 Argument detection

After the identification of events and triggers, the next step is to extract the arguments that describe the events. To extract the arguments of the events from text efficiently and accurately, we used a lexico-syntactic pattern-based approach with semantic role labeling [79], which is briefly introduced below.

2.3.1 Lexico-syntactic pattern and semantic role-based rules engine

Lexico-syntactic patterns [80] are generalized linguistic structures for extracting related concepts and relationships between concepts from text. Here the trigger words and propositions (synonym, subject, and verb) were the concepts and relationships used to detect the event arguments. Lexico-syntactic patterns were used to structure the ontology of the words. Motivated by the work of Hung et al. [79], we employed lexico-syntactic patterns to identify the {THEME, CAUSE} of the events. In the current study, arguments were identified by combining lexico-syntactic patterns and semantic matching in three steps: contextual patterns, semantic role labeling, and event-specific argument structure. The list of bio-event cues and the trigger word list were used to match the arguments using pattern matching and role labeling. In the event-specific argument structure phase, post-processing rules such as emphasizing event certainty and co-reference mentions were incorporated. Our lexico-syntactic pattern-based rule engine is depicted in detail in Fig. 5. Each step of the rule engine is briefly explained below with an example.

Fig. 5 Lexico-syntactic and semantic role based rule engine for argument detection phase

Contextual patterns

Contextual patterns (CP) utilize domain-specific information such as a trigger word list and tagged entities to annotate possible event arguments. The contextual patterns were employed with two components: subclass and complex. Subclass patterns were used to detect and annotate the trigger word list and tagged entities using the pattern keywords (VP list, VP, NP) from the dependency-parsed sentences. These patterns were also used to detect the prepositions (to, belong, with, without, etc.) between the trigger words and proteins. Tagged entities are represented in the sentence as ‘protein 1’, ‘protein 2’, etc., for example: interact with Protein 1 and Protein 2. Complex patterns were employed to identify the verb keywords that indicate multiple arguments of the same events. For example: protein1 interacts with protein 2 which catalyzes protein 3 and causes protein 4 downregulation. This sentence contains multiple events, which are represented by the cue words ‘catalyzes’, ‘interacts with’ and ‘causes’. A full example of contextual pattern identification (PMID: 9973520) is shown below.

Rule 1:

Scenario: Identifying the subject and verb in a given sentence.

Original sentence: Cross-linking of CD44 on rheumatoid synovial cells up-regulates VCAM-1

After applying CP: [NP Cross-linking] [VP up-regulates], etc.

Contextual patterns identify and annotate the possible trigger words and entities by utilizing the trigger word list in the sentence, which will be processed further by applying semantic role labeling techniques described below.
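As a minimal sketch of the annotation step described above, the following Python fragment marks trigger words and tagged entities in a sentence with regular expressions. The trigger list, the placeholder form of the entities, and the function name `annotate` are illustrative assumptions; they are not the trigger list or the Java Regex module used in this study, and the sketch does not use dependency parses.

```python
import re

# Hypothetical trigger-word list (assumption); the study's actual list is larger.
TRIGGERS = ["up-regulates", "interacts with", "catalyzes", "causes"]

# Assumed placeholder form of tagged entities, e.g. "protein 1" or "protein1".
ENTITY_RE = re.compile(r"\bprotein\s?\d+\b", re.IGNORECASE)

# Longest triggers first so multi-word cues are matched as a whole.
TRIGGER_RE = re.compile(
    "|".join(re.escape(t) for t in sorted(TRIGGERS, key=len, reverse=True)),
    re.IGNORECASE,
)


def annotate(sentence):
    """Return (label, text, start) tuples for triggers and entities, in sentence order."""
    spans = [("TRIGGER", m.group(0), m.start()) for m in TRIGGER_RE.finditer(sentence)]
    spans += [("ENTITY", m.group(0), m.start()) for m in ENTITY_RE.finditer(sentence)]
    return sorted(spans, key=lambda span: span[2])


sentence = ("protein1 interacts with protein 2 which catalyzes protein 3 "
            "and causes protein 4 downregulation")
annotations = annotate(sentence)
for label, text, start in annotations:
    print(label, text, start)
```

In this toy sentence the annotations alternate between entities and triggers, which is the kind of output a later role-labeling step could consume.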

Semantic role labeling

Semantic role labeling (SRL) is a natural language processing technique that determines the relationship between the verb and the syntactic structure of a sentence [79]. In our approach, SRL was used to search for and determine the association between protein entities and trigger words in a sentence. It involves detecting the semantic arguments associated with the predicate, or verb, of a sentence and classifying them into their specific roles. We use the following notation: A denotes the sentence, VP the verb, M a modal verb or preposition, P the participants in the sentence, and S the subject of the sentence. From these elements, we derived a semantic role labeling approach for our event argument construction process.

For example, in the sentence (PMID: 9973520) “Cross-linking of CD44 on rheumatoid synovial cells up-regulates VCAM-1”, the trigger word was “up-regulates”; the arguments were CD44, VCAM-1, and rheumatoid synovial cells, which participated in a positive regulation event. These were identified by applying a set of rules, whose procedure is given in detail in Table 2.

Table 2 Rules used for detecting the arguments of the events

Event-specific argument structure

After extracting event arguments using SRL based patterns, we incorporated two post-processing rules as event-specific argument structures to raise the performance of our event argument detection approach. Event-specific argument structure was used to differentiate simple and complex events that specify the arguments directly or indirectly in the tagged sentences.

(i) Searching for a connective pronoun such as “it” in the sentence that refers to an entity name (protein), as in the example below.

Examples of the generated rules (here D0 and D1 are dependencies, and Arg1 and Arg2 are arguments of the particular word in the sentence):

it_D0_Arg1_Arg2

both_D0_D1_Arg1_Arg2

that_D0_D1_Arg2_Arg1

Example:

PMID: 10209041

Expression of GrpL is restricted to hematopoietic tissues and <Keyword> it </Keyword> is distinguished from Grb2 by having a proline-rich region.

In the above example the pronoun ‘it’ denotes the protein GrpL, and it participated in the event ‘Expression’.

(ii) Searching for specific keywords such as ‘certainly’, ‘highly’, and ‘confirm’ that are co-mentioned with trigger words such as ‘activation’ or ‘up-regulation’, so that event-specific, meaningful sentences can be identified rather than generalized ones. This rule is also used to select, from multiple trigger words in the same sentence, the specific trigger word that describes the event accurately.

Examples of the generated rules:

highly_D0_Arg1, probably_D0_D1_Arg1, certainly_D0_D1_Arg2, confirm_D0_Arg1, etc.

<Keyword>confirm</Keyword> <D0>that</D0> <ARG1>binding</ARG1> <D1>of</D1> endogenous <ARG2>NFkappaB</ARG2> and <ARG3>AP1</ARG3>.

Example:

PMID: 9190901

We <Keyword>confirm</Keyword> that binding of endogenous NFkappaB and AP1 is induced following PMA/ionomycin treatment of T cells.

In the above example, the keyword ‘confirm’ described the certainty of the event ‘binding’.
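The two post-processing rules above can be illustrated with a short Python sketch. It rests on simplifying assumptions that are not taken from the rule engine itself: the pronoun is resolved to the closest tagged entity that precedes it, and a certainty cue is paired with every trigger word found in the same sentence. The function names and the small trigger list are hypothetical.

```python
import re

# Certainty cues named in the text; the trigger list is a hypothetical subset.
CERTAINTY_CUES = ["certainly", "highly", "confirm"]
TRIGGERS = ["binding", "activation", "up-regulation"]


def resolve_pronoun(sentence, entities, pronoun="it"):
    """Rule (i), simplified: map the first `pronoun` to the closest preceding entity."""
    match = re.search(r"\b%s\b" % re.escape(pronoun), sentence)
    if match is None:
        return None
    prefix = sentence[:match.start()]
    best, best_end = None, -1
    for entity in entities:
        for hit in re.finditer(re.escape(entity), prefix):
            if hit.end() > best_end:
                best, best_end = entity, hit.end()
    return best


def certainty_pairs(sentence):
    """Rule (ii), simplified: pair each certainty cue with each co-mentioned trigger."""
    def present(words):
        return [w for w in words
                if re.search(r"\b%s\b" % re.escape(w), sentence, re.IGNORECASE)]
    return [(cue, trig) for cue in present(CERTAINTY_CUES) for trig in present(TRIGGERS)]


s1 = ("Expression of GrpL is restricted to hematopoietic tissues and it is "
      "distinguished from Grb2 by having a proline-rich region.")
s2 = ("We confirm that binding of endogenous NFkappaB and AP1 is induced "
      "following PMA/ionomycin treatment of T cells.")

antecedent = resolve_pronoun(s1, ["GrpL", "Grb2"])
pairs = certainty_pairs(s2)
print(antecedent, pairs)
```

On the two example sentences, the sketch links “it” to GrpL and pairs the cue “confirm” with the trigger “binding”, matching the readings given above.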

The lexico-syntactic pattern-based rule engine for detecting the participating themes in the events was built from an analysis of the training data. We developed a pattern matching module using Java Regex [22,23,24], coupled with the above process, to detect the arguments in the event classes.

3 Results and discussion

3.1 Dataset

BioNLP-ST-2009 [16] was the first to introduce three tasks based on the GENIA corpus [20]: detection of core events, recognition of event arguments, and negation/speculation detection. In BioNLP-ST-2011 [17], the tasks were expanded with resources to capture more text and event types. The GENIA Event extraction (GE) task was retained and augmented with three focused event tasks, namely (i) epigenetic and post-translational modification (EPI), (ii) bacteria biotope (BB) and bacteria interaction (BI), and (iii) infectious diseases (ID) [17]. The application domains were further expanded in BioNLP-ST-2013 [18], which kept the GE and BB tasks and added cancer genetics (CG), gene regulation ontology (GRO), and pathway curation (PC).

To assess the performance of our approach, we employed four corpora: three from the BioNLP shared tasks (BioNLP-09 [16], BioNLP-11 [17], BioNLP-13 [18]) and one other standard corpus, GENIA-MK (Meta-knowledge) [21], which is currently available and widely used for event extraction tasks. All four corpora were used to train and test the models of our approach. The statistics of the three BioNLP-ST corpora and the GENIA-MK corpus are given in Table 3.

Table 3 Corpus statistics (Abs—Abstract, Full—Full text articles)

3.2 Evaluation metrics

Our event extraction system was evaluated using the standard metrics precision (P), recall (R), and F-score (F). The shared task online evaluation server was used to evaluate performance on the BioNLP-ST (2009, 2011, 2013) corpora. The results reported for our system are based on the approximate span matching and approximate string matching evaluation measures. For the GENIA-MK corpus, we used 10-fold cross-validation: the corpus was divided into 10 subsets and, in each run, 90% of the data was used as the training set and the remaining 10% as the test set.
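For reference, the metrics and the cross-validation split described above can be written out as a small Python sketch. The counts passed to `prf` are made-up numbers, and the contiguous, unshuffled fold assignment is an assumption; the text does not state how the GENIA-MK subsets were drawn.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score (harmonic mean) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds (assumed layout)."""
    indices = list(range(n_items))
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Made-up counts, for illustration only.
p, r, f = prf(tp=8, fp=2, fn=2)
splits = list(k_fold_splits(100, k=10))
print(round(p, 2), round(r, 2), round(f, 2), len(splits))
```

With 100 items and k = 10, each run trains on 90 items and tests on the remaining 10, and every item is used for testing exactly once.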

3.3 Evaluation results

We trained and tested our approach on the BioNLP-ST 2009, 2011, 2013 and GENIA-MK corpora with the feature-based linear kernel, the MLG kernel, and the ensemble kernel, and then assessed the performance of each configuration.

Table 4 reports the results on the BioNLP-ST corpora. First, we implemented the ensemble feature-based approach on the BioNLP-ST-2009 corpus; as Table 4 shows, the feature-based approach yields high precision and low recall. Next, we applied the ensemble MLG kernel-based approach to the corpus, which yields high precision and high recall and moderately increases the F-score. Finally, we combined both ensemble kernels, which takes advantage of both the feature-based and MLG kernel-based output models, and attained a competitive F-score. We applied the same methods to the BioNLP-ST-2011 and BioNLP-ST-2013 corpora.
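One common way to combine two kernels is a weighted sum of their Gram matrices, and the Python sketch below shows that form. It is an assumption that the ensemble used here takes this form; the coefficient pair `(a, b)`, the function name, and the toy matrices are illustrative and are not values from this study.

```python
def combine_kernels(k_feat, k_graph, a=0.5, b=0.5):
    """Return the weighted sum a*k_feat + b*k_graph of two equally shaped Gram matrices."""
    same_shape = len(k_feat) == len(k_graph) and all(
        len(row_f) == len(row_g) for row_f, row_g in zip(k_feat, k_graph)
    )
    if not same_shape:
        raise ValueError("kernel matrices must have the same shape")
    return [
        [a * x + b * y for x, y in zip(row_f, row_g)]
        for row_f, row_g in zip(k_feat, k_graph)
    ]


# Toy 2x2 Gram matrices over the same two sentences (made-up values).
K_FEAT = [[1.0, 0.2], [0.2, 1.0]]
K_GRAPH = [[1.0, 0.6], [0.6, 1.0]]

K_ENSEMBLE = combine_kernels(K_FEAT, K_GRAPH, a=0.5, b=0.5)
print(K_ENSEMBLE)
```

A sum of positive semidefinite kernels with non-negative coefficients is again positive semidefinite, so a matrix combined this way can be passed to any kernel classifier that accepts a precomputed kernel.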

Table 4 Results on BioNLP-ST Corpora

Table 5 reports the results of the same approach on the GENIA-MK corpus. The experimental results show that our approach attained better results on GENIA-MK than on the BioNLP-ST corpora. Figure 6 depicts the Receiver Operating Characteristic (ROC) curves of the three kernels for all four corpora.

Table 5 Results on GENIA-MK Corpus
Fig. 6 ROC plotting results for four corpora: (a) BioNLP-ST-2009, (b) BioNLP-ST-2011, (c) BioNLP-ST-2013, (d) GENIA-MK

To classify the events individually, every event type needs a variety of features that reflect its diverse context and linguistic characteristics. For example, compared to events such as gene expression, transcription, and localization, the regulation events need more token-based, concept-based, and syntactic information. By implementing a feature-based approach, we modeled the higher complexity associated with their phrasal and linguistic contexts and thereby prepared our model to identify the individual events. Next, the feature-based approach was coupled with the MLG kernel, which takes advantage of both feature-based and graph kernel-based approaches and generated state-of-the-art performance in the extraction of individual classes of events. Table 6 shows the results for individual classes of events in the four corpora obtained with our ensemble approach.

Table 6 Results for individual classes of events on BioNLP-ST 2009, 2011, 2013 and GENIA-MK

Next, we compare our approach with other state-of-the-art approaches developed on the BioNLP-ST 2009, 2011, 2013 and GENIA-MK corpora. Comparisons show that our proposed approach performs better than other state-of-the-art approaches. Tables 7 and 8 show the comparisons.

Table 7 Comparative analysis on the BioNLP-ST corpora (BioNLP-ST-09, BioNLP-ST-11, and BioNLP-ST-13) based on F-score (%)
Table 8 Comparative analysis on the GENIA-MK Corpus in terms of F-score

3.4 Discussion

In our methodology, we implemented a feature-based linear kernel, an MLG kernel, and lexico-syntactic pattern-based approaches to extract biomedical events. Some interesting findings from our approach are discussed below. The baseline feature-based linear kernel successfully captured grammatical, syntactic, morphological, orthographic, and sentence-level global information. Morphological and orthographic features were used to describe the structure of the words in a given sentence. Parsing features were employed to discover the grammatical and syntactic expressions of the event sentences. Combining these features and methodologies in the feature-based linear kernel gave a strong baseline for extracting events from the biomedical literature.

The MLG kernel was used to compare the structure of graphs at multiple scales. Mining subgraphs is an important phase in the MLG kernel because each generated subgraph is compared through its constituent sub-subgraphs. The MLG kernel first accepts a universal dependency structure, which provides a direct dependency path between the trigger words and named entities. It combines the baseline graph Laplacian kernel with feature representations originating from nested neighborhoods. Finally, the MLG kernel considers both the overall and the local graph structure to learn similarities at multiple levels. Taking all this into account, we believe that by employing the MLG kernel our system was able to capture not only the topological relationships between the individual event nodes but also those between the subgraphs.

Example:

PMCID 1310901

Sentence: Down regulation of interferon regulatory factor 4 gene expression in leukemic cells

In the above example, the words in the sentence were converted to universal dependencies (UD) and then to an adjacency matrix. The MLG kernel first assigns node degrees, based on the UD and the adjacency matrix, to the graph generated for the sentence. In this example, words such as “expression” and “factor” were assigned a high node degree.
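The conversion from dependency edges to an adjacency matrix and node degrees can be sketched in a few lines of Python. The edge list below is a hand-written, illustrative parse of the example sentence; it is an assumption made for this sketch and is not the output of the parser used in this study.

```python
# Hand-written (head, dependent) edges for the example sentence.
# Assumption: illustrative only, not actual UD parser output.
EDGES = [
    ("regulation", "Down"), ("regulation", "expression"),
    ("expression", "of"), ("expression", "factor"),
    ("expression", "gene"), ("expression", "cells"),
    ("factor", "interferon"), ("factor", "regulatory"), ("factor", "4"),
    ("cells", "in"), ("cells", "leukemic"),
]


def adjacency(edges):
    """Build a symmetric 0/1 adjacency matrix over the words in `edges`."""
    nodes = sorted({word for edge in edges for word in edge})
    index = {word: i for i, word in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for head, dep in edges:
        matrix[index[head]][index[dep]] = 1
        matrix[index[dep]][index[head]] = 1
    return nodes, matrix


def laplacian(matrix):
    """Graph Laplacian L = D - A, with D the diagonal degree matrix."""
    n = len(matrix)
    return [
        [(sum(matrix[i]) if i == j else 0) - matrix[i][j] for j in range(n)]
        for i in range(n)
    ]


NODES, A = adjacency(EDGES)
DEGREE = {word: sum(row) for word, row in zip(NODES, A)}
L = laplacian(A)
TOP_TWO = sorted(DEGREE, key=DEGREE.get, reverse=True)[:2]
print(TOP_TWO)
```

Under this illustrative parse, “expression” and “factor” receive the highest degrees, in line with the observation above; a different parse could rank the words differently.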

In general, MLG captures the graph structure at multiple scales. This is achieved by increasing the depth of the neighborhood around each vertex in the graph. In addition, MLG captures the neighborhood similarity among the vertices and uses this similarity score to induce the feature vectors. The current study exploits this technique: the biomolecular event sentence is searched at multiple scales to find the relations between events and the target proteins, using the graph generated from the adjacency matrix of the sentence. An interesting observation is that cue words such as gene and expression, regulation and factor, and leukemia and cells were connected in the graph. In the following steps, subgraphs were mined from the sentence graph and then merged into larger subgraphs. As a result, words such as interferon, regulatory, factor, gene, and expression were added to a single subgraph. We therefore believe that our subgraph mining-based MLG kernel played an important role in capturing the key information in the biomedical event sentences.
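The idea of growing neighborhoods can be illustrated with a breadth-first search that collects all words within a given radius of a vertex. This Python sketch shows only the nested neighborhoods; it does not compute the MLG kernel itself, and the edge list is the same kind of hand-written, illustrative parse (an assumption, not parser output).

```python
from collections import deque

# Hand-written, illustrative dependency edges (assumption, not parser output).
EDGES = [
    ("regulation", "Down"), ("regulation", "expression"),
    ("expression", "of"), ("expression", "factor"),
    ("expression", "gene"), ("expression", "cells"),
    ("factor", "interferon"), ("factor", "regulatory"), ("factor", "4"),
    ("cells", "in"), ("cells", "leukemic"),
]


def neighbor_map(edges):
    """Undirected neighbor sets for every word in the edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def ball(adj, center, radius):
    """All vertices within `radius` hops of `center` (breadth-first search)."""
    depth = {center: 0}
    queue = deque([center])
    while queue:
        vertex = queue.popleft()
        if depth[vertex] == radius:
            continue  # do not expand beyond the requested radius
        for nxt in adj[vertex]:
            if nxt not in depth:
                depth[nxt] = depth[vertex] + 1
                queue.append(nxt)
    return set(depth)


ADJ = neighbor_map(EDGES)
LEVELS = [ball(ADJ, "factor", r) for r in (1, 2, 3)]
for r, level in zip((1, 2, 3), LEVELS):
    print(r, sorted(level))
```

Each level contains the previous one, so the neighborhood of “factor” grows from its direct dependents to a subgraph that also includes “gene” and “expression”, and finally to the whole sentence graph.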

The association among the subgraphs for complex event extraction using the MLG kernel is shown in Fig. 7 for a sample sentence from PMID 1335418. From the sentence, the MLG kernel first detects the small subgraphs at level one as entity names and event trigger words (for example, cAMP and accumulation). At level two, the kernel identifies the relationship between the trigger word and the corresponding proteins by accumulating multiple subgraphs (activation, cells, Jurkat, protein, kinase). Finally, at level three, the larger subgraphs were mined, thereby identifying the complex event. The subgraph mining process was repeated until the low-rank approximation was observed to improve the classification accuracy.

Fig. 7 Extraction of complex events by identification of association among subgraphs in the MLG kernel. (Green rectangles represent trigger words and blue rectangles represent proteins. The dotted circles in violet (Level 1), red (Level 2), and blue (Level 3) represent the levels of subgraph mining) (color figure online)

Example:

PMID: 1335418

We have earlier found that in Jurkat cells activation of protein kinase C (PKC) enhances the cyclic adenosine monophosphate (cAMP) accumulation induced by adenosine receptor stimulation or activation of Gs.

Next, in argument detection, we employed lexico-syntactic semantic role labeling and contextual pattern-based rules to extract the event arguments efficiently. Lexico-syntactic patterns were used to detect domain-specific, ontology-based concepts and relationships effectively. In the event extraction task, lexico-syntactic patterns combined with semantic role labeling require significantly less time than plain lexico-syntactic patterns. Examples are given in detail in the methods (Sect. 2.3.1). A few interesting advantages of using lexico-syntactic patterns for event argument detection are illustrated in the following examples.

Example 1:

PMID: 10330189

In response to activation of the Wnt signaling pathway, beta-catenin accumulates in the nucleus, where <Keyword> it </Keyword> cooperates with LEF/TCF (for lymphoid enhancer factor and T-cell factor) transcription factors to activate gene expression.

In Example 1, the pronoun “it” denotes the protein beta-catenin, which participated in the events “gene expression” and “transcription”.

Example 2:

PMID: 10087185

Induction of NFkappaB is a <Keyword>highly</Keyword> regulated process requiring Phosphorylation.

In Example 2, the keyword “highly” denotes the certainty of the event “Phosphorylation”.

The syntactic rules based on event-specific argument structure were applied after contextual patterns and semantic role labeling to detect arguments. These rules acted as a post-processing step and improved the performance of the argument detection phase.

Although our system performs well, it has some limitations. The major source of errors in the argument detection phase concerns events with multiple arguments: events containing more than three arguments were difficult to extract. For example, in the sentence “Leukotriene B4 stimulates c-fos and c-jun gene transcription and AP-1 binding activity in human monocytes” (PMID: 1313226), the regulation event has several arguments and, at the same time, takes other events as arguments.

4 Conclusions and future enhancements

In this paper, we deployed a hybrid system that combines ensemble feature-based and graph-based kernels with lexico-syntactic patterns to extract biomedical events from the literature. Our Multiscale Laplacian Graph (MLG) kernel-based approach can detect the topological relationships between event nodes at multiple scales and identify the associations among the subgraphs for complex events. To the best of our knowledge, we are the first to introduce the MLG kernel for the event extraction task. Since features play a crucial role in supervised machine learning, especially in event extraction, we used a wide variety of features, represented broadly as token-based, sentence-based, parsing, and domain-specific features, to generate a feature-based kernel. Finally, we combined both ensemble kernels to generate a robust event classifier. In addition, in the argument detection phase, we employed a rule engine based on lexico-syntactic semantic role labeling and contextual patterns to extract the event arguments. We incorporated contextual patterns, semantic role labeling, and event-specific argument structure to detect the domain-specific, ontology-based concepts and relationships effectively. In the future, we plan to employ automatic feature extraction approaches, advanced universal dependencies, and different coefficient pairs for kernel ensembling to extract events from the literature, and we will apply this system to various biological relation extraction tasks such as Chemical Induced Disease (CID), Disease-Drug Interactions (DDIs) and Protein–Protein Interactions (PPIs).