
1 Introduction

A knowledge graph is designed to describe the entities and relationships of a domain. Domel [1] introduced the idea of a Web map to provide navigation support for hypertext browsers. Google leveraged a knowledge graph to improve the quality of search results by building semantic relationships over information extracted from a multitude of heterogeneous sources [2]. The Google knowledge graph was based on data from Wikipedia, Freebase [3] as well as public databases. Microsoft developed the Bing search engine based on the Satori knowledge base, which provides a variety of search services across the Web, videos, images and maps. Pujara et al. [4] proposed knowledge graph identification based on ontology-aware partitioning to obtain better results. Arnaout and Elbassuoni [5] proposed a general framework that extended both the search knowledge graph and triple-pattern queries for effective searching of RDF knowledge graphs. Fionda et al. [6] introduced a formalism for the Web of linked data. Ontology is important in helping domain experts regulate and annotate knowledge in their fields. Originally a philosophical theory, in the context of computer and information sciences an ontology defines a set of representational primitives to model domain knowledge or discourse [7]. It is a specification for modeling concepts, an abstract model describing the domain, and a formal definition of the concepts and their linkages. It includes classes (concepts), slots (roles or properties) and facets (role restrictions). An ontology, together with individual instances of its classes, comprises a knowledge base [8–23].

Further, in the past decade, we have seen enormous growth in the volume of data being generated across a wide spectrum of areas. With an increasing number of devices connected to the Internet, the volume of data is expected to grow dramatically in the coming years. As data generation and consumption increase, it is vital to plan how the data should be stored so that data processing is efficient. Based on how it is stored, data can be categorized as follows:

  1. (a)

    Structured data: This type of data is typically stored in a relational database, such as MySQL or Oracle Database, that follows a schema, which makes it easier to analyze.

  2. (b)

    Semi-structured data: While this type of data is not stored in a relational database, it still follows some organization and structure, such as the XML format.

  3. (c)

    Unstructured data: This kind of data does not follow any schema or structure and is often found in sources like Word documents, PDFs and blogs. It is the hardest to analyze and requires Natural Language Processing (NLP) algorithms to process.

When data contains relationships between various entities of a domain, it is important that this be reflected in the way the data is stored and processed to serve applications. A knowledge graph (KG) is a structured form for capturing these entities and relationships from various structured and unstructured data sources. While KGs have been gaining traction on open-source datasets that contain information found on the Web, there have not been significant efforts to build KGs for industrial applications. Industrial applications can benefit significantly from a KG representation of their data. For example, a typical software application containing several modules, classes and functions with inter-dependencies can leverage a KG representation.

2 Knowledge Graph Development Pipeline and Components

The quality of a KG depends on the nodes and relationships extracted from the data. KG development typically consists of two major phases: knowledge extraction and knowledge completion. Knowledge extraction consists of the tasks of named entity recognition (NER) and relation extraction (RE). For example, consider the sentence: Barack Obama was born in Honolulu. In this sentence, the named entities can be classified as {Barack Obama: Person} and {Honolulu: Location}, and the relation between the two entities is {Barack Obama, born in, Honolulu}. Figure 1 depicts the step-by-step knowledge graph development pipeline from structured/semi-structured (S.S) and unstructured (U.S) data. The details of each KG component are discussed in the next section.

Fig. 1 Knowledge graph development pipeline

3 Knowledge Extraction

3.1 Entity Extraction

The first main task in the pipeline to build a KG is to extract the named entities from the data. The choice of algorithm for a NER task depends largely on the amount of labeled data available to train the models. Some of the popular algorithms for NER are explained below.

  1. A.

    Conditional Random Fields (CRFs)

CRF is a discriminative undirected graphical model that learns the conditional probability distribution over the various classes of named entities in the data [9]. It treats the input data as sequential and takes the surrounding context into consideration while making a prediction for a given word. In comparison with Markov random fields, which model the prior and the likelihood, a CRF models the posterior probability of the class labels given the inputs directly. The size of the context window determines the order of the CRF model. For example, if the label of the previous word is also considered while predicting the current word, it is a first-order CRF; if the two previous words are considered, it is a second-order CRF. The standard linear-chain CRF model equations are as follows:

$$p(\textbf{y}|\textbf{x}) = \frac{1}{{Z(\textbf{x})}} \prod\limits_{t=1}^{T}\exp\left\{ \sum\limits_{k=1}^{K} \theta_{k}f_{k}(y_{t},y_{t-1}, \textbf{x}_{t} ) \right\}$$
(1)
$$Z(\textbf{x}) = \sum\limits_{\textbf{y}} \prod\limits_{t=1}^{T}\exp\left\{ \sum\limits_{k=1}^{K} \theta_{k}f_{k}(y_{t},y_{t-1}, \textbf{x}_{t} ) \right\}$$
(2)

where $\textbf{y}$: predicted label sequence of the sentence, $\textbf{x}$: sequence of words in the sentence, $T$: total number of words in the sentence, $K$: total number of feature functions, $y_t$: label of the word at position $t$, $y_{t-1}$: label of the word at position $t-1$, $\textbf{x}_t$: word at position $t$, $f_k$: $k$th feature function, $Z$: normalizing function that sums over all possible label assignments, $\theta_k$: weight associated with the $k$th feature function.

Each word is represented by an n-dimensional feature vector. These feature vectors can either be manually designed or derived from word embedding models. Feature functions play the most important role in the decision making of the CRF model. They take as input the feature vector of the word, the label of the current word and the label of the previous word. For example, consider the feature function $f_k(y_t, y_{t-1}, \textbf{x}_t) = 1$ if $y_t = \text{Location}$ and $y_{t-1} = \text{Person}$, and $0$ otherwise. A positive weight for this feature function signifies that an entity of type Location typically follows an entity of type Person. A CRF model consists of several such feature functions and learns a weight for each of them during the training phase. The parameters $\theta$ of the CRF model are learned using Maximum A Posteriori (MAP) estimation with the help of optimization algorithms like stochastic gradient descent [10, 11].
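To make the feature-function machinery concrete, below is a minimal sketch of a linear-chain CRF tagger for NER. It assumes the sklearn-crfsuite package; the hand-designed features, BIO-style labels and the one-sentence training set are illustrative only.

```python
# Minimal linear-chain CRF tagger sketch (assumes the sklearn-crfsuite package).
import sklearn_crfsuite

def word_features(sent, t):
    """Hand-designed features for the word at position t (illustrative only)."""
    word = sent[t]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "suffix3": word[-3:],
    }
    if t > 0:
        # Context from the previous word gives the model its first-order flavour.
        feats["prev.word.lower"] = sent[t - 1].lower()
        feats["prev.istitle"] = sent[t - 1].istitle()
    else:
        feats["BOS"] = True
    return feats

# Toy training data: one labelled sentence with BIO-style entity tags.
train_sents = [["Barack", "Obama", "was", "born", "in", "Honolulu"]]
train_labels = [["B-PER", "I-PER", "O", "O", "O", "B-LOC"]]
X_train = [[word_features(s, t) for t in range(len(s))] for s in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, train_labels)

test = ["Angela", "Merkel", "was", "born", "in", "Hamburg"]
print(crf.predict([[word_features(test, t) for t in range(len(test))]]))
```

The c1 and c2 arguments are L1/L2 regularization strengths; in practice they would be tuned on held-out data.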

  1. B.

    Bidirectional Long Short-Term Memory (BiLSTM) Neural Networks

Natural language text is sequential in nature, and exploiting this characteristic was the main motivation behind recurrent neural networks (RNNs). RNNs are sequential neural network models that take into account the words that have occurred up to the word for which the prediction has to be made. With this property, the prediction for a word in a sentence is influenced by all the words that have occurred before it, thereby giving more context to the model. RNN models perform well when sentences are short. For longer sentences, RNN performance decreases significantly due to two major problems:

  1. (i)

    Vanishing Gradient: This happens when the gradient of the loss function with respect to the weights in the initial layers becomes close to zero due to successive multiplications of gradient values less than one. This can happen when the activation function is a sigmoid, whose derivative is always less than one. As a result, the weights in the early layers of the network stop learning, which causes the output to become insensitive to words occurring at the beginning of the sentence.

  2. (ii)

    Exploding Gradient: Contrary to the vanishing gradient problem, an exploding gradient leads to very large updates to the weights of the neural network. This happens due to successive multiplications of values greater than one, for example when using unbounded activation functions like ReLU. Exploding gradients can lead to an unstable neural network and even numerical overflow in some cases.

To tackle these problems, LSTM [12] networks are now widely used in NLP applications to capture long-term dependencies in a sentence. LSTM networks have gates that regulate the information going in and out of an LSTM cell. The components of an LSTM cell are as follows:

  1. (a)

    Cell state: This is the memory held by each LSTM unit at every time step.

  2. (b)

    Hidden state: This is a regulated amount of cell state passed onto future time steps.

  3. (c)

    Input gate: It regulates the amount of new value computed in the LSTM cell being used for the cell state.

  4. (d)

    Forget gate: It regulates how much of the previous cell state value is retained.

  5. (e)

    Output gate: It regulates the amount of cell state that is allowed to be output from the LSTM cell.

The gates typically use a sigmoid activation function, as it gives an activation between 0 and 1. For example, if the forget gate value is 0, the entire value of the previous cell state is discarded. An LSTM network is thus able to selectively remember and forget long-term dependencies. As a result, LSTM networks have been widely used for tasks like NER where the sequence of words matters. While vanilla LSTM networks only look at the words that occur before a target word, BiLSTM networks (Fig. 2) look at the words occurring both before and after the target word. Having a bidirectional view significantly improves the performance of sequence models. A BiLSTM model functions in the same way as an LSTM model, except that it computes all the components for a forward pass as well as a backward pass. The hidden states from the forward and backward passes are concatenated and passed on for further computation.

Fig. 2 BiLSTM model architecture
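As an illustration of the architecture in Fig. 2, the following is a minimal BiLSTM sequence tagger in PyTorch; the vocabulary size, embedding and hidden dimensions and the number of tags are placeholders.

```python
# Minimal BiLSTM sequence-tagger sketch in PyTorch (dimensions/tagset are placeholders).
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True runs a forward and a backward pass over the sentence;
        # their hidden states are concatenated, hence 2 * hidden_dim below.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        x = self.embed(token_ids)          # (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)                # (batch, seq_len, 2 * hidden_dim)
        return self.fc(h)                  # per-token tag scores

model = BiLSTMTagger(vocab_size=5000, embed_dim=100, hidden_dim=64, num_tags=5)
dummy = torch.randint(0, 5000, (1, 6))    # one sentence of six token ids
print(model(dummy).shape)                 # torch.Size([1, 6, 5])
```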

3.2 Relation Extraction (RE) and Attribute Extraction (AE)

In most cases, knowledge is represented in the form of a relation triple in a KG. A relation triple is of the form <Subject, Relation, Object> or <Entity, Relation, Entity>. Many popular graph databases that follow the Resource Description Framework (RDF) store information in the form of triples. While the types of relations extracted from a corpus depend on the KG ontology, some of the popular RE algorithms in use today for extracting triples from natural language text are Stanford OpenIE [13] and ClausIE [14]. Further, attribute extraction (AE) is a special kind of relation extraction methodology where each attribute can be represented as a property (in labeled property graphs) or as a relation (in RDF graphs).

  1. A.

    Stanford OpenIE

Stanford OpenIE handles long sentences by breaking them into short, coherent clauses, which are then maximally shortened. This gives the algorithm larger contextual awareness and yields meaningful relation triples for any application consuming them. It uses a multi-class logistic regression classifier trained on labeled relations from the Wikipedia corpus, which is then used to generate relations for an unseen corpus.

  1. B.

    ClausIE

ClausIE extracts relations by separating the detection of information expressed in a sentence from its representation as extractions. It uses dependency parsing and a small set of domain-independent lexica. ClausIE needs no training data and is purely domain agnostic in its behavior. It detects clauses in the text and their types to generate meaningful relations.
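As a usage illustration of the first of these tools (not the exact pipeline described here), the sketch below drives Stanford OpenIE through stanza's CoreNLP client to extract triples from a sentence; it assumes a local CoreNLP installation reachable by the client.

```python
# Sketch: extracting <subject, relation, object> triples with Stanford OpenIE,
# driven through stanza's CoreNLP client (assumes CoreNLP is installed locally
# and CORENLP_HOME points to it).
from stanza.server import CoreNLPClient

text = "Barack Obama was born in Honolulu."

with CoreNLPClient(annotators=["openie"], be_quiet=True) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            # Each triple becomes one candidate edge of the knowledge graph.
            print(triple.subject, "|", triple.relation, "|", triple.object)
```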

4 Knowledge Graph Construction

KG construction (also called completion) is an important phase for making the graph less noisy and more refined. During the graph construction phase, entities and relations might be referred to by multiple, slightly different phrases. For example, Vladimir Putin and vLaDImir PUTin are the same entity and should be resolved into a single node in the KG. Graph completion consists of three important phases, namely entity resolution, link prediction and node labeling.

4.1 Entity Resolution

This task aims at finding nodes that refer to the same entity in the KG and resolving them into a single node. Entity resolution is also referred to as de-duplication. There are several methods to perform entity resolution:

  1. (i)

    String similarity: Using edit distance algorithms like Levenshtein distance, two node labels can be compared, and if the edit distance is below a chosen threshold, the two nodes can be merged.

  2. (ii)

    Embedding similarity: To merge nodes based on semantic similarity, vector embeddings are used to compare two nodes. A vector embedding of each node label can be created and compared using measures like cosine similarity. Vector embeddings are learned using neural networks and can take into account semantic meanings of words. For example, if there are two nodes, namely Russian President Putin and Vladimir Putin, the two nodes will be merged based on their high semantic similarity.

  3. (iii)

    Ontological rules: Based on a KG ontology, the category of the node as well as the relations it has with other nodes in the KG, entity resolution can be performed. For example, consider the following relations:

    R1: President_of (Russian President Putin, Russia)

    R2: President_of (Vladimir Putin, Russia)

    Then, with an ontological rule like:

    If R1 = President_of (A, B) and R2 = President_of (C, D) and B = D, then A = C.

With this rule, Russian President Putin and Vladimir Putin will be merged into a single node.
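A minimal sketch combining the first two strategies is shown below; embed() stands in for any sentence-embedding model (e.g. averaged word vectors), and both thresholds are illustrative.

```python
# Entity-resolution sketch: merge two node labels if either their string
# similarity or their embedding similarity exceeds a threshold.
from difflib import SequenceMatcher
import numpy as np

def string_sim(a, b):
    # Character-level similarity, robust to casing differences.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def should_merge(label_a, label_b, embed, s_thresh=0.85, e_thresh=0.80):
    if string_sim(label_a, label_b) >= s_thresh:
        return True                                   # e.g. "Vladimir Putin" vs "vLaDImir PUTin"
    return cosine(embed(label_a), embed(label_b)) >= e_thresh  # semantic match

# Usage, assuming embed() returns a fixed-size vector for a label:
# should_merge("Russian President Putin", "Vladimir Putin", embed)
```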

4.2 Link Prediction

Unstructured text can often contain missing information, which can lead to an incomplete KG. To tackle this problem, inferring missing relations between entities makes the KG more informative and complete. Link prediction is often done by following ontological rules. For example, consider the following relation:

    R1: President_of (Russian President Putin, Russia)

    Then, with an ontological rule like:

    If R1 = President_of (A, B), then create R2 = lives_in (A, B) if R2 does not exist.

With this rule, a new relation lives_in will be created between Russian President Putin and Russia. Although the above examples of entity resolution and link prediction are straightforward, both tasks are usually performed via a probabilistic framework known as Probabilistic Soft Logic (PSL) [15].
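A minimal sketch of applying such an ontological rule over a set of triples might look as follows; the plain Python set stands in for the graph store.

```python
# Rule-based link prediction sketch: President_of(A, B) => lives_in(A, B).
triples = {("Russian President Putin", "President_of", "Russia")}

def apply_lives_in_rule(triples):
    inferred = set()
    for subj, rel, obj in triples:
        # Create the lives_in relation only if it is not already present.
        if rel == "President_of" and (subj, "lives_in", obj) not in triples:
            inferred.add((subj, "lives_in", obj))
    return triples | inferred

print(apply_lives_in_rule(triples))
# {('Russian President Putin', 'President_of', 'Russia'),
#  ('Russian President Putin', 'lives_in', 'Russia')}
```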

4.3 Node Labeling

Node labeling is the task of appending a new node to the KG by labeling its correct entity class and attributes. PSL is a widely used framework for KG completion tasks. PSL learns weights for the rules that will be used in the completion tasks. These rules can contain logical operators, arithmetic operators or a combination of both. For example, a PSL rule can be represented as:

$$\langle \text{Weight} \rangle : \langle \text{Rule} \rangle \qquad 25.0 : \text{son\_of}(\text{A}, \text{B}) \;\&\; \text{son\_of}(\text{C}, \text{B}) \rightarrow \text{brother\_of}(\text{A}, \text{C})$$

Since a KG completion task depends on the soft-logic probabilistic score calculated from a weighted sum of PSL rules, these weights decide the influence of each rule. The weights are usually learned using optimization algorithms. In domain-specific applications, these rules, derived from the underlying ontology of the corpus, form the criteria by which nodes are merged and new relations are added to the KG.
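As a toy illustration of how one weighted rule contributes to that score, the sketch below evaluates the son_of/brother_of rule with Łukasiewicz soft logic (the relaxation PSL is built on); the truth values and the weight are made up.

```python
# Toy evaluation of a weighted PSL-style rule:
#   25.0 : son_of(A, B) & son_of(C, B) -> brother_of(A, C)
def lukasiewicz_and(a, b):
    # Soft conjunction used by PSL.
    return max(0.0, a + b - 1.0)

def distance_to_satisfaction(body, head):
    # How far the rule is from being satisfied; 0 means fully satisfied.
    return max(0.0, body - head)

weight = 25.0
body = lukasiewicz_and(0.9, 0.8)   # son_of(A, B) = 0.9, son_of(C, B) = 0.8
head = 0.4                         # current belief in brother_of(A, C)
penalty = weight * distance_to_satisfaction(body, head)
print(penalty)                     # inference would raise brother_of(A, C) to shrink this
```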

5 Knowledge Graph Challenges in Sparse Corpus

Some of the notable efforts in KG development include the Google Knowledge Graph, Never Ending Language Learner (NELL) [16], Yet Another Great Ontology (YAGO) [17] and ConceptNet [18]. These have captured several million entities and relations from large-scale datasets like Wikipedia, WordNet, etc. The challenges of building a KG in an industrial setting are multifold: the data is largely unstructured and sparse, and it contains a large amount of domain-specific lexica.

Due to the above challenges, off-the-shelf NLP algorithms perform significantly worse on industrial domain-specific data than on open-source datasets like Wikipedia. In the case of NER, while BiLSTM-based models perform extremely well when there is a large labeled corpus, their performance is severely hampered on small, sparsely labeled corpora (Fig. 3). CRF-based models perform well even on sparse corpora but require careful feature engineering for the domain-specific applications found in industry. The performance of existing RE algorithms also drops when they are applied to a domain-specific corpus. For example, Stanford OpenIE uses a pre-trained classifier trained on labeled relations from the Wikipedia corpus and uses this classifier to generate relations for an unseen corpus. The problem arises when the domain-specific corpus does not follow the same semantic patterns as learned by the classifier. While ClausIE generates a large number of relation triples, they often miss conditional phrases, which are extremely important in a domain-specific application. Another challenge with the existing RE algorithms is that they do not take into consideration the underlying ontology of the corpus. As a result, a large number of relations are missed or incorrectly formed when the RE algorithms are applied to an industrial corpus. To tackle the above challenges in NER and RE, we have developed improved versions of the CRF model and the ClausIE model to handle domain-specific corpora.

Fig. 3 BiLSTM performance with respect to training set size on the Groningen Meaning Bank (GMB) corpus

5.1 CRF with Automatic Feature Engineering-Based NER Model

As discussed above, existing CRF models have limitations when performing a NER task on a domain-specific corpus. To handle the sparsity of the data, efficient domain-specific feature engineering should be performed to capture the niche lexica. Manually performing feature engineering is laborious and often requires a domain expert to choose the relevant features. Another important choice in CRF-based models is the order of the model. For example, a first-order CRF model considers the label and features of the word at time step t − 1 in addition to the features of the word at time step t. The order should be chosen based on the n-gram distribution of the named entities found in the corpus. We made the following improvements to the existing CRF model for NER:

  1. (i)

    Order of the CRF model is chosen based on the most frequently occurring n-gram (at word level) in the named entities. For example, if most of the named entities in the corpus are bi-grams (2 words), then a first-order CRF model will be adequate. However, if the entities are mostly tri-gram (3 words), then a second-order CRF model will perform better as the context window increases in making a prediction.

  2. (ii)

    Automatic feature engineering is done by extracting n-grams (at the character level) from named entities, keeping those that occur in fewer than ‘C’ named entity classes. These n-grams are extracted from the leading and trailing ends of the words. For example, if all the entities belonging to the class ‘PERSON’ begin with the n-gram ‘per_’, then this n-gram feature will be useful in identifying PERSON entities.

The value of ‘C’ is decided based on the total number of entity classes. A small value of ‘C’ (<3) is recommended, since an n-gram with a high ‘C’ is a common occurrence across multiple entity classes and hence is not useful in distinguishing between entities of those classes. The recommended way to choose the value of ‘n’ is by using heuristics that depend on the taxonomy of entities in the domain. The larger the value of ‘n’, the more emphasis is placed on the entire entity word.
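A minimal sketch of this automatic feature-engineering step is shown below; the entity classes, entity strings and the values of n and C are illustrative.

```python
# Collect leading/trailing character n-grams of entity words and keep only those
# that occur in fewer than C entity classes (illustrative values of n and C).
from collections import defaultdict

def discriminative_ngrams(entities, n=4, C=2):
    """entities: dict mapping entity class -> list of entity strings."""
    ngram_classes = defaultdict(set)
    for cls, names in entities.items():
        for name in names:
            for word in name.split():
                ngram_classes[word[:n]].add(cls)    # leading n-gram
                ngram_classes[word[-n:]].add(cls)   # trailing n-gram
    # An n-gram seen in many classes does not discriminate between them.
    return {g for g, classes in ngram_classes.items() if len(classes) < C}

entities = {
    "PERSON": ["per_adam", "per_eve"],
    "SIGNAL": ["sig_speed", "sig_brake"],
}
print(discriminative_ngrams(entities))   # includes 'per_' and 'sig_'
```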

5.2 AugmentedIE

The quality of relation triples is perhaps the most important factor in determining the overall quality of a knowledge graph. As discussed earlier, an industrial domain corpus poses its own unique challenges, which have to be handled to create a meaningful KG. We made the following improvements to the existing ClausIE model to improve RE:

  1. (i)

    Grouping triples based on ontology: While the existing RE algorithms operate at a sentence level, it is important to consider the ontology of the domain while grouping triples. This leads to a fully connected graph and prevents dangling triples in the KG. Grouping triples is extremely important in an industrial application that consists of hierarchy between the various entity classes. For example, a complex system can have multiple modules, which in turn can have multiple functions, which in turn can have signals and variables. Adhering to such an ontology will help in creating new relations between entities.

  2. (ii)

    Co-reference resolution (Coref): It is very common to see entities referred to in different ways across a domain document. For example, consider the following sentences: Vladimir Putin is the president of Russia. He lives in Moscow. In these sentences, Vladimir Putin and He refer to the same entity. Before generating triples, He must be resolved to Vladimir Putin. Coref significantly improves the quality of triples generated from a corpus. A widely used Coref algorithm is Neuralcoref [19].

  3. (iii)

    Handling conditional sentences: The existing RE algorithms do not perform well in the case of conditional sentences. Some of the problems of ClausIE on conditional sentences are:

    • Triples miss the conditional phrase.

    • The cause and effect components of the conditional sentence are not coupled.

    For example, consider the following conditional sentence:

    C is ON if A is ON and if B is OFF. ClausIE generated the following triples for it: <C> <is> <ON if A is ON>, <C> <is> <ON if B is OFF>, <A> <is> <ON>, <B> <is> <OFF>.

    As we can see, the triples are noisy and incorrect. The ideal triple will be: <C is ON> <if> <A is ON and B is OFF>.

To tackle these challenges, we have developed a novel algorithm using a Constituency Parse Tree (CPT) to identify the cause and effect phrases in a conditional sentence. The CPT breaks down each sentence into a set of clauses, with the leaf nodes being Part-of-Speech (POS) tags for each word. Each causal condition is referred to as a Pre-condition, and each effect is referred to as a Post-condition. In the above example, A is ON and B is OFF are Pre-conditions and C is ON is a Post-condition. The algorithms are described below:

CPT pathfinder algorithm steps for finding Pre-conditions:

  1. Step 1:

    Apply Depth First Search (DFS) on the CPT to find a subtree with root node as ‘SBAR’ or ‘PP’. Let this subtree be Tn found at depth n of the CPT.

  2. Step 2:

    Apply Breadth First Search (BFS) for Tn and check if there exists a pair of child subtrees STx,n+1 and STx+1,n+1, where x and x + 1 are the index of child subtrees found at level n + 1, such that root node of STx,n+1 is ‘IN’ or begins with ‘WH’ and root of STx+1,n+1 is ‘S’.

  3. Step 3:

    Check if there exists more than one child subtree STy,n+2 of STx+1,n+1, where y is in range(0, number of subtrees of STx+1,n+1 at level n + 2), such that the root node of STy,n+2 is ‘S’. For each such subtree STy,n+2, perform Step 4. If none exist, perform Step 4 for STx+1,n+1.

  4. Step 4:

    Apply DFS and create a Pre-condition string appending each leaf node until a subtree is found that satisfies Step 1 and Step 2.

  5. Step 5:

    Continue the DFS of Step 1.

CPT pathfinder algorithm for finding Post-conditions:

  1. Step 1:

    Apply BFS on the CPT to find pairs of subtrees T′x′,n′ and T′x′+1,n′ with ‘P’ in the root node label, where n′ is the depth in the CPT and x′ and x′ + 1 are the BFS indices, respectively.

  2. Step 2:

    T’x’,n’ and T’x’+1,n’ should not be subtrees of any Pre-condition string tree STx+1,n+1. The leaves of T’x’,n’ and T’x’+1,n’ are appended in sequence to create a Post-condition string. Each such pair of subtrees creates a Post-condition string.

  3. Step 3:

    If an odd number of subtrees is found in Step 1, the leaves of the unpaired subtree create the Post-condition string.

Executing the above algorithms on the sentence C is ON if A is ON and if B is OFF, we get the following Pre-conditions and Post-conditions: Pre-conditions: (i) A is ON, (ii) B is OFF; Post-conditions: (i) C is ON.

As we can see, the algorithm successfully captures the cause and effect phrases in the conditional sentence. After we extract the Pre-conditions and Post-conditions, triples can be created as needed depending on the type of conjunction present in the sentence. In the above sentence, a triple can be extracted with the following template:

<POST-CONDITION1> <if> <PRE-CONDITION1 and PRE-CONDITION2>

<C is ON> <if> <A is ON and B is OFF>
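A simplified sketch of the Pre-/Post-condition extraction over an nltk constituency tree is shown below. The bracketed parse of the example sentence is hand-written for illustration, and the traversal condenses the pathfinder steps rather than implementing them verbatim.

```python
# Simplified pre-/post-condition extraction from a constituency parse tree (CPT).
from nltk.tree import Tree

# Hand-written parse of "C is ON if A is ON and if B is OFF" (illustrative only).
parse = Tree.fromstring(
    "(S (NP (NN C)) (VP (VBZ is) (ADJP (JJ ON))"
    " (SBAR (SBAR (IN if) (S (NP (NN A)) (VP (VBZ is) (ADJP (JJ ON)))))"
    " (CC and)"
    " (SBAR (IN if) (S (NP (NN B)) (VP (VBZ is) (ADJP (JJ OFF))))))))"
)

def pre_conditions(tree):
    """Clauses under an SBAR/PP whose first child is 'IN' or a 'WH*' tag."""
    conditions = []
    for sub in tree.subtrees(lambda t: t.label() in ("SBAR", "PP")):
        children = list(sub)
        if (len(children) >= 2
                and (children[0].label() == "IN" or children[0].label().startswith("WH"))
                and children[1].label() == "S"):
            conditions.append(" ".join(children[1].leaves()))
    return conditions

def post_condition(tree):
    """Leaves that lie outside every SBAR/PP pre-condition subtree."""
    words = []
    def walk(node):
        for child in node:
            if isinstance(child, Tree):
                if child.label() in ("SBAR", "PP"):
                    continue          # skip pre-condition subtrees entirely
                walk(child)
            else:
                words.append(child)
    walk(tree)
    return " ".join(words)

print(pre_conditions(parse))   # ['A is ON', 'B is OFF']
print(post_condition(parse))   # 'C is ON'
```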

6 Industrial Applications

KGs can be greatly leveraged in industrial applications where there is a large amount of data containing numerous entities and relationships. Some of the general use cases that a KG can serve are question-answering (QA) systems, recommendation systems and fraud detection. Specifically in the software development life cycle (SDLC), some of the applications where a KG can be used efficiently are test case generation, requirement analysis, and bug prediction and localization. Since every industrial application goes through a rigorous testing cycle, we have developed an automatic test case generation system called Requirement to Test Case (Req2Test) that leverages a KG to capture domain knowledge (Fig. 5) and overcomes the existing test development challenges (Fig. 4).

Fig. 4 Current test development challenges

Fig. 5 KG-based Req2Test

The first step in generating test cases is to extract the test intent from the requirement statement. The test intent describes three fundamental components required to create a test case:

  1. (i)

    Pre-conditions: These are all the conditions required to be satisfied before an Action or Post-condition of the requirement statement can be achieved.

  2. (ii)

    Actions: These are all the actions that are performed once the Pre-conditions have been satisfied.

  3. (iii)

    Post-conditions: These are all the conditions that are observed after the Pre-conditions and Actions have been satisfied.

For example, consider the requirement statement: If SignalA is on and ButtonA is pressed, then SignalB is on. The testable intent will be: Pre-conditions: SignalA is on, Actions: ButtonA is pressed, Post-conditions: SignalB is on.
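A toy sketch of extracting such a test intent from a simple "If ..., then ..." requirement is shown below. In the actual pipeline this is handled by the CPT algorithm and the NER model; the regular expression and the action-verb list here are placeholders.

```python
# Toy test-intent extraction for "If <conditions>, then <post-condition>" requirements.
import re

ACTION_VERBS = {"pressed", "set", "sent"}       # placeholder action vocabulary

def extract_intent(requirement):
    match = re.match(r"If (.+), then (.+)", requirement, flags=re.IGNORECASE)
    conditions = [c.strip() for c in re.split(r"\band\b", match.group(1))]
    intent = {
        "Pre-conditions": [],
        "Actions": [],
        "Post-conditions": [match.group(2).strip().rstrip(".")],
    }
    for cond in conditions:
        key = "Actions" if any(v in cond.lower() for v in ACTION_VERBS) else "Pre-conditions"
        intent[key].append(cond)
    return intent

print(extract_intent("If SignalA is on and ButtonA is pressed, then SignalB is on."))
# {'Pre-conditions': ['SignalA is on'], 'Actions': ['ButtonA is pressed'],
#  'Post-conditions': ['SignalB is on']}
```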

The Req2Test pipeline (Fig. 6) consists of two phases: Phase 1 deals with generating the KG from the available text corpus. Phase 2 deals with processing the requirement statements, extracting the test intent and querying the KG to generate the test case.

Fig. 6 KG pipeline for automatic generation of test cases

6.1 Phase 1—KG Creation

The information extraction tool consists of the CPT algorithm discussed above, the higher-order CRF (HOCRF) with automatic feature engineering-based NER model and Fasttext vector embeddings [24]. The CPT algorithm is executed on the text corpus to represent the entire domain with respect to the ontology (Fig. 7). NER is performed to check whether any of the Pre-conditions or Post-conditions are actually Actions; if so, these nodes are assigned to the Action entity instead. Sentence vector embeddings are used to map text found in the requirement statements to the nodes in the KG, as there is usually a variation between the descriptions found in the requirements and those in the KG. The KG (Fig. 8) is stored in a Neo4j graph database and queried using the Cypher language.

Fig. 7 KG ontology for Req2Test

Fig. 8 Sample KG for Req2Test in a Neo4j database
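A minimal sketch of storing and querying such triples from Python is shown below; it assumes the official neo4j driver, and the connection details, node labels and relationship type are illustrative rather than the actual Req2Test schema.

```python
# Sketch: writing and querying Req2Test-style nodes in Neo4j via Cypher.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Link a Post-condition node to one of its Pre-conditions.
    session.run(
        "MERGE (pre:PreCondition {text: $pre}) "
        "MERGE (post:PostCondition {text: $post}) "
        "MERGE (post)-[:IF]->(pre)",
        pre="A is ON", post="C is ON",
    )
    # Retrieve the Pre-conditions required for a given Post-condition.
    result = session.run(
        "MATCH (post:PostCondition {text: $post})-[:IF]->(pre) RETURN pre.text AS pre",
        post="C is ON",
    )
    print([record["pre"] for record in result])

driver.close()
```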

6.2 Phase 2—KG Based Req2Test Application

This phase uses the information extraction tool to extract the test intent from the requirement statements. Once the test intent is found, the corresponding nodes are queried in the KG. Since the requirements usually do not mention all the test case conditions, the missing conditions are retrieved from the KG.

7 Summary

A knowledge graph is a powerful tool that can be used to efficiently represent complex systems. While existing KGs have primarily leveraged structured data, new techniques need to be developed in NER and RE to handle semi-structured and unstructured data. The current NER and RE algorithms face challenges when applied to industrial domain applications. We have proposed two algorithms, namely a higher-order CRF with automatic feature engineering and AugmentedIE, that tackle challenges in off-the-shelf NER and RE models, respectively. There are several deep learning approaches that could be applied to further improve the performance of NER and RE models. For example, recent works have proposed leveraging transfer learning, where a pre-trained neural network is used as a starting point and fine-tuned with the small amount of labeled data available in the target domain. Another recent advancement in NLP uses attention models to improve the performance of NER and RE models. Attention models can remember long-term dependencies in text and use them to make better predictions.