
1 Introduction

The Fault Knowledge Graph (FKG) applies the knowledge graph (KG) to the field of faults. The KG has already been applied to finance [1, 2], medicine [3,4,5], biology [6, 7], agriculture [8, 9], journalism, education, question answering [10,11,12], and other industries, because it can display the relationships among various types of data and support efficient, intelligent information search. Fault domains such as mechanical engineering [13], trains [14], power grids [15], and power equipment [16] have all raised the need for building KGs, yet few KGs have actually been constructed successfully. Each domain shares some steps and differs in others when building a KG. Therefore, this paper surveys the architecture and key technologies of KGs in other fields, as well as the construction process and key technologies of the FKG; KGs in other fields can serve as a guide for completing the construction of the FKG. An FKG makes it possible to analyze the relationships among various faults, support fault prediction, and promote the development of the field. This paper provides an overview of the FKG, summarizes its construction process and key technologies, and surveys the open source tools and applications that may be used.

The structure of this paper is as follows:

  • Section 2 outlines the concept of the FKG and the reasons for its development, some key technologies, and its applications in the fault domain;

  • Section 3 presents the architecture of the general KG and its key technologies, which can guide the construction of the FKG, covering steps such as data acquisition, information extraction, knowledge fusion, knowledge processing, and knowledge storage;

  • Section 4 introduces open source tools and platforms that may be used in constructing the FKG;

  • Section 5 summarizes the FKG, analyzes the challenges and problems to be solved in its construction, and predicts its future development.

2 Overview

In this section, we describe the origins of the concept of the FKG and some of the reasons that motivated its development. We also briefly outline the scope of some of its applications to fault equipment.

The KG [17] was originally proposed in the Google Knowledge Graph project to enhance the Google search engine and improve the user experience. Later, the KG was applied to many fields owing to its advantages. However, applications of the KG to the field of faults are still relatively few, although there is a clear demand for constructing KGs in this field; for example, [13,14,15,16] all addressed needs of the fault domain. Therefore, in 2017, Yuan-cheng Lu et al. provided a formal definition of the Fault Knowledge Graph [13], based on the collection of a large amount of mechanical engineering fault data. The specific applications and the methods used are listed below in tabular form (Table 1).

Table 1. Fault domain application and construction technology

As can be seen from Table 1, the fault types include engineering machinery, trains, power grids, power equipment, and so on. In [13], Yuan-cheng Lu et al. provided the construction process of the FKG and proposed a data-driven iterative automatic KG construction method. In [16], the author used entity extraction and relation extraction techniques, combined with data from multi-source heterogeneous power equipment, to construct a power equipment KG and improve the efficiency of power equipment management. In [15], the author relied on information extraction.

3 Key Technologies

The general construction process of a knowledge graph includes Data Acquisition, Knowledge Acquisition, Knowledge Fusion, and Knowledge Processing [18], as shown below (Fig. 1).

Fig. 1. The knowledge graph construction process

3.1 Data Acquisition

The first problem facing the construction of a KG is the source of the data. There are three types of data sources: structured, semi-structured, and unstructured. Structured data is stored in databases and tables [19]. Semi-structured data has a specific but not very strict structure, such as XML data. Finally, unstructured data has no predefined data model, such as publications, web pages, or social media [20]. Most data sources are obtained by crawling unstructured data (such as encyclopedias).

As mentioned earlier, data can be obtained from web pages or extracted from databases. Obtaining data from the Internet requires web crawler technology. Reference [21] described Mercator, an extensible web crawler written entirely in Java. In [22], the author pointed out that web crawling is a recursive process, so the user has to impose restrictions, such as the maximum number of tags or documents to retrieve and a time limit. To improve the quality of the data, users can limit the crawl to a specific domain or file format and apply a blacklist of unwanted URLs/domains.

In [23], the authors presented crawling technology for micro-blog websites, including depth control, breadth control, and a URL controller. The URL controller is important because deciding whether a URL is suitable for crawling strongly affects the results obtained, for example by avoiding too many nodes or poor data quality.
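To make these controls concrete, the following minimal sketch (our illustration, not the actual systems of [21] or [23]) shows a breadth-first crawler with a depth limit, a domain restriction, and a URL blacklist; the domain, blacklist entries, and depth limit are hypothetical.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

ALLOWED_DOMAIN = "example.org"        # hypothetical crawl scope
BLACKLIST = ("/login", "/ads")        # unwanted URL paths
MAX_DEPTH = 2                         # depth control

def crawl(seed_url):
    seen, queue = {seed_url}, deque([(seed_url, 0)])
    while queue:                      # breadth control: FIFO queue
        url, depth = queue.popleft()
        html = requests.get(url, timeout=10).text
        yield url, html
        if depth >= MAX_DEPTH:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            parts = urlparse(link)
            # URL controller: keep only suitable, unseen, in-domain links
            if (parts.netloc.endswith(ALLOWED_DOMAIN)
                    and not any(b in parts.path for b in BLACKLIST)
                    and link not in seen):
                seen.add(link)
                queue.append((link, depth + 1))
```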

3.2 Information Extraction

A KG consists of nodes and the relationships between them. Nodes include entities, concepts, and literals. Entities are real-world individuals. Concepts represent sets of individuals with the same characteristics. Literals are strings indicating the specific values of some relations. Information extraction automatically extracts such elements from semi-structured and unstructured data [24]. It includes entity extraction, relationship extraction, and attribute extraction [25].

(1) Named Entity Recognition

Named Entity Recognition (NER) was introduced at the 6th Message Understanding Conference (MUC-6) in November 1995. Its core task is to identify and classify the proper nouns in a given text. Research on NER methods falls into three main categories: first, dictionary-based and rule-based methods; second, traditional machine learning methods, mainly conditional random fields (CRF) and support vector machines (SVM); third, deep learning methods such as convolutional neural networks (CNN) [73], recurrent neural networks (RNN) [74], Long Short-Term Memory networks (LSTM), bi-directional LSTM (BiLSTM), and Part-Aware LSTM (PLSTM).

A rule-based approach manually builds rules and then matches them against strings in the text. The most representative example is the DL-CoTrain method proposed by Collins [35], which presented two algorithms: one is a boosting-like framework, and the other directly optimizes the objective function. Similarly, rules can also be generated automatically through bootstrapping [36].
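As a minimal illustration of the rule- and dictionary-based idea (not DL-CoTrain itself), the sketch below matches a hypothetical fault-domain dictionary against text; real systems learn such rules from seed examples.

```python
import re

# Hypothetical fault-domain dictionary mapping surface forms to entity types.
FAULT_DICT = {
    "bearing wear": "FAULT",
    "overheating": "FAULT",
    "gearbox": "COMPONENT",
    "traction motor": "COMPONENT",
}

def dictionary_ner(text):
    """Return (start, end, label) spans by longest-first dictionary matching."""
    spans = []
    for term in sorted(FAULT_DICT, key=len, reverse=True):
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), FAULT_DICT[term]))
    return spans

print(dictionary_ner("Gearbox overheating often precedes bearing wear."))
```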

The machine learning based approach treats NER as a classification problem, mainly through sequence labeling. Its main models include the SVM (Support Vector Machine) [37], ME (Maximum Entropy) [38], HMM (Hidden Markov Model) [39, 40], and CRF (Conditional Random Field) [41].

In [85], Thanh Hai Dang et al. proposed a novel named entity model that combines conditional random fields with bidirectional long short-term memory and performs well in experiments. In [86], Hui Chen et al. proposed a simple but effective CNN-based network, the Gated Relation Network (GRN), which captures context information better than a plain CNN. At the same time, parallel recurrent neural networks and CNN-RNN hybrids have also advanced the development of NER.

(2) Relationship Extraction

The entity relationship extraction task was first proposed at the 7th Message Understanding Conference (MUC-7) in 1998 [60]. Research methods for entity relationship extraction include pattern matching [61], dictionary-driven methods [62], and machine learning methods. At present, entity relationship extraction mainly uses machine learning or deep learning. Machine learning algorithms treat relation extraction as a classification problem: a classifier is constructed on the training corpus and then applied to judge the category of an entity relationship.

Machine learning-based entity relationship extraction methods include supervised, unsupervised, weakly supervised, distantly supervised, and open-domain-oriented methods. The supervised method in [63] achieved good performance by determining the relationship between a given pair of entities from a sentence containing them; its disadvantage is that it requires a lot of manual labeling of training data. The unsupervised learning method [64] did not require a manually labeled corpus, but its relation extraction performance was poor. The earliest weakly supervised method was bootstrapping [65]; this algorithm is simple and easy to operate, but it relies on strong statistical assumptions, so its accuracy depends on whether those sampling assumptions hold. Later, distant supervision appeared, based on the assumption that if two entities have a relationship in the knowledge base, then any sentence containing both entities expresses that relationship to some degree. It aligns the corpus with an open knowledge base to label it automatically, reducing the dependence on manual annotation and enhancing cross-domain adaptability. However, it introduces a lot of noise into the corpus, so noise reduction has become a problem of great concern to scholars. Open-domain relation extraction is not limited to fixed relation categories or text classification and needs no labeled corpus, which makes it suitable for processing large-scale network data, but the extracted results require a lot of post-processing and there is no objective evaluation standard.
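To show how supervised methods cast relation extraction as classification, here is a minimal scikit-learn sketch; the marked sentences, entity tags, and relation labels are hypothetical toy data, not a benchmark.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training corpus: entity pairs marked inline, one relation label each.
sentences = [
    "<e1>bearing wear</e1> causes <e2>abnormal vibration</e2>",
    "the <e1>gearbox</e1> contains a <e2>planetary bearing</e2>",
]
labels = ["cause-effect", "component-whole"]

# Train a classifier over lexical n-gram features of the marked sentence.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)

print(clf.predict(["<e1>overheating</e1> causes <e2>insulation failure</e2>"]))
```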

The above methods require NLP tools, but these tools introduce systematic errors of their own, so the performance of downstream algorithms is degraded. Deep learning methods applied to relation extraction exploit automatic feature learning. SemEval-2010 Task 8 [66] is used as the test standard. Deep learning based algorithms include recurrent neural networks and convolutional neural networks. The RNN model of [67] used vectors and matrices to learn the meanings of propositional logic and natural language operators, but it relied on traditional NLP tools for semantic learning, thus reintroducing tool noise. Therefore, Zeng et al. [68] used convolutional neural networks to extract sentence-level and lexical features for relation extraction, reducing the preprocessing of input material. Nguyen et al. [69] then added convolution kernels of different sizes to the convolutional layer as filters to extract more features, and introduced position vectors. Lin et al. [70] adopted PCNN (piecewise CNN), which, instead of the single pooling of a traditional convolutional network, pools the feature map in three segments delimited by the two entity positions, and used a sentence-level selective attention mechanism over the neural model to mitigate the mislabeling problem.
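The following PyTorch sketch illustrates the CNN-with-position-vectors idea described above (in the spirit of [68, 69], not their exact architectures): each token is represented by a word embedding concatenated with two position embeddings encoding its distance to the head and tail entities, followed by convolution, max-over-time pooling, and a linear relation classifier.

```python
import torch
import torch.nn as nn

class RelationCNN(nn.Module):
    def __init__(self, vocab_size, n_positions, n_relations,
                 word_dim=50, pos_dim=5, n_filters=100, window=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # Position embeddings: relative distance to the head/tail entity.
        self.pos1_emb = nn.Embedding(n_positions, pos_dim)
        self.pos2_emb = nn.Embedding(n_positions, pos_dim)
        in_dim = word_dim + 2 * pos_dim
        self.conv = nn.Conv1d(in_dim, n_filters, window, padding=window // 2)
        self.fc = nn.Linear(n_filters, n_relations)

    def forward(self, words, pos1, pos2):
        x = torch.cat([self.word_emb(words), self.pos1_emb(pos1),
                       self.pos2_emb(pos2)], dim=-1)    # (batch, len, dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))    # (batch, filters, len)
        x = x.max(dim=2).values                         # max-over-time pooling
        return self.fc(x)                               # relation logits
```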

Distantly supervised relation extraction extends relation extraction to very large corpora. Neural relation extraction has made great progress in modeling sentences in a low-dimensional space, but it pays little attention to the entities themselves. In [78], entity embeddings within each sentence are first obtained with a tree-GRU over the context encoding of the dependency tree; attention across sentences is then used to obtain entity embeddings over the whole sentence set; finally, sentence embeddings and entity embeddings are combined to classify the relation, yielding better extraction performance. In 2019, Xiaoyu Guo et al. proposed a combination of CNN and RNN with single attention for relation classification. In the same year, the distantly supervised PLSTM-CNN [81] was used to implement relation extraction, the Multi-Gram CNN-based self-attention model [82] performed distantly supervised relation classification, and the attention-based Att-RCNN [83] was also used for relation classification. For the classification of semantic relations, Shen et al. proposed the neural-network-based ED-LSTM algorithm [84].

However, the above approaches extract entities and relations separately, which generates redundant information, so joint extraction was proposed: entities and their relation types are extracted within the same model, realizing parameter sharing and joint optimization and reducing the errors propagated from one extraction step to the next. Zheng et al. [71] used the shared underlying representation of a neural network for joint learning. Li et al. [72] proposed a joint structured extraction method using an incremental beam-search algorithm and a constraint method with global features.

3.3 Knowledge Fusion

Information extraction obtains entities, relationships, and entity attribute information from raw unstructured and semi-structured data, but this information contains a lot of redundant and erroneous content. Therefore, the extracted knowledge items need to be cleaned. Knowledge fusion aims to resolve errors in network data and errors introduced by knowledge extraction. It mainly includes entity linking and knowledge integration [26].

(1) Entity Linking

Entity linking is the operation of connecting an entity mention extracted from text to the corresponding item in an external database or knowledge base; Wikipedia is often used as one such knowledge base. Entity linking assigns a mention to the correct entity object through disambiguation and co-reference resolution [27]. Tomoaki Urata et al. proposed a disambiguation method for entity linking in micro-blogs using Wikipedia [28]. Their entity disambiguation has four steps. The first step obtains the candidate entities, retrieving a Wikipedia article for each candidate from the Wikipedia disambiguation page in preparation for disambiguation. The second step compares the entities nearest to the target mention with the Wikipedia article of each candidate and matches them against entities in Wikipedia. The third step calculates similarity: word2vec is used to obtain the entities related to the nearest entities. The final step multiplies the results of the second and third steps and selects the Wikipedia article with the highest score as the correct entity.
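A minimal sketch of the word2vec similarity step (the third step above) might look as follows, using Gensim vectors; the vector file name and word lists are assumptions for illustration.

```python
from gensim.models import KeyedVectors

# Hypothetical pre-trained vectors; any word2vec-format file would do.
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

def candidate_score(context_words, candidate_words):
    """Average word2vec similarity between mention context and a candidate."""
    pairs = [(c, k) for c in context_words for k in candidate_words
             if c in wv and k in wv]
    if not pairs:
        return 0.0
    return sum(wv.similarity(c, k) for c, k in pairs) / len(pairs)

# The candidate Wikipedia article with the highest combined score wins.
```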

However, relationships between some entities are missing, so link prediction is required. Among such algorithms, Daniel Neil et al. proposed Graph Convolutional Neural Networks (GCNNs) [77], which improve performance on clean datasets and accommodate the noise that is pervasive in real-world KGs. Moreover, the model is more interpretable, because it allows measuring the effect of a particular edge on a prediction by adjusting the link weight or removing the edge entirely. In 2019, Binling Nie et al. combined latent feature models and graph feature models to propose Text-enhanced KG Embedding (TKGE), which can perform inference over entities, relations, and text [80].

(2) Knowledge Integration

Knowledge integration combines knowledge from multiple, distributed, heterogeneous knowledge sources [18]. Entity linking, described above, links data extracted from semi-structured or unstructured sources, but structured data from external or relational databases must also be incorporated. The process of knowledge integration involves concept matching, entity matching, the evaluation of knowledge, and the resolution of conflicts [29].

3.4 Knowledge Processing

(1) Ontology Construction

An ontology is written in a formal, descriptive framework language. Ontologies include individuals, classes, attributes, function terms, constraints, rules, axioms, events, and so on; the structure of an ontology captures the relationships between things, their properties, their constraints, and more. In [30], the author proposed an automatic ontology extension method based on supervised learning and text clustering. The method uses the k-means algorithm to separate domain knowledge and to guide the creation of the training set for a Naïve Bayes classifier. Candidate words are then added to the target ontology, while noise words are added to a stop-word dictionary, so the ontology can be expanded automatically. The experimental results showed a very good expansion effect.
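A rough sketch of this pipeline (our simplification of [30], with toy documents) clusters domain text with k-means and uses the cluster assignments as weak labels for a Naïve Bayes classifier:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: two fault-domain snippets and one off-domain snippet.
docs = ["bearing wear abnormal vibration", "gearbox oil leak overheating",
        "quarterly revenue grew strongly"]

X = TfidfVectorizer().fit_transform(docs)
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # separate domains

# Cluster labels act as a (weak) training set for the classifier, which can
# then decide whether new candidate terms belong in the target ontology.
nb = MultinomialNB().fit(X, clusters)
```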

(2) Knowledge Inference

After ontology construction, many of the relationships between nodes of the KG are still vacant, so we need to fill them in through knowledge inference to improve the KG. In [31], Lucas Fonseca Navarro et al. proposed the Graph Rule Learner (GRL), a method for extracting inference rules from an ontology knowledge base mapped onto a graph, exploring combinations of link prediction indicators. The input to GRL is an ontology graph, and the output is a list of induced inference rules. GRL uses a neighbour-based link prediction metric [32] to rank candidate rules, and it is scalable, using a structure called Graph DB-Tree [33] to store the graph representation on disk.
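As a small illustration of neighbour-based link prediction scoring (the Adamic-Adar index here stands in for whichever metric [32] actually denotes), consider:

```python
import networkx as nx

# Toy ontology graph over hypothetical fault concepts.
G = nx.Graph([("bearing", "vibration"), ("bearing", "overheating"),
              ("gearbox", "vibration"), ("gearbox", "overheating")])

# Score the unobserved pair (bearing, gearbox) by their shared neighbours.
for u, v, score in nx.adamic_adar_index(G, [("bearing", "gearbox")]):
    print(u, v, round(score, 3))
```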

In [34], the author proposed a unified ontology reasoning framework that incorporates co-occurrence and subclass relationships; the framework can also build the ontology automatically. The article pointed out that subclass relationships can be obtained from WordNet, which covers almost all common nouns and provides the relationships between words, while co-occurrence relationships are obtained from the training set. This framework effectively combines concept reasoning with other methods and achieves superior performance.
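Reading subclass relations from WordNet is straightforward with NLTK, as the short sketch below shows (the word "motor" is an arbitrary example):

```python
from nltk.corpus import wordnet as wn  # needs nltk.download("wordnet") once

motor = wn.synsets("motor")[0]
print(motor.hypernyms())  # broader classes (superclasses) of "motor"
print(motor.hyponyms())   # more specific subclasses of "motor"
```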

In 2013, Bordes et al. proposed encoding triples into low-dimensional distributed vectors with the TransE model. TransE [42] is a translation model that treats the relation in a triple (head, relation, tail) as a translation from head to tail: by constantly adjusting the embeddings of head, relation, and tail, h + r is made as close to t as possible. But TransE handles certain relation mapping properties poorly, such as reflexivity. So, in 2014, Zhen Wang, Jianwen Zhang and others proposed the TransH model [43]. TransH models a relation as a hyperplane together with a translation operation on it, which preserves mapping properties such as reflexive, one-to-many, many-to-one, and many-to-many relations when embedding, while its efficiency and predictive accuracy are almost the same as TransE. Both TransE and TransH assume that entities and relations lie in the same space, but in fact an entity has multiple attributes, and different relations focus on different attributes of the entity; in other words, entities similar under a relation should be close to each other in the corresponding space, and dissimilar entities should be far apart. So in 2015, Yankai Lin et al. proposed the TransR model [44], which embeds entities and relations into different spaces and performs translations in the relation-specific space. The disadvantage of TransR is that it has too many parameters and is too complex. So, in the same year, Guoliang Ji et al. proposed the TransD model [45], which considers the diversity of both relations and entities. Its main idea is to represent each entity and relation with two vectors: one representing the entity or relation itself, and one used to dynamically construct the mapping matrix. Its advantages are fewer parameters and no matrix-vector multiplication, so it applies to large-scale graphs. These earlier representations oversimplified the loss metric and lacked the capacity to model complex entities and relations in a knowledge base, so Han Xiao et al. proposed the TransA model [46], which replaces the metric function with adaptive metric learning for embeddings and treats each dimension of the vector differently, improving representation ability. In 2016, to address heterogeneity and imbalance (some relations connect many entity pairs and others few, and the numbers of head and tail entities of many relations are unequal), Guoliang Ji et al. proposed the TranSparse model [47]. Its core idea is to replace the transfer matrix with an adaptive sparse matrix whose sparsity is determined by the number of entity pairs, thus preventing under-fitting on complex relations and over-fitting on simple ones. Then, to address multi-relation semantics, the TransG model emerged, using a Bayesian nonparametric Gaussian mixture model to generate multiple translation components for a relation; this uncovers latent semantics and embeds a triple with its mixture-specific component vector [48]. Finally, the line was extended to the KG2E model [49], whose new idea of using the covariance of multidimensional Gaussian distributions to represent the uncertainty of entities and relations can properly capture their certainty. All of the above models are extensions of the TransE model.
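To make the translation idea concrete, here is a minimal PyTorch sketch of TransE's scoring function and margin-based ranking loss (a simplified illustration of [42]; entity/relation counts and the negative-sampling step are left to the caller):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransE(nn.Module):
    def __init__(self, n_entities, n_relations, dim=50, margin=1.0):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        self.margin = margin

    def score(self, h, r, t):
        # Plausibility as distance: small ||h + r - t|| means h + r ≈ t.
        return (self.ent(h) + self.rel(r) - self.ent(t)).norm(p=2, dim=-1)

    def loss(self, pos, neg):
        # Margin ranking loss: positive triples should score lower (closer)
        # than corrupted triples by at least the margin.
        (h, r, t), (h2, r2, t2) = pos, neg
        return F.relu(self.margin + self.score(h, r, t)
                      - self.score(h2, r2, t2)).mean()
```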

3.5 Knowledge Storage

After obtaining the entities and the relationships between them, the data is used to form a KG. Usually, an ontology acts as the carrier of the KG. Web ontology languages include the Resource Description Framework (RDF) [57] and the Web Ontology Language (OWL) [58]. RDF describes resources on the web and their relationships to one another, covering resources, attributes, and relationships in the form of triples. RDFS appeared later, mainly adding vocabulary to extend the capability of RDF. OWL is a further extension of the RDF vocabulary; it has strong knowledge representation and reasoning ability and adds representations of classes, vocabulary, and relationships, so OWL is more expressive and more powerful than RDF.
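For RDF-based storage, triples can be written with rdflib as in the sketch below; the namespace and fault terms are hypothetical:

```python
from rdflib import Graph, Literal, Namespace

FKG = Namespace("http://example.org/fkg/")  # hypothetical namespace
g = Graph()
g.add((FKG.BearingWear, FKG.causes, FKG.AbnormalVibration))  # relation triple
g.add((FKG.BearingWear, FKG.severity, Literal("high")))      # attribute triple
print(g.serialize(format="turtle"))  # rdflib >= 6 returns a string
```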

The other option is storage in a graph database such as Neo4j [59]. The Neo4j graph database is based on property graphs and focuses on queries and search. Its disadvantage is that it does not support distribution; although OrientDB and JanusGraph (formerly Titan) do, they are immature, so Neo4j is the better choice for storing a KG. It stores the relationships separately from the nodes they connect. Finally, the triples are written as row vectors in CSV format and then imported into the Neo4j graph database, which is also convenient for future queries and optimization.
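A possible import path, assuming a CSV of triples with head/relation/tail columns placed in Neo4j's import directory and hypothetical connection credentials, is sketched below with the official Python driver:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # hypothetical

# Merge each CSV row into the graph as two nodes and one relationship.
LOAD = """
LOAD CSV WITH HEADERS FROM 'file:///fault_triples.csv' AS row
MERGE (h:Fault {name: row.head})
MERGE (t:Fault {name: row.tail})
MERGE (h)-[:RELATED {type: row.relation}]->(t)
"""

with driver.session() as session:
    session.run(LOAD)
driver.close()
```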

4 Tools and Platforms

For data processing, in addition to algorithms, information extraction can also be implemented with open source natural language processing toolkits. Below we introduce open source toolkits written in Python and Java, analyze their functions, and note which language each is suited to process, Chinese or English.

Gensim [50] is an open source Python toolkit that handles raw unstructured text and learns topic vector representations of text in an unsupervised way. It supports a variety of topic model algorithms including TF-IDF, LSA, LDA, and word2vec, supports streaming training, and provides API interfaces for common tasks such as similarity calculation and information retrieval.
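A short usage sketch of its TF-IDF and word2vec pipelines (toy tokenized corpus; Gensim >= 4 API):

```python
from gensim import corpora, models

texts = [["bearing", "wear", "vibration"], ["gearbox", "oil", "leak"]]

dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
tfidf = models.TfidfModel(bow)                 # TF-IDF weighting
w2v = models.Word2Vec(texts, vector_size=50,   # word2vec embeddings
                      window=2, min_count=1)

print(tfidf[bow[0]])
print(w2v.wv.most_similar("bearing", topn=1))
```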

Stanford CoreNLP [51] is an open source Java toolkit for relation extraction. Using methods such as supervised learning, distant supervision, and neural networks, it can analyze natural language text, including lemmatization, part-of-speech tagging, named entity tagging, coreference resolution, syntactic parsing, and dependency analysis.

NLTK [52] is a third-party Python toolkit. NLTK is suitable for processing English but has restrictions on handling Chinese: lacking a Chinese corpus, it has no Chinese stop-word list, so stop words in Chinese text cannot be filtered and Chinese text cannot be segmented. Therefore, NLTK is not suitable for processing Chinese text.
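For English, a typical NLTK pipeline looks like this (the sentence is an arbitrary example; the listed resources must be downloaded once):

```python
import nltk
# One-time: nltk.download() for punkt, averaged_perceptron_tagger,
# maxent_ne_chunker and words.

sent = "Siemens reported a traction motor failure in Berlin."
tokens = nltk.word_tokenize(sent)   # tokenization
tagged = nltk.pos_tag(tokens)       # part-of-speech tagging
print(nltk.ne_chunk(tagged))        # named-entity chunking (English only)
```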

FudanNLP [53] is a Java-based Chinese open source toolkit that contains machine learning algorithms and data sets. It enables information retrieval tasks such as text categorization and news clustering. For Chinese processing, it includes Chinese word segmentation, part-of-speech tagging, entity name recognition, keyword extraction, dependency syntax analysis, and temporal phrase recognition.

Stanford University has developed a toolkit that supports Chinese processing, namely DeepDive [54]. It is used for knowledge extraction, i.e., the extraction of triples, and extracts structured relational data from unstructured text through weakly supervised learning. The project has modified its natural language processing model package to support Chinese and provides a Chinese tutorial.

SOFIE [55] is an automated ontology extension system developed by the Max Planck Institute. It can extract ontology-based facts from text, link them into the ontology, and perform logical reasoning for disambiguation.

OpenCCG [56] is a Java-based open source natural language processing library that implements parsing based on Mark Steedman's Combinatory Categorial Grammar, including syntactic analysis and dependency analysis.

OLLIE is the triple extraction component of KnowItAll, a knowledge base project developed by the University of Washington. It extracts relations from grammatical dependency trees and can extract long-distance relations. ReVerb is another open triple extraction tool developed by the University of Washington; it extracts entity-relation triples from English sentences. Its advantage is that relations do not need to be specified in advance, and it supports information extraction at web scale, which greatly reduces manual intervention.

ICTCLAS (NLPIR) is a Chinese-language open source tool with interfaces in multiple languages, such as Java, C++, C, and C#. It is mainly used for Chinese processing, including Chinese word segmentation, named entity recognition, part-of-speech tagging, user-defined dictionaries, micro-blog-oriented segmentation, new word discovery, and keyword extraction. Its visual interface and API calls provide users with great convenience.

In summary, most open source toolkits for Chinese natural language processing are implemented in Java, while most toolkits for English are written in Python. NLTK is suitable for English processing; FudanNLP, DeepDive, and ICTCLAS are suitable for Chinese processing. Gensim processes text with unsupervised learning algorithms, while Stanford CoreNLP processes text with distant supervision and neural networks.

5 Conclusion and Future Challenges

KG is an important branch of artificial intelligence. It simulates the way people think and carries out efficient knowledge management and knowledge acquisition on data. This article aims to crawl fault data with web crawling technology and then generate a KG through natural language processing, which allows a clearer understanding of the relationships among faults, as well as analysis and prediction to prevent failures. Natural language processing here mainly means entity and relation extraction. As introduced earlier, many algorithms and tools exist. The algorithms generally include unsupervised methods, supervised methods, distant supervision methods, and bootstrapping methods. Entity and relation extraction can be performed either as a pipeline (entity extraction first, then relation extraction) or as joint extraction of entities and relations. Although separate extraction algorithms are more numerous today, their relation extraction is limited by the quality of entity extraction; joint extraction performs better, but no particularly mature algorithms exist yet. Specific entity relation extraction algorithms mainly include extraction based on deep convolutional neural networks, recurrent convolutional neural network algorithms, CRF-based named entity recognition, semi-supervised learning combining word rules with SVM models, and open-domain Chinese extraction. A major problem in entity relation extraction is the need for a large amount of manual annotation. Many tools can also be used to extract relations, such as NLTK, Stanford CoreNLP, OLLIE, SOFIE, and so on. However, NLTK is not suitable for processing Chinese text, as it lacks a Chinese corpus and Chinese stop words; indeed, most of the available tools are not well suited to processing Chinese text.

Importantly, most KG-related projects involve English. However, Chinese is a different language, and converting English KGs into Chinese is not feasible, so the construction of Chinese KGs is very significant. The main difficulty lies in the following three points: (I) the quality of data sources, (II) taxonomy derivation, and (III) knowledge harvesting [75].

Previous KGs were constructed as node-relationship-node structures, but a unified definition and standard expression of the KG is still lacking. Therefore, to make the KG clearer, Yucong Duan clarified the structure of the KG from the aspects of data, information, knowledge, and wisdom, and suggested specifying the KG in a gradual manner, including data graphs, information graphs, KGs, and wisdom graphs [76].