1 Introduction

The main aim of information system integration is to achieve centralized storage of, and full access to, the data from various information systems, to share workflows, and to provide an integrated information system for collaborative business. Various conflicts, e.g., diverse formats, naming conventions, and semantic heterogeneity, commonly occur when heterogeneous information from different information systems is integrated [1]. Hence, the key task of information system integration is to eliminate the heterogeneity of data and workflows between different information systems.

Ontology is one of the essential knowledge representation methods and has been widely adopted in the fields of data fusion and information system integration due to its machine readability and semantic interoperability. In particular, an ontology can represent the concepts, instances, relations, and axioms of a specified domain [2], which provides an opportunity to integrate heterogeneous data and information systems at the semantic level. Hence, ontology-based information integration approaches have been playing a critical role in the integration of information systems. Traditionally, ontology construction is a time-consuming task that requires a lot of manpower and effort [3]. Consequently, the efficiency of ontology-based information integration is limited by the degree of automation of ontology construction.

Ontology learning (OL) is an ontology construction approach based on machine learning techniques [4]. It was proposed to (semi-)automatically extract knowledge from text documents or databases in order to construct ontologies efficiently [5]. In recent years, there have been great technological advances in the fields of ontology construction, ontology mapping, and semantic integration, accompanied by the development of machine learning and computational intelligence [6]. Consequently, several novel approaches and techniques, e.g., automated ontology notation, dynamic ontology mapping, and ontology refinement, have been applied in the fields of machine translation and question answering systems [7]. In contrast to the aforementioned fields, the integration of information systems based on ontology learning is a new topic.

This survey paper focuses on how ontology learning can be adopted and play a vital role in the integration of information systems. The rest of the paper is structured as follows. First, the previous surveys on ontology-based information integration and ontology learning are summarized in Sect. 2. Then, the recent techniques and tools that support ontology learning from text and relational databases are presented in Sect. 3. After that, the possibility of using ontology learning in information integration is analyzed in Sect. 4, based on a mapping between the features of ontology learning and the bottleneck problems of ontology-based information integration. The potential directions of using ontology learning in information system integration are discussed in Sect. 5, and the paper is concluded in Sect. 6.

2 Summary of Previous Surveys

The previous surveys focused on the major bottlenecks of semantic integration, e.g., ontology mapping, formal representation and reasoning of mappings, from the perspective of ontology-based integration.

2.1 Ontology-Based Information Integration

Ontology-based information integration can achieve integration at the semantic level; hence, Noy [8] surveyed the ontology-based approaches to semantic integration. The survey concluded that automated mapping helps to alleviate the constraints of ontology-based information integration, and that heuristic-based approaches to ontology mapping, e.g., machine learning and ontology learning, should be studied further to improve the automation of ontology mapping.

Ontology-based information extraction is a critical component of the ontology-based integration framework, as it provides the source of the information and knowledge for constructing ontologies. Thus, Wimalasuriya et al. [9] surveyed and classified the existing ontology-based information extraction (OBIE) approaches from the technological perspective, e.g., linguistic rules, finite-state automata, classification, partial parse trees, web-based search, tools, and performance measures. They concluded that the existing approaches to information extraction mainly rely on linguistic rules that are identified manually. Besides, the availability of the existing methods for measuring performance is limited by the efficiency of identifying instance and property values.

Ontology mapping supports information integration by representing the relationship between the global ontology and the local ontologies; hence, it is also a critical technique for ontology-based information integration. Thus, Hooi et al. [10] surveyed the existing ontology mapping techniques and tools. They focused on the analysis of the existing mapping techniques and matching algorithms, and highlighted that the matcher is a core component of ontology mapping. They concluded that the majority of matchers are designed for a specific domain, which restricts the reusability of mapping tools.

2.2 Ontology Learning

The model of ontology learning is usually built based on the techniques from machine learning, NLP (Natural Language Processing) and information retrieval [11]. The techniques of ontology learning could be classified into the statistical approach, natural language processing approach, and integrated approach.

To investigate the existing techniques of ontology learning, Biemann [12] surveyed the techniques of ontology learning from unstructured text, e.g., clustering, distributional similarity, co-occurrence matrices, and decision trees. The survey concluded that the majority of the existing approaches to ontology learning from unstructured text use only nouns and ignore the relationships between various words and classes. The past decade has witnessed tremendous progress in machine learning and semantic web techniques. To investigate the recent techniques of ontology learning, Asim et al. [6] systematically classified the methodology of ontology learning into three categories: linguistic techniques, statistical techniques, and inductive logic programming. They compared the performance of the ontology learning techniques and reported that the accuracy of ontology learning based on inductive logic programming reached up to 96%.

The conclusion can be drawn that the aforementioned surveys on ontology-based information integration and ontology learning were mostly conducted separately; few works survey the opportunity of using ontology learning in information integration. However, in recent years, some bottlenecks of the traditional methods for constructing ontologies have emerged, i.e., they are time-consuming, error-prone, and subject to semantic loss, which brings unprecedented challenges to traditional ontology-based information integration. OL may provide a new perspective for tackling these issues; thus, this survey paper aims to investigate the potential opportunity of using ontology learning in information system integration.

3 Ontology Learning Techniques

The majority of ontology learning techniques were borrowed from NLP and data mining. The typical techniques for term and entity extraction originate from NLP, e.g., tagging, syntactic segmentation, and parsing. The alternative approaches for implementing NLP include machine learning and statistical inference. Moreover, the representative techniques for relationship extraction were proposed based on data mining algorithms, e.g., clustering algorithms, association rule mining, and occurrence analysis.

3.1 Ontology Learning from Text

The mainstream techniques of ontology learning from text can be classified into linguistic approaches, machine learning approaches, and combinations of the two. The representative works on ontology learning from text are summarized as follows.

To generate ontologies from the Web, Venu et al. [13] proposed a framework in which the terms and relations are extracted by using the HITS (Hyperlink-Induced Topic Search) algorithm and Hearst patterns, respectively. The Resource Description Framework (RDF) was adopted to store the extracted terms and their relations, and the ontology was then constructed based on the RDF. OWL (Web Ontology Language) is a formal language for representing ontologies, which provides richer semantic representation than RDF. Thus, Petrucci et al. [14] developed a system to translate natural language into description logic (DL) based on a neural network. Building on that work, Petrucci et al. [15] designed an ontology learning model based on a recurrent neural network (RNN) to extract OWL from textual documents. They focused on improving the performance of ontology learning, i.e., domain independence, accuracy, and so forth.
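As an illustration of how Hearst patterns yield is-a relations, the following minimal sketch matches two classic patterns ("X such as A, B and C" and "A, B and other X") with regular expressions. It is not the implementation of [13]: the function names, the restriction to single-word terms, and the example sentence are assumptions made for brevity.

```python
import re

# Separators between list items: ", and ", ", or ", ", ", " and ", " or ".
_SEP = r"(?:, and |, or |, | and | or )"
_LIST = rf"\w+(?:{_SEP}\w+)*"

# Simplified Hearst patterns (assumption: single-word terms only;
# real systems match full noun phrases found by a chunker).
SUCH_AS = re.compile(rf"\b(\w+),? such as ({_LIST})")
AND_OTHER = re.compile(rf"\b({_LIST}),? (?:and|or) other (\w+)")


def split_items(items: str) -> list[str]:
    """Split 'A, B and C' into ['A', 'B', 'C']."""
    return [t for t in re.split(_SEP, items) if t]


def extract_is_a(text: str) -> set[tuple[str, str]]:
    """Return (hyponym, hypernym) pairs found by the two patterns."""
    pairs = set()
    for m in SUCH_AS.finditer(text):
        hypernym, items = m.group(1), m.group(2)
        pairs.update((h, hypernym) for h in split_items(items))
    for m in AND_OTHER.finditer(text):
        items, hypernym = m.group(1), m.group(2)
        pairs.update((h, hypernym) for h in split_items(items))
    return pairs
```

Each extracted pair could then be stored as a triple (hyponym, is-a, hypernym), which corresponds to the RDF storage step described above.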

In addition to machine learning techniques, linguistic techniques were also utilized to construct ontologies. Rani et al. [16] studied a semi-automatic terminological ontology learning approach based on LSI (Latent Semantic Indexing) and SVD (Singular Value Decomposition). This approach can semi-automatically create a terminological ontology based on a topic modeling algorithm by using Protégé. To extract the terms and relations from cross-media text automatically, Hong et al. [17] proposed a domain ontology learning method based on the LDA (Latent Dirichlet Allocation) model. In this method, NLPIR (Natural Language Processing and Information Retrieval) and the LDA topic model were adopted to extract the terms and their relations, respectively.

To make ontology learning dynamic, Dutkowski et al. [18] disclosed a framework for ontology-based dynamic learning from text data. In this patent, inference techniques were adopted to extract the relations between entities from the data. Besides, statistical techniques, i.e., entity measurement and relation scoring, were applied to extend ontology learning from static learning to dynamic learning. Considering the weak interactivity of the existing algorithms, Ghosh et al. [19] built an experimental ontology learning platform based on Text2Onto for learning domain knowledge from text semi-automatically. In this work, the TF-IDF (Term Frequency-Inverse Document Frequency) concept extraction algorithm and a relation extraction algorithm based on Subcat Frames were adopted to extract the terms and their relations, respectively. The term and relation extraction techniques, together with the inputs and outputs of the aforementioned works, are summarized in Table 1.
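To make the TF-IDF concept extraction step concrete, the sketch below scores each term of each document with the textbook formula tf × idf and returns the highest-scoring terms as candidate concepts. It is a minimal illustration under stated assumptions, not the Text2Onto implementation: the tokenizer, the function names, and the unsmoothed idf variant are all choices made here.

```python
import math
import re
from collections import Counter


def tokenize(doc: str) -> list[str]:
    """Lower-case the text and keep alphabetic tokens only (an assumption)."""
    return re.findall(r"[a-z]+", doc.lower())


def tf_idf(docs: list[str]) -> list[dict[str, float]]:
    """Score every term of every document with tf * idf.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(docs: list[str], k: int = 3) -> list[list[str]]:
    """Return the k best-scoring terms per document (ties broken alphabetically)."""
    return [sorted(s, key=lambda t: (-s[t], t))[:k] for s in tf_idf(docs)]
```

Terms that occur in every document (such as "the") receive an idf of zero, so they are never proposed as domain concepts.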

Table 1. Summary of the techniques of ontology learning from text.

Based on the above summary, the conclusion can be drawn that the majority of the models for OL from text were built on machine learning techniques and linguistic techniques. The outputs of the models can be classified into three categories: formal ontology, semi-formal ontology, and information ontology. However, the existing ontology learning tools are only semi-automatic, because they are limited by the performance of the algorithms.

3.2 Ontology Learning from Relational Database

Relational databases (RDBs) are a major source of knowledge and can provide the conceptual model and the metadata model for constructing an ontology [20]. Hence, how to construct an ontology from an RDB efficiently and effectively has attracted the attention of researchers, and ontology learning from RDB has been investigated in recent years.

Constructing an ontology from an RDB based on ontology learning involves two critical phases. In the first phase, the RDB schema is usually transformed into RDFS (RDF Schema) based on DL and rule mapping. In the second phase, the semantic relationships are extracted and the ontology is generated from the RDB by using semantic measurement and machine learning. The specific techniques of ontology learning from RDB are depicted in Fig. 1.

Fig. 1. Techniques of ontology learning from RDB.
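The first phase, transforming an RDB schema into RDFS by rule mapping, can be sketched as follows. The three rules used here (table to class, column to property, foreign key to a property whose range is the referenced class) are a commonly used simplification and are an assumption, not the rule set of any cited work. The `ex:` prefix, the property naming scheme, and the function name are hypothetical, and the sketch reads the schema from SQLite only because that engine ships with Python.

```python
import sqlite3


def schema_to_rdfs(conn: sqlite3.Connection, base: str = "ex:") -> list[tuple[str, str, str]]:
    """Map a SQLite schema to RDFS-style triples with three simple rules."""
    triples = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    for table in tables:
        cls = f"{base}{table}"
        # Rule 1: every table becomes a class.
        triples.append((cls, "rdf:type", "rdfs:Class"))
        # foreign_key_list rows: (id, seq, table, from, to, ...)
        fks = {row[3]: row[2]
               for row in conn.execute(f"PRAGMA foreign_key_list('{table}')")}
        # table_info rows: (cid, name, type, notnull, dflt_value, pk)
        for row in conn.execute(f"PRAGMA table_info('{table}')"):
            column = row[1]
            prop = f"{base}{table}_{column}"
            # Rule 2: every column becomes a property of its table's class.
            triples.append((prop, "rdf:type", "rdf:Property"))
            triples.append((prop, "rdfs:domain", cls))
            # Rule 3: a foreign key points to the referenced class,
            # any other column holds a literal value.
            target = f"{base}{fks[column]}" if column in fks else "rdfs:Literal"
            triples.append((prop, "rdfs:range", target))
    return triples
```

For a schema in which `orders.customer_id` references `customer`, the output contains the triple `("ex:orders_customer_id", "rdfs:range", "ex:customer")`, which is the kind of relationship the second phase would then refine.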

The mainstream techniques of OL from RDB can be classified into four categories: reverse engineering, schema mapping, data mining, and machine learning. The corresponding works are described as follows. Considering the richer semantics of the conceptual model (ER model), Sbai et al. [21] utilized reverse engineering to analyze and transform the relational model into the conceptual model for building an ontology from an RDB. This method can recover the semantic information and database tables that are lost during the transformation.

There are two alternative solutions for constructing an ontology from an RDB schema: transforming the RDB to RDF and mapping the RDB to OWL. For the transformation from RDB to ontology, Dadjoo et al. [22] designed a method that consists of three steps: extracting the information (metadata) from the RDB, building an intermediate graph-based conceptual model, and creating the final ontology. As for the mapping method, Hazber et al. [23] proposed an approach for mapping a relational database into an ontology based on mapping rules. Moreover, several tools have been developed to support the mapping from RDB to ontology, e.g., DataMaster, KAON2, and RDBToOnto. To improve the efficiency of ontology construction, Aggoune [24] designed a semantic prototype based on a similarity metric for automatic ontology learning from RDB. In this prototype, the similarity measurement was employed to detect synonymy relations based on WordNet. However, because the RDB model does not store the semantic relationships among entities directly, automatic ontology learning from RDB has some limitations, i.e., it may identify incorrect semantic relationships between entities and ignore implicit relations. To tackle these issues, El Idrissi et al. [25] studied a novel approach to ontology learning from RDB based on semantic enrichment, in which a meta-model was introduced to augment the semantics of the RDB model. The case study shows that this approach can deduce relationships in various domains.

Not only the schema information but also the data information is represented in RDB SQL. Hence, a new paradigm of ontology learning from SQL scripts has been proposed in recent years. Hazber et al. [26] proposed a method for translating SQL algebra into SPARQL queries based on mapping rules. First, the RDB schema and data were transformed into RDF triples; after that, the RDF triples were translated into OWL.

Generally, ontology learning from RDB SQL consists of three phases: pre-processing, semantic enrichment, and transformation and mapping. Before the transformation and mapping, it is necessary to pre-process the RDB SQL. The main pre-processing techniques are parsing and lemmatization. Because the existing parsing methods ignore the structure of the database schema, two Text-to-SQL parsing methods were proposed based on a Graph Neural Network [27] and IRNet [28], respectively; they provide an essential theoretical foundation for automatically constructing ontologies through ontology learning from RDB SQL.

4 Use of OL in Information System Integration

When it comes to information system integration, there is a consensus that ontology-based integration is a useful approach. However, there are some bottleneck problems that influence the performance of integration, while ontology learning provides a new perspective to tackle these bottlenecks.

4.1 Statement of the Existing Challenges

With the increasing volume of heterogeneous data from various information systems, some bottleneck problems (BP) of ontology-based information integration have emerged in recent years. The corresponding questions can be summarized as follows:

BP1: How to improve the efficiency and effectiveness of the ontology construction?

BP2: How to preserve the integrity of the semantic information and avoid the semantic loss in the construction of ontology?

BP3: How to access the data from various DBMS (database management system) of the different information systems efficiently?

BP4: How to learn and generate the domain-related knowledge from the increasing (semi-)structured data of the various information systems?

4.2 Mapping the Features of OL to Bottleneck Problems

According to the above investigation of the OL, the features and strengths of the OL could be formulated as follows:

(Semi-)automatic. In contrast to the traditional methods of ontology construction, an OL-based approach can construct domain-related ontologies (semi-)automatically by learning the knowledge from the corresponding data. It minimizes the manpower required and improves the efficiency and effectiveness of ontology generation. The ontologies are constructed by extracting the entities and their relationships with machine learning and natural language processing techniques.

Active Learning. OL is a paradigm of active learning; hence, it is suitable for large-scale data sets. In active learning, the model selects an unlabeled item from the data set and presents it to the user to obtain its label, which improves the efficiency of learning [29]. Therefore, as the volume of data increases, the accuracy and semantic integrity of the OL model improve. More importantly, it is unnecessary to label all of the data manually, which creates an opportunity to tackle larger data sets.
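The selection step of active learning can be sketched with uncertainty sampling, one common query strategy: the model asks the user about the item whose predicted probability is closest to 0.5. This is a minimal sketch and not necessarily the strategy discussed in [29]. The `predict` and `ask_user` callbacks and the labeling budget are hypothetical placeholders for a real model and a real annotator.

```python
from typing import Callable, Iterable


def most_uncertain(probabilities: dict[str, float]) -> str:
    """Pick the item whose predicted probability is closest to 0.5."""
    return min(probabilities, key=lambda item: abs(probabilities[item] - 0.5))


def active_learning_loop(
    pool: Iterable[str],
    predict: Callable[[str, dict[str, bool]], float],
    ask_user: Callable[[str], bool],
    budget: int,
) -> dict[str, bool]:
    """Query the user for at most `budget` labels, most uncertain item first.

    `predict(item, labeled)` returns the current probability that the
    candidate is correct; `ask_user(item)` returns the true label.
    Both callbacks are hypothetical stand-ins.
    """
    unlabeled = list(pool)
    labeled: dict[str, bool] = {}
    for _ in range(min(budget, len(unlabeled))):
        probabilities = {item: predict(item, labeled) for item in unlabeled}
        item = most_uncertain(probabilities)
        labeled[item] = ask_user(item)  # only this item needs a manual label
        unlabeled.remove(item)
    return labeled
```

Because only the queried items are labeled by hand, the labeling effort is bounded by the budget rather than by the size of the data set.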

Semantic Integrity. The methods of OL from RDB can maximize the preservation of semantic integrity because the RDB implies strong semantic relationships among the original data. In particular, the RDB model can be converted into a conceptual model, which enriches the semantic relations among entities to some extent. Therefore, information integration based on OL can preserve the consistency and integrity of the semantics between the original data and the corresponding ontology.

Information Accessibility. The main data source of OL is the RDB, which is easy to access through an interface or a pipeline of the DBMS. Moreover, no special interface is required: if the interface or the pipeline is unavailable for some legacy information system, the data can be accessed from the DBMS directly via the RDB scripts.

To investigate the opportunity of using OL in information integration, the bottleneck problems of ontology-based information integration (OBII) and the features of OL are mapped in Table 2.

Table 2. Mapping the features of OL to the bottleneck problems of OBII.

As shown in Table 2, the aforementioned features of ontology learning map to the bottleneck problems of ontology-based information integration at many points. This result indicates that ontology learning provides an opportunity to tackle the above bottleneck problems.

5 Opportunity of Using OL in Information Integration

5.1 Summary of Existing Work

According to the results of the literature retrieval, only a small number of existing works address information integration based on ontology learning. These works can be summarized as follows.

First, the techniques of using ontology learning to integrate semantic web data were analyzed and illustrated by Xu et al. [30]. Then, an approach to smart data integration based on goal-driven ontology learning was proposed by Chen et al. [31]. In this approach, statistical methods and NLP techniques were utilized to extract the relations between entities, and a prototype for ontology learning was developed. In our previous work [32], a framework for using ontology learning to integrate legacy ERP systems was proposed, and the key steps of ontology learning for system integration were described.

Therefore, the conclusion can be drawn that the research on information integration based on ontology learning is in its early exploratory phase. Although some works have illustrated techniques and frameworks, specific problems still need to be investigated further.

5.2 Directions of Using OL in Information System Integration

Based on the above summary of existing works and the mapping results, the possible directions of using OL in information system integration are outlined as follows.

Ontology Learning from RDB SQL Scripts. The SQL scripts of an RDB are a kind of text document from which all entities and their semantic relationships can be inferred. Moreover, SQL scripts can be accessed easily via the DBMS or a database driver, and no special interface is required. Hence, based on ontology learning from SQL scripts, the heterogeneous information from various information systems can be accessed and integrated efficiently and effectively. Currently, some works [27, 28] have studied algorithms for pre-processing SQL scripts and transforming them to text, which provide the theoretical foundation for ontology learning from SQL scripts. Hence, it is meaningful to investigate the algorithms, models, and tools for ontology learning from SQL scripts.
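To illustrate that entities and relationships can be read directly from an SQL script, the sketch below parses `CREATE TABLE` statements with regular expressions and collects the tables, their columns, and the foreign-key relations. It is only a rough sketch under stated assumptions: it handles a simple subset of SQL DDL (table-level `FOREIGN KEY` clauses, no quoted identifiers, no comments), the function names are hypothetical, and it is unrelated to the neural parsers of [27, 28].

```python
import re

CREATE_TABLE = re.compile(
    r"CREATE\s+TABLE\s+(\w+)\s*\((.*?)\)\s*;", re.IGNORECASE | re.DOTALL)
FOREIGN_KEY = re.compile(
    r"FOREIGN\s+KEY\s*\(\s*(\w+)\s*\)\s*REFERENCES\s+(\w+)", re.IGNORECASE)
TABLE_CONSTRAINT = re.compile(
    r"(PRIMARY\s+KEY|UNIQUE|CHECK|CONSTRAINT)\b", re.IGNORECASE)


def split_top_level(body: str) -> list[str]:
    """Split a column list on commas that are not inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def learn_from_sql(script: str) -> dict:
    """Extract entities (tables with columns) and foreign-key relations."""
    entities, relations = {}, []
    for table, body in CREATE_TABLE.findall(script):
        columns = []
        for clause in split_top_level(body):
            fk = FOREIGN_KEY.search(clause)
            if fk:
                # (source table, referencing column, target table)
                relations.append((table, fk.group(1), fk.group(2)))
            elif not TABLE_CONSTRAINT.match(clause):
                columns.append(clause.split()[0])
        entities[table] = columns
    return {"entities": entities, "relations": relations}
```

In an OL pipeline, each entity would be a candidate concept and each relation a candidate object property; a production system would use a full SQL parser instead of regular expressions.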

Ontology Learning from NoSQL Database. In several fields, driven by business requirements, an increasing number of NoSQL databases have been built for storing unstructured data and real-time data. Consequently, there are various kinds of NoSQL databases, e.g., document databases, graph databases, and object databases. For graph databases in particular, it is easy to extract the terms and relations because the graph already implies the potential relations between different objects. Moreover, some works have focused on ontology learning from NoSQL databases [33, 34], which makes it possible to integrate unstructured information based on ontology learning from NoSQL databases. Therefore, integrating unstructured information based on ontology learning from NoSQL databases is an interesting direction.

6 Conclusion

This paper surveyed the latest developments in ontology learning to highlight its possible applications in information system integration. The existing surveys of ontology-based information integration and ontology learning were summarized, and the recent techniques and tools for ontology learning from text and from RDB were investigated. Besides, the current challenges of ontology-based information integration were discussed, and the features and strengths of using ontology learning in information system integration were identified. Finally, the opportunity of using ontology learning in information integration was outlined by pointing out two research directions: ontology learning from RDB SQL scripts and ontology learning from NoSQL databases.