1 Introduction

Ontology construction from text is a procedure that involves analysing the collected text for a specific domain; identifying the relevant terms, concepts and their relationships; mapping and representing the ontology in a representation language [e.g. OWL (Web Ontology Language), RDF (Resource Description Framework), or RDFS (Resource Description Framework Schema)]; and finally evaluating the constructed ontology. In general, ontology construction can be done in one of three ways: manual construction; cooperative construction (requiring human intervention during the construction process); and (semi-) automatic construction. Ontology Learning (OL) from text is a process of (semi-) automatic ontology construction from text.

In recent years, many OL approaches and systems that try to automate the construction of ontologies have emerged. OL systems change the way of processing textual data from text mining to knowledge mining. This knowledge has to be represented in the form of concepts and relationships between those concepts (ontologies) to be in machine-understandable form.

The purpose of this paper is to review ontology construction approaches, challenges, and systems, and to explain the importance of moving towards DL rather than shallow learning. In addition, it discusses and illustrates how to enhance the process of ontology construction from text by using DL. The rest of this paper is organized as follows: Sect. 2 defines ontology construction and its challenges. Section 3 presents the ontology construction layers and process, while Sect. 4 presents the ontology construction approaches. Section 5 presents the ontology construction systems, a comparison among them, and the evaluation metrics of ontology construction systems. Section 6 explains and discusses the power and importance of DL for ontology construction. Section 7 presents and explains how to use DL for the ontology construction process. Finally, we conclude in Sect. 8.

2 Ontology construction

2.1 Formal definition

2.1.1 Ontology

According to the W3C (World Wide Web Consortium), “Ontologies define the terms used to describe and represent an area of Knowledge”. An ontology is a data model that represents a set of concepts and the relationships among those concepts within a domain (Mishra and Jain 2015). Zouaq (2011) defined the components of an ontology by the following tuple:

$$ O = < C, H, R, A > $$

where \( O \) represents ontology, \( C \) represents a set of classes (concepts), \( H \) represents a set of hierarchical links between the concepts (taxonomic relations), \( R \) represents the set of conceptual links (non-taxonomic relations), and \( A \) represents the set of rules and axioms.
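
To make the tuple concrete, the following minimal sketch encodes a tiny \( O = < C, H, R, A > \) example in RDF/OWL using the rdflib Python library; the namespace, the class names, and the livesIn property are illustrative assumptions, not part of any cited system.

```python
# A minimal sketch of encoding the O = <C, H, R, A> tuple with rdflib.
# The namespace, class names, and the livesIn relation are illustrative assumptions.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS, OWL

EX = Namespace("http://example.org/onto#")
g = Graph()
g.bind("ex", EX)

# C: classes (concepts)
for concept in ("Fish", "Shark", "Habitat"):
    g.add((EX[concept], RDF.type, OWL.Class))

# H: taxonomic (hierarchical) links
g.add((EX.Shark, RDFS.subClassOf, EX.Fish))

# R: non-taxonomic (conceptual) links, modelled as an object property
g.add((EX.livesIn, RDF.type, OWL.ObjectProperty))
g.add((EX.livesIn, RDFS.domain, EX.Fish))
g.add((EX.livesIn, RDFS.range, EX.Habitat))

# A: a simple axiom, e.g. class disjointness
g.add((EX.Fish, OWL.disjointWith, EX.Habitat))

print(g.serialize(format="turtle"))
```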

2.1.2 Ontology construction

Ontology construction can be defined as an iterative process of creating an ontology from scratch or reusing an existing ontology for enriching or populating. The process of constructing the ontology includes six tasks as follows:

  1. Specifying the domain to create well-defined terms and concepts.

  2. Identifying the key terms, concepts, and their relations in the domain.

  3. Establishing or inferring the rules and axioms that describe the structural properties of the domain.

  4. Encoding (representing) the constructed ontologies by using representation languages which support the ontology, such as RDF, RDFS or OWL.

  5. Combining the constructed ontologies with existing ontologies (if available).

  6. Evaluating the constructed ontologies by using generic and specific evaluation metrics.

The numbering of these tasks does not indicate their order of execution, and some of them may be carried out iteratively. The ontology construction process may be done in one of the following three ways:

  1. Manual construction: experts perform the construction of the ontology manually.

  2. Cooperative construction: most or all tasks of the ontology construction system are supervised by experts.

  3. (Semi-) Automatic construction: the ontology construction process is performed automatically with limited intervention by users or experts. Automatic construction means that the level of human intervention is slightly lower than in semi-automatic construction, but it does not mean fully automatic construction. It is worth mentioning that fully automatic ontology construction by a system is still a significant challenge and is not likely to be possible (Maimon and Browarnik 2015; Wong et al. 2012).

2.2 Challenges in ontology construction

A closer look at many ontology construction studies shows a consensus on several aspects of ontology construction that remain challenging and require further effort. The following list presents the common aspects that define the main challenges of automatic ontology construction:

  1. Fully automatic construction of ontologies may not be possible, but there is an acute need for further effort to reduce human intervention in the ontology construction process, i.e. to build (semi-) automatic systems rather than the existing cooperative systems (Buitelaar et al. 2005; Gómez-Pérez and Manzano-Macho 2003; Maimon and Browarnik 2015; Mishra and Jain 2015; Wong et al. 2012; Zhou 2007).

  2. Noise terms (irrelevant or overly general terms) lead to unnecessary additional pre-processing effort and need to be avoided. This could be addressed by filtering terms as early as possible in the ontology construction process, i.e. in the early stages of construction (Wong et al. 2012; Zhou 2007).

  3. Discovering relations between concepts still yields unsatisfactory results, and more effort is needed in this area (Albukhitan et al. 2017; Buitelaar et al. 2005; Maimon and Browarnik 2015; Wong et al. 2012; Zhou 2007).

  4. Axiom learning is still at an initial stage in existing ontology construction systems and requires substantial work (Maimon and Browarnik 2015; Mishra and Jain 2015; Wong et al. 2012; Zhou 2007).

  5. The transformation of data, whether from small static text collections or from massive heterogeneous sources on the World Wide Web, should be taken into account when designing an ontology construction system (Buitelaar et al. 2005; Mishra and Jain 2015; Wong et al. 2012; Zhou 2007).

  6. Building a standard platform to evaluate ontology construction systems is still a hard task (Albukhitan et al. 2017; Buitelaar et al. 2005; Wong et al. 2012; Zhou 2007).

The validity of these aspects and challenges will become evident from the discussion of ontology construction approaches and prominent systems in the later sections.

3 Ontology construction via learning

OL is a process of automatically or semi-automatically creating a new ontology, or reusing an existing ontology for enriching or populating, with minimal human effort (Gillani Andleeb 2015). OL from text is a process of acquiring knowledge from text by applying methods and techniques from various fields, such as natural language processing (NLP), data mining, and machine learning, to extract ontological elements (Maimon and Browarnik 2015) and then constructing ontologies.

The Ontology Learning Layer Cake was proposed by Buitelaar et al. (2005). It is the dominant approach and is considered the cornerstone of OL (Maimon and Browarnik 2015). According to Buitelaar et al. (2005), there are six layers in OL: Terms, Synonyms, Concepts, Concept Hierarchies, Relations, and Rules. Based on these layers, the process of OL can be divided into six sub-tasks as follows:

  • Term extraction is a prerequisite for all aspects of OL from text. A term is a single-word or multi-word token that denotes a specific meaning in a given domain.

  • Synonym discovery aims to find the terms that indicate the same concept and appear in the same set for a selected concept. A synonym set is the most common type of lexical relation. It is possible to use a readily available resource such as WordNet synsets, clustering techniques, or other similar methods (Maimon and Browarnik 2015); see the sketch after this list.

  • Concept formation is considered an ill-defined task because there is no consensual definition of what a concept is. A general definition is that a concept should include the intension of the concept, the extension of the concept, and the lexical signs (terms) used to refer to it.

  • Concept hierarchy induction is the backbone of OL; it includes inducing, extending and refining the concept hierarchy. This task aims to build a hierarchical taxonomy of concepts.

  • Relation discovery aims to extract novel relationships between known concepts. This task is still an open problem (Gillani Andleeb 2015; Maimon and Browarnik 2015; Mishra and Jain 2015; Wong et al. 2012). There are a few approaches that address relation extraction for OL from text, such as the sentence-based association rule extraction algorithm proposed in Maedche and Staab (2000).

  • Rule or axiom extraction is the final sub-task in the OL process. It aims to infer rules based on the extracted concepts and relations. This task is still at an initial stage and needs more effort (Gillani Andleeb 2015; Maimon and Browarnik 2015; Mishra and Jain 2015; Wong et al. 2012; Zhou 2007). There have been very few attempts to generate rules and axioms in existing OL systems. So far, logic-based approaches can be used for this task, as in (Fleischhacker and Völker 2011; Oliveira et al. 2001).
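
As a concrete illustration of the synonym discovery sub-task, the following minimal sketch looks up WordNet synsets through NLTK; it assumes the WordNet corpus has already been downloaded, and the word "car" is used purely as an example.

```python
# A minimal sketch of synonym discovery using WordNet synsets via NLTK.
# Assumes the WordNet corpus is available (nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

def synonyms(term, pos=wn.NOUN):
    """Collect lemma names that share a synset with the given term."""
    result = set()
    for synset in wn.synsets(term, pos=pos):
        for lemma in synset.lemmas():
            result.add(lemma.name().replace("_", " "))
    result.discard(term)
    return sorted(result)

print(synonyms("car"))  # e.g. ['auto', 'automobile', 'machine', 'motorcar', ...]
```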

4 Ontology learning approaches

Approaches of OL from text can be divided into two categories: linguistics-based approaches and machine learning (ML) approaches (statistic-based and logic-based approaches). The following sub-sections present the details of these approaches.

4.1 Linguistics-based approaches

  • Pattern-based extraction (Morin 1999): It is used to recognize relations by matching patterns against sequences of words in the text. Lexico-syntactic patterns and semantic templates are techniques under this approach. The lexico-syntactic patterns technique uses predefined patterns such as “NP is a type of NP” to extract hypernym and meronym relations. Semantic templates are similar to lexico-syntactic patterns but have more detailed rules and conditions, and they have also been used to extract non-taxonomic relations. It is well known that these approaches have reasonable precision but very low recall (Maimon and Browarnik 2015; Wong et al. 2012); a minimal pattern-matching sketch is given after this list.

  • POS tagging and sentence parsing (Abney 1997): This is considered a rule-based approach. Part-Of-Speech (POS) tagging is used to assign a part of speech to each word in the text, such as noun, verb, adjective, etc., while the sentence parser is used to recover complete and exact parses for each sentence in the text. However, many words are ambiguous (e.g. in English, the word “plant” may be a noun or a verb), so certain parsers are built on statistical parsing, such as the Stanford Parser (Klein and Manning 2003). Statistical parsing is based on the probability of a certain tag occurring given the various possibilities. This approach is used for term extraction.

  • Syntactic structure analysis and dependency structure analysis (Gamallo et al. 2002; Nivre 2004): These are used to uncover the syntactic and dependency structure of terms and relations at the sentence level. Syntactic structure analysis examines the words and modifiers in syntactic structures such as noun phrases and/or verb phrases to discover potential terms and relations while ignoring other phrases. Dependency structure analysis uses grammatical relations (e.g. subject, object, and complement) to determine more complex relations. However, these approaches may be inadequate on their own; they need to be combined with other algorithms and/or rules for better performance (Mudhsh et al. 2015). This approach is useful for term and concept extraction as well as relation discovery. For example, concepts can be extracted based on term dependencies within a noun phrase, while relations can be extracted based on term dependencies within a verb phrase.
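
The following minimal sketch illustrates the pattern-based (Hearst-style) extraction idea with two simple regular-expression patterns; the patterns, the restriction to single-word noun phrases, and the example sentence are simplifying assumptions rather than the patterns used by any cited system.

```python
# A minimal regex sketch of lexico-syntactic (Hearst-style) pattern matching for
# hypernym extraction; single-word noun phrases are a simplifying assumption.
import re

SUCH_AS = re.compile(r"(\w+) such as ((?:\w+(?:, | and | or ))*\w+)")
IS_A_TYPE_OF = re.compile(r"(\w+) is a type of (\w+)")

def extract_hypernym_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the two patterns."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym, hyponym_list = match.group(1), match.group(2)
        for hyponym in re.split(r", | and | or ", hyponym_list):
            pairs.append((hyponym, hypernym))
    for match in IS_A_TYPE_OF.finditer(sentence):
        pairs.append((match.group(1), match.group(2)))
    return pairs

print(extract_hypernym_pairs("predators such as sharks, orcas and barracudas hunt in packs"))
# [('sharks', 'predators'), ('orcas', 'predators'), ('barracudas', 'predators')]
```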

4.2 Machine learning approaches

ML approaches can be divided into two kinds: statistical-based approaches and logic-based approaches.

4.2.1 Statistic-based approaches

  • Co-occurrence analysis (Budanitsky 1999): It is used to identify lexical units that tend to occur together, for purposes ranging from extracting related terms to discovering implicit relations between concepts; a counting sketch is given after this list.

  • Association rules (Maedche and Staab 2000): They are used to extract non-taxonomic relations between concepts by using a small amount of seed knowledge as background (e.g. using a concept hierarchy as background).

  • Heuristic and conceptual clustering (Faure and Nédellec 1998; Faure and Poibeau 2000a): It is used to group concepts based on the semantic distance between them to build hierarchies. Formal Concept Analysis (FCA) is one method under this approach; it uses a conceptual clustering technique to provide intensional descriptions for abstract concepts or data units (Cimiano et al. 2005; Drymonas et al. 2010).

  • Ontology pruning (Kietz and Maedche 2000): It is used to build a domain-relevant ontology by using heterogeneous sources (e.g. comparing domain sources with generic sources to determine which concepts are more relevant to the specific domain and which concepts are general).
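
The following minimal sketch illustrates sentence-level co-occurrence counting over a toy corpus; the corpus, the whitespace tokenizer and the frequency threshold are illustrative assumptions (a real system would also filter stop words and use larger windows or statistical association measures).

```python
# A minimal sketch of sentence-level co-occurrence analysis for finding related terms.
from collections import Counter
from itertools import combinations

corpus = [
    "sharks are cartilaginous fish",
    "rays are cartilaginous fish that live near the seabed",
    "salmon are bony fish",
]

pair_counts = Counter()
for sentence in corpus:
    tokens = sorted(set(sentence.lower().split()))
    for a, b in combinations(tokens, 2):
        pair_counts[(a, b)] += 1

# Term pairs that co-occur in more than one sentence are candidate related terms.
# Note the stop-word pairs: a real pipeline would filter them out first.
related = [pair for pair, count in pair_counts.items() if count > 1]
print(related)  # [('are', 'cartilaginous'), ('are', 'fish'), ('cartilaginous', 'fish')]
```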

4.2.2 Logic-based approaches

  • Inductive Logic Programming (Zelle and Mooney 1993): It is used to derive rules from positive and negative examples over an existing collection of concepts and relations. For example, from the positive examples “cats have fur”, “dogs have fur”, and “tigers have fur”, the generalization “mammals have fur” is generated. Then, from the negative example “humans do not have fur”, the generalization “mammals have fur” is dropped and the system deduces that only “canines and felines have fur”. However, this approach depends on good rule templates predefined by an expert. For instance, if there are no good negative examples, an invalid rule may be generated. A considerable disadvantage of this approach is that the search process can sometimes prune valid hypotheses (Boytcheva 2002).

  • Logical inference (Shamsfard and Barforoush 2004): It is used to infer implicit relations from existing ones. For example, from “Steven is a man” and “all men are mortal”, the relation “Steven is mortal” is inferred; a minimal forward-chaining sketch is given after this list. However, with this approach there is a high possibility of introducing conflicting and/or invalid relations and rules (Wong et al. 2012). For example, “human eats fish” and “fish eats worms” could generate an invalid new relation. In addition, it can generate only very basic relations most of the time.
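
The following minimal sketch illustrates the logical inference idea with a naive forward-chaining loop over is-a triples; the facts and the single transitivity rule are illustrative assumptions.

```python
# A minimal forward-chaining sketch of the logical-inference approach:
# transitive is-a relations over a toy set of facts.
facts = {("steven", "is_a", "man"), ("man", "is_a", "mortal")}

def infer(facts):
    """Repeatedly apply is-a transitivity until no new triples are produced."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(inferred):
            for (c, r2, d) in list(inferred):
                if r1 == r2 == "is_a" and b == c and (a, "is_a", d) not in inferred:
                    inferred.add((a, "is_a", d))
                    changed = True
    return inferred - facts

print(infer(facts))  # {('steven', 'is_a', 'mortal')}
```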

To conclude this section, Table 1 summarizes the discussed OL approaches and the OL tasks corresponding to these approaches.

Table 1 Ontology learning approaches and their corresponding tasks

5 Ontology learning and construction systems

5.1 Tools or systems

Over the last two decades, many semi-automatic or automatic OL systems (or tools) have been developed. These systems try to enhance the OL process to make it more effective and efficient. Park et al. (2010) divided OL tools into three types as follows: ontology editing tools, which help ontology engineers acquire, visualize, and organize domain knowledge; ontology merging tools, which combine two or more existing ontologies to construct one coherent ontology; and ontology extraction tools (also called automatic ontology construction tools), which try to extract concepts and/or relations by using NLP and/or machine learning techniques.

As ontology extraction tools play a more promising role in automating ontology construction, our discussion in this section focuses on them. Table 2 shows a comparison among ten prominent automatic ontology construction systems that take into account the discovery of any type of relation (taxonomic or non-taxonomic).

Table 2 A Comparison of the Ten Prominent Ontology Learning Systems

5.1.1 Asium

“Acquisition of Semantic knowledge Using Machine learning methods” (ASIUM) (Faure and Nédellec 1998; Faure and Poibeau 2000b) is a semi-automated OL system that aims to learn taxonomic relations. ASIUM can also be considered an ontology editing tool because it is designed to help experts acquire semantic knowledge from technical domains. The learning method in ASIUM is based on heuristic and conceptual clustering. The basic clusters are formed by words that occur with the same verb after the same preposition (e.g. “ballpoint pen” and “pencil” are adjuncts of the verb “to write” and may occur after the preposition “by” or “with”).

5.1.2 Text-to-Onto

Text-to-Onto (Maedche and Volz 2001) is a semi-automated system that builds a domain ontology from an initial core ontology by using data mining and NLP. Text-to-Onto uses statistics-based approaches to learn the ontology. The concepts are formed by using formal concept analysis, such as co-occurrence analysis, with no additional information required. Lexico-syntactic patterns are used for hypernym (taxonomic relation) extraction, while the association rules approach is used for non-taxonomic relation extraction. POS tagging and syntactic structure analysis are used for term extraction. The system uses the pruning approach by comparing domain sources with generic sources to determine the domain-relevant concepts. Text-to-Onto takes German web data (e.g. HTML free text, dictionaries) as input.

5.1.3 Text2Onto

Text2Onto (Cimiano and Völker 2005) is a redesign of the Text-To-Onto system. Text2Onto applies different measures, such as tf-idf and c-value/nc-value, to find the relevance of a term with respect to the corpus. There is no significant difference between Text2Onto and Text-To-Onto in terms of the OL methods. The user interface is friendlier in Text2Onto, e.g. the translation of the extracted ontologies to ontology languages such as OWL and RDFS is easier and the ontology experts (or users) have more control.

5.1.4 HASTI

HASTI (Shamsfard and Barforoush 2003, 2004) is considered an automatic OL system that tries to build dynamic ontologies from scratch (it uses a small kernel of primitive concepts as initial input). HASTI uses logic-based, linguistics-based and statistics-based approaches for the OL process. Lexico-syntactic patterns and semantic templates are used for concept extraction. Semantic templates, heuristic clustering analysis and logical inference are used for taxonomic and non-taxonomic relation extraction. The limitation of the conceptual hierarchy method in this system is that each intermediate node has at most two children (Wong et al. 2007). In addition, HASTI is one of the few systems that try to learn axioms; it uses inductive logic programming for axiom learning. However, the learned axioms are mostly very general.

5.1.5 SYNDIKATE

SYnthesis of DIstributed Knowledge Acquired from TExts (SYNDIKATE) (Hahn and Marko 2002; Hahn and Romacker 2001) is an automatic system for acquiring knowledge from real-world text. SYNDIKATE automatically bootstraps its domain knowledge as text analysis proceeds, through a learning module. SYNDIKATE uses syntactic structure analysis, dependency analysis, and semantic templates for OL. Syntactic structure and dependency analysis are used for term and concept extraction. Semantic templates as well as dependency analysis are used for taxonomic and non-taxonomic relation extraction. SYNDIKATE is directed more towards the evolution and maintenance of ontologies and other knowledge sources than towards constructing an ontology from scratch. Its shortcomings are that it can generate too many hypotheses, which becomes unmanageable, and that the calculations involved are resource-demanding. It is also somewhat unclear what the actual output of the system is and how that output can be used. The result of this system may be viewed more as assistance for constructing an initial ontology than as a finished ontology.

5.1.6 DODDLE-II

DODDLE-II (Nakaya et al. 2002) is a development of the Domain Ontology rapiD DeveLopment Environment (DODDLE) (Sekiuchi et al. 1998). DODDLE is an OL system that uses an existing machine-readable dictionary (MRD) for OL, while DODDLE-II uses a domain-specific English text corpus as well as an existing MRD for acquiring taxonomic and non-taxonomic relationships between concepts. DODDLE-II uses co-occurrence analysis (4-grams) and an association rules algorithm for extracting taxonomic and non-taxonomic relations.

5.1.7 TextStorm and Clouds

TextStorm and Clouds (Oliveira et al. 2001; Pereira et al. 2000) is a semi-automated OL system that aims to build a taxonomy. It consists of two main modules: TextStorm extracts relations between concepts, and Clouds completes these relations and infers rules. The TextStorm and Clouds system uses logic-based and linguistics-based approaches to perform OL. The logical inference approach (using binary predicates) is used for taxonomic and non-taxonomic relation extraction, while the inductive logic programming approach is used for axiom learning. POS tagging and syntactic structure analysis are used for term extraction.

5.1.8 CRCTOL

Concept-Relation-Concept Tuple based Ontology Learning (CRCTOL) (Jiang and Tan 2005; Jiang and Tan 2010) is an OL system that aims to extract key concepts and find the semantic relations between them. CRCTOL uses linguistics-based and statistics-based approaches to perform OL. Multi-word terms in the form of nouns and noun phrases, predefined POS tagging, and syntactic structure analysis are used to extract terms and concepts. Lexico-syntactic patterns and syntactic structure analysis are used to extract taxonomic relations. For extracting non-taxonomic relations, tuples of the form \( < noun1 > < verb > < noun2 > \) are adopted. One shortcoming of this system is that the lexicon of domain-specific terms is built and maintained manually. Another shortcoming is that the system observes only general concepts and ignores whole-part relations, which are likewise important in ontology construction (Gillani Andleeb 2015). Furthermore, the resulting ontology is based on domain-specific documents, so it is not an accurate and comprehensive representation of the given domain; such ontologies may not be useful for applications in a domain different from that of the knowledge base (Gillani Andleeb 2015).

5.1.9 OntoCmaps

OntoCmaps (Zouaq et al. 2011a, b) is an OL system that tries to extract deep semantic representations in the form of concept maps. According to the authors, OntoCmaps generates rich conceptual representations in the form of concept maps and filters the important concepts and relationships. The system uses linguistics-based and statistics-based approaches to perform OL. Dependency structure analysis and POS tagging are used to represent the sentences; the patterns are then divided into conceptual patterns and hierarchical patterns, after which filtering metrics are used to filter the concept maps. A strength of this system is that it does not rely on any predefined template for its semantic representation. However, the system does not formally specify the OL requirements in its design (Herrera 2014), which is considered one of its shortcomings. In addition, the system provides little documentation of the supported OL process, and ontology evaluation details are also lacking (Herrera 2014).

5.1.10 ProMine

ProMine (Gillani Andleeb 2015) is a semi-automated OL system for building domain ontologies for business processes/organizations. The main aim of this system is to extract the semantic concepts that are most relevant to a domain. The ProMine system includes three steps: first, terms and concepts are extracted; then the extracted concepts are filtered to find the most relevant terms of the domain; and finally, a semantic concept categorization is built. The system uses linguistics-based and statistics-based approaches to perform OL. Linguistic techniques such as POS tagging and frequency counts are used to extract concepts, after which filtering measures are applied. Then, shortest path length and depth-of-concept methods are used to build the semantic concept categories. Lastly, semantic lexical databases are used to enrich these categories. According to the authors, one limitation of this system is that it extracts only two-word (compound) concepts, although three words can also represent a concept; this limitation is to be addressed in the next iteration of ProMine. The other limitation is that the system does not take non-taxonomic relation extraction into account.

5.2 Evaluation metrics

This section presents five important evaluation measures for an OL system. Precision, recall and F-measure are the main performance measures used for evaluating an OL system, and most existing OL systems use them. Precision (P) is the fraction of selected items that are relevant, while recall (R) is the fraction of relevant items that are selected. The F-measure (harmonic mean of precision and recall) is an aggregated performance score for evaluating algorithms and systems.

$$ P = \frac{TP}{TP + FP} $$
(1)
$$ R = \frac{TP}{TP + FN} $$
(2)
$$ F = 2 \times \frac{P \times R}{P + R} $$
(3)

where \( TP \) is the number of true positives, \( FP \) is the number of false positives and \( FN \) is the number of false negatives. True positives are extracted (selected) items that are relevant. False positives are extracted items that are irrelevant, while false negatives are relevant items that were not extracted.
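
The following minimal sketch computes Eqs. (1)–(3) from sets of extracted and gold-standard items; the example sets are illustrative.

```python
# A minimal sketch of precision, recall and F-measure (Eqs. 1-3) over item sets.
def precision_recall_f1(extracted, relevant):
    tp = len(extracted & relevant)   # extracted items that are relevant
    fp = len(extracted - relevant)   # extracted items that are irrelevant
    fn = len(relevant - extracted)   # relevant items that were missed
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f

extracted = {"fish", "shark", "habitat", "water"}
relevant = {"fish", "shark", "habitat", "fin"}
print(precision_recall_f1(extracted, relevant))  # (0.75, 0.75, 0.75)
```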

Furthermore, there are two additional measures that are missing from the evaluation process of most existing OL systems. Ontological Improvement (\( OImp \)) counts the newly discovered concepts that are absent from the benchmark, and Ontological Loss (\( OLoss \)) counts the concepts that exist in the benchmark but were not discovered. These two additional measures were suggested in (Sabou et al. 2005) and are defined as follows:

$$ OImp = \frac{{\left| {C_{d} \backslash C_{m} } \right|}}{{\left| {C_{m} } \right|}} $$
(4)
$$ OLoss = \frac{{\left| {C_{m} \backslash C_{d} } \right|}}{{\left| {C_{m} } \right|}} $$
(5)

where \( C_{d} \) is the set of discovered concepts and \( C_{m} \) is the set of recommended (benchmark) concepts.
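
A corresponding minimal sketch of Eqs. (4) and (5), with illustrative concept sets:

```python
# A minimal sketch of ontological improvement and loss (Eqs. 4-5).
def oimp_oloss(discovered, benchmark):
    oimp = len(discovered - benchmark) / len(benchmark)   # new concepts absent from the benchmark
    oloss = len(benchmark - discovered) / len(benchmark)  # benchmark concepts that were missed
    return oimp, oloss

discovered = {"fish", "shark", "ray", "plankton"}
benchmark = {"fish", "shark", "ray", "habitat"}
print(oimp_oloss(discovered, benchmark))  # (0.25, 0.25)
```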

5.3 Comparison of learning elements and approaches

Table 2 presents a comparison between ten prominent OL tools that consider relation extraction. The comparison includes the input type, the learned elements, and the approaches used for each learned element. It also includes the evaluation metrics used to evaluate each system, where R refers to recall, P to precision, and F to the harmonic mean (F-measure). As shown in Table 2, ASIUM and TEXT-TO-ONTO do not report an overall figure of recall and precision for the whole OL system. HASTI, CRCTOL and OntoCmaps report recall and precision for some sub-tasks but not for the whole system. TextStorm and Clouds report an average figure of correctness, which has been assumed to represent the F-measure (harmonic mean). In addition, the comparison presents the output format of the ontology as well as the level of human intervention in each system.

6 Shallow versus deep learning for ontology construction

In this paper, the term shallow learning system for OL refers to OL systems that use traditional ML and/or traditional (artificial) neural networks (ANNs/NNs); the term shallow learning refers to ANNs, while the term DL refers to deep neural networks.

6.1 Shallow learning

Based on the aforementioned literature on existing shallow OL approaches and systems, it can be concluded that they have many shortcomings and drawbacks. For example, most of these systems depend on human intervention in all or most of their tasks (e.g. concept formation, relation discovery, etc.), such as the Text-to-Onto, Text2Onto, and TextStorm and Clouds systems. In addition, most existing shallow learning systems do not infer new relations or axioms, and some of them (e.g. Text2Onto and CRCTOL) use predefined templates for taxonomic and non-taxonomic relation extraction, which leads to very low recall.

In addition, most of these shallow learning systems for OL work with a small dataset and/or only the domain's dataset. For instance, Text-to-Onto used only 21 web articles as its input dataset, while CRCTOL is based on selected documents of the domain, so the constructed ontology was not a comprehensive and accurate representation of the given domain (Gillani Andleeb 2015).

The main challenges faced by OL systems are relation discovery and axiom learning. There are many works in the information extraction (IE) field that try to extract semantic relations between pairs of concepts or named entities. The following sub-sections explore some of the important recent advances of shallow systems in the relation discovery and axiom learning tasks.

6.1.1 Relation discovery

Sombatsrisomboon et al. (2003) suggested a simple method for discovering taxonomic relations between pairs of terms by using search engines. The study used only the pattern “NP is a/an NP” for searching. However, the proposed method often fails to acquire hypernyms of general nouns, as the authors stated (Sombatsrisomboon et al. 2003). Specia and Motta (2006) introduced a hybrid of existing tools that used data mining and linguistic techniques for extracting semantic relationships between pairs of named entities.

Sánchez and Moreno (2008) proposed unsupervised methods to discover non-taxonomic relations by using domain-relevant verb phrases to learn domain patterns, and then using statistical and linguistic analysis together with the learned domain patterns to extract non-taxonomic relations. Liu et al. (2008) developed a technique named Catriple, a system for extracting triples automatically by using Wikipedia's categorical system. In this approach, syntactic rules and sentence parsers are used to extract the explicit values and attributes of the category names.

Suchanek et al. (2007) developed a new ontology named YAGO (Yet Another Great Ontology). YAGO is built on top of both WordNet and Wikipedia. In this study (Suchanek et al. 2007), the authors exploited the fact that Wikipedia has category pages (lists of articles that belong to a specific category), rather than using information extraction methods, to leverage the knowledge of Wikipedia. In 2007, YAGO contained more than 1 million entities and 5 million facts. The semantic relations included the Is-A hierarchy and non-taxonomic relations between entities. However, it only used structured data for building the ontology. The main disadvantage of this approach is that space can be wasted if not all arguments of n-ary facts are known. YAGO was succeeded by YAGO2 in 2012. YAGO2 is an ontology based on WordNet, Wikipedia, and GeoNames (a geographical database that contains over eight million place names and covers all countries). YAGO2 contained more than 10 million entities and more than 120 million facts.

Etzioni et al. (2008) and Zhang et al. (2016) used Conditional Random Fields (CRFs) for extracting attributes or information. In (Etzioni et al. 2008), the authors developed the TextRunner tool for extracting information across different domains from the Web based on a CRF model. Zhang et al. (2016) proposed the Simultaneously Entity and Relationship Extraction (SERE) model, based on CRFs, to extract binary relationships from unstructured text. The authors combined IOB2 (named entity) tags with some additional defined tags to build the training file. They stated that the study focused only on relationships between two named entities within one sentence, without considering relationships between named entities across sentences.

El-Kilany et al. (2017) proposed an unsupervised clustering-based relation extraction method to construct a dataset of relations and then used the constructed extraction templates to generate and extract readable relations from collections of news data. The study used the Stanford Named Entity Tagger to extract named entities of three types (person, organization, and location). The HITS algorithm was then applied to obtain the importance of actor entities and sentences. The authors concluded that if an importance score was at least 75%, then two actor entities in a sentence had a relation. In this case, the Stanford Parser was used to parse each sentence with a relation to produce the relevant dependency graph, and the nodes on the shortest path between every two entities were used to determine whether the sentence matched one of the extracted templates. The proposed method gave better recall (R) but lower precision (P) compared to traditional methods.

Another useful study related to OL is Herrera (2014). This study aimed to improve the knowledge management (KM) process based on OL. The suggested method used previously developed ontologies, databases, and documents to recover the demanded knowledge through the OL process. The study was more aligned with ontology enrichment and population than with ontology construction.

Despite all the research works and efforts of shallow analysis in relation discovery, the results are still less than satisfactory and more effort is needed to improve them (Albukhitan et al. 2017; Chen et al. 2010; Maimon and Browarnik 2015; Wong et al. 2012; Zhang et al. 2016; Zhou 2007).

6.1.2 Axiom learning

Among all the OL studies referenced in this paper, which include more than 40 OL approaches and systems, there are only a few methods for axiom learning. Lin and Pantel (2001) showed that some of the extracted similarities between dependency tree paths correspond to inverse relations, such as \( author\_of \) and \( written\_by \), which could be used to axiomatize the meaning of some relations. Shamsfard and Barforoush (2004) suggested deriving axioms from conditional or quantified sentences such as 'All babies need milk', which can simply be used as a basis for defining general rules. In (Oliveira et al. 2001; Pereira et al. 2000), the authors proposed a method to produce rules from Is-A relations (such as \( Is\text{-}A(X, vegetarian) :- eat(X, vegetables) \)) and/or properties (such as \( property(X, friendly) :- property(X, small), have(X, fur) \)). However, as mentioned above, rules produced on the basis of properties are often invalid (Wong et al. 2012).

In the Fleischhacker and Völker (2011) study, the authors proposed inductive methods for enriching engineered ontologies with automatically generated disjointness axioms. These methods used semantic similarity (statistical correlation analysis to rate the strength of linear relationships between two value sequences) and association rule mining techniques for learning disjointness axioms. The study also outlined the ideas underlying two alternative methods supporting the discovery of negative association rules. The experiments showed that it is possible to induce disjointness axioms from an existing knowledge base with some accuracy, but there were two sources of error: first, the disjointness axioms determined by the automatic learning process can be incorrect; second, there can be incorrect explicit or implicit \( rdf:type \) assertions.

Völker et al. (2007) proposed a semi-automatic ontology engineering method for automatically generating formal class descriptions from natural language definitions extracted from Wikipedia and from a fishery glossary provided by the Food and Agriculture Organization of the United Nations. The implementation was based on a syntactic transformation of natural language definitions into OWL DL axioms in line with lexico-syntactic patterns. The method has many limitations and shortcomings according to the authors themselves. For example, the parser used in this method failed to deliver a parse, particularly for structurally complex or ill-formed sentences, and there were several problems, apart from the quality or efficiency of the syntactic analysis, concerning semantic ambiguity related to quantifiers or homonymy.

Mathews and Kumar (2017) proposed a new Controlled Natural Language (CNL) called TEDEI (TExtual DEscription Identifier) to generate corresponding axioms. The method is based on grammar-based syntactic transformation. However, as the authors themselves stated, the approach cannot handle any sentence that is not covered by the grammar. In addition, only one formalization per sentence is generated, without taking into account the impact of ambiguity on the formalization.

As observed from the mentioned literature, extracting rules or axioms from unstructured or even semi-structured data is a genuinely hard task (Gillani Andleeb 2015; Mathews and Kumar 2017), and it depends on the precision and recall achieved in the concept and relation extraction tasks.

6.2 Differences between shallow learning and deep learning

DL is considered the second generation of ANNs (Mo 2012). Traditional ANN models have shallow-structured architectures; these architectures typically contain a single layer of nonlinear feature transformations and lack multiple layers of adaptive non-linear features (Deng 2012; Deng and Yu 2014). Such shallow architectures can effectively solve many simple or well-constrained problems, but their modelling and representational power is limited (Deng 2012). Hence, for more complicated real-world applications such as human speech, natural language, and natural images, shallow architectures face many difficulties (Deng 2012; Deng and Yu 2014; Zouaq 2011), whereas deep architectures produce better results when dealing with these complicated problems.

A DL architecture can learn representations and features directly from the input with little or no prior knowledge. This feature learning promises to eventually eliminate the feature engineering used by shallow learning architectures (LeCun et al. 2015). In other words, the main feature distinguishing DL from shallow learning is that DL derives its own features directly from data (feature learning), while shallow learning relies on handcrafted features based on heuristics of the target problem. This means DL is better able to take advantage of increases in the amount of available computation and data (LeCun et al. 2015).

6.3 Deep learning techniques

There are three learning types of DL networks (Deng 2012; Deng and Yu 2014): (1) unsupervised or generative, when no target label data are available; (2) supervised or discriminative, when target label data are always available in direct or indirect form; and (3) hybrid, when discriminative criteria for supervised learning are used to estimate the parameters of a deep generative or unsupervised deep network. These DL networks can be feed-forward or recurrent. In feed-forward networks there are no connections between the neurons in the same layer, while in recurrent networks there may be connections between neurons in the same layer. Figure 1a shows an example of a feed-forward network structure, while Fig. 1b shows an example of a recurrent network structure.

Fig. 1 Examples of feed-forward and recurrent network structures
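
As a concrete illustration of the two structures in Fig. 1, the following minimal PyTorch sketch contrasts a feed-forward stack of fully connected layers with a recurrent (LSTM) layer; the layer sizes and input shapes are arbitrary assumptions.

```python
# A minimal PyTorch sketch contrasting a feed-forward stack with a recurrent layer.
import torch
import torch.nn as nn

feed_forward = nn.Sequential(      # no connections inside a layer,
    nn.Linear(50, 32), nn.ReLU(),  # information only flows forward
    nn.Linear(32, 10),
)
recurrent = nn.LSTM(input_size=50, hidden_size=32, batch_first=True)

x = torch.randn(4, 50)             # a batch of 4 feature vectors
print(feed_forward(x).shape)       # torch.Size([4, 10])

seq = torch.randn(4, 7, 50)        # a batch of 4 sequences of length 7
output, (h_n, c_n) = recurrent(seq)  # the hidden state is reused across time steps
print(output.shape)                # torch.Size([4, 7, 32])
```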

DL models use a deep graph with multiple processing layers, composed of multiple linear and non-linear transformations to model high-level abstractions in data (Deng and Yu 2014). There are four main DL models: Recurrent neural networks (RNNs), Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs) and Autoencoder Networks.

RNNs are unsupervised or supervised networks used to predict future elements of a data sequence from previous samples (Basegmez 2014; Deng and Yu 2014). RNNs perform well on language modelling tasks that involve sequential inputs, but training them has proved problematic (they are hard to train) (Basegmez 2014; LeCun et al. 2015). If the goal is to predict the next word from the previous words, RNNs usually work better (Le 2015).

Convolutional Neural Networks (CNNs) and Deep Belief Networks (DBNs) (with their respective variations) are the two milestones in the field of DL (Arel et al. 2010; Chen and Lin 2014; Mo 2012). CNNs are multi-layer, supervised, feed-forward networks, while DBNs are unsupervised or supervised feed-forward networks. CNNs have been used widely in image and video recognition, as in (Basegmez 2014; Kuang et al. 2018), and in a few NLP applications, as in (Kim 2014). DBNs have been used in many image and video recognition applications, as in (Fischer and Igel 2012; Hinton et al. 2006; Huang et al. 2007), and in a wide range of NLP applications, as in (Chen et al. 2010; Hinton and Salakhutdinov 2006; Salakhutdinov and Hinton 2007; Sarikaya et al. 2014; Zhong et al. 2016). In the current DL literature, it is observed that CNNs have performed better than DBNs on benchmark computer vision datasets such as MNIST, but on datasets outside computer vision, DBNs can perform better (Deng 2012; Deng and Yu 2014; Hu et al. 2016).

An autoencoder network is an unsupervised or supervised feed-forward network that is closely related to DBNs (Hinton 2009). It is a network whose output target is the input data itself, used for a compression objective as in (Le 2015) or a denoising objective as in (Baldi 2012). Table 3 summarizes RNNs, CNNs, DBNs, and autoencoder networks.

Table 3 Summary of the four main DL models

There have been many other attempts to build DL models and techniques, for example (Cohen 2005) and (Deng and Yu 2011). In (Cohen 2005), the author built stacks of CRFs to perform sequential classification when the segmentation information is unknown in the training data, while in (Deng and Yu 2011), the authors built a Deep Stacking Network (DSN), also named a Deep Convex Network (DCN), developed for large-vocabulary speech recognition. Other examples are (Albukhitan et al. 2017) and (Arguello Casteleiro et al. 2017). The authors of these studies joined the Continuous Bag of Words (CBOW) model with a Skip-gram model to extract concepts and relations; a minimal sketch of these two models is given below. CBOW and Skip-gram are machine learning models, each with only one hidden layer. The authors called this deep learning because they used the two models together and because the training of CBOW is somewhat like the training of DL models, even though it has a single hidden layer.
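
The following minimal sketch shows CBOW and Skip-gram training with the gensim Word2Vec implementation; the toy corpus is an illustrative assumption (the cited studies trained on large domain corpora).

```python
# A minimal gensim sketch of the CBOW (sg=0) and Skip-gram (sg=1) models.
from gensim.models import Word2Vec

sentences = [
    ["sharks", "are", "cartilaginous", "fish"],
    ["rays", "are", "cartilaginous", "fish"],
    ["salmon", "are", "bony", "fish"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # Skip-gram

# Nearest neighbours in the vector space hint at candidate related concepts.
print(cbow.wv.most_similar("sharks", topn=3))
print(skipgram.wv.most_similar("fish", topn=3))
```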

6.4 From shallow learning to deep learning trend

The major problem in existing OL systems is the problem of language understanding by machines using shallow processing of text (Zouaq 2011). This section discusses the reasons why most recent NLP techniques have moved from shallow towards deep techniques.

As observed in most recent research, DL handles several NLP tasks better than shallow learning systems (Chen et al. 2010; Petrucci et al. 2016; Zhong et al. 2016; Zouaq 2011), and it can handle large amounts of data effectively and efficiently. This holds from the most foundational NLP tasks, such as POS tagging and semantic role labelling, to the most complex tasks, such as sentiment analysis, question answering, OL, and machine translation (Grefenstette et al. 2014; Petrucci et al. 2016; Sarikaya et al. 2014).

Moreover, based on state-of-the-art research results on NLP with DL, such as (Albukhitan et al. 2017; Chen et al. 2010; Petrucci et al. 2016; Wang et al. 2018; Zhong et al. 2016), it can be concluded that using DL models for several NLP tasks gives better results than using traditional ANNs.

Furthermore, there are many advantages to using DL rather than ANNs. One advantage is that DL can process unstructured data efficiently (Najafabadi et al. 2015). Another is that it can learn prior knowledge from the input data and support different learning paradigms (Mo 2012). It can also extract features from labelled or unlabelled data without the need to engineer these features, and it is more efficient at extracting relationships and patterns in the data (Bengio and LeCun 2007; Najafabadi et al. 2015). Refer to Sect. 7 for more explanation and clarification.

6.5 Deep learning and ontology construction

It is noteworthy that in the last few years there has been a significant move towards deep analysis. Deeper techniques handle text understanding and knowledge engineering better than shallow techniques. For example, Chen et al. (2010) and Zhong et al. (2016) used DL models for extracting named entity attributes. In (Chen et al. 2010), the authors proposed an information extraction model, based on a DBN, to find and pair one of five types of relationships (Role, Part, At, Near, and Social) between pairs of (Chinese) named entities; they used the ACE 2004 Chinese dataset, which contains named entities of five types (person, organization, GPE, location, and facility). In (Zhong et al. 2016), the authors proposed an unsupervised method, based on a DBN, that extracts Chinese named entity attributes for person, location, and organization entities only; the method is called Entity Attribute Extraction Based on Deep Belief Network (EAEDB). This method uses CRFs to extract the named entities and then uses the proposed DBN to extract their attributes. The training file was built by combining IOB (named entity) tags and POS tags. Based on the experiments in the (Chen et al. 2010) study, using a DBN gave better results than using a Support Vector Machine (SVM) or a traditional Back Propagation Neural Network (BP-NN).

In addition, the authors in (Albukhitan et al. 2017) designed a framework for OL from Arabic text. They used the CBOW model together with the Skip-gram model to construct word representations in vector space and then extract taxonomic relations. The results of this study showed that the method gave better performance than traditional methods. Another study using the same technique is (Arguello Casteleiro et al. 2017), which used CBOW and Skip-gram to extract biomedical terms, concepts, and relations from PubMed data in the sepsis domain.

Moreover, in the (Petrucci et al. 2016) study, the authors proposed that OL can be treated as a transductive reasoning process whose intention is to convert knowledge from natural language (the source language) into a logic-based specification (the target language). The source language in this study was English and the target language was OWL. The proposed transduction process for sentences was divided into two phases, a sentence transduction phase and a sentence tagging phase, and an RNN model was used for this process. The main limitation of this study is that the model was trained and evaluated on a limited amount of data, meaning that it models only a limited portion of natural language in a sentence-by-sentence manner. In addition, converting one sentence into one axiom may not be compatible with large data or with different samples of collected data. Despite that, the results of this study provide good evidence of the potential of DL for long-term OL challenges.

Furthermore, in the (Wang et al. 2018) study, the authors used a CNN model to classify text into sets and then used this classified text and the TF-IDF matrix to construct an ontology for the shipping industry domain. This approach showed high classification accuracy, which improved the ontology construction. However, the ontology construction framework in this study was built by experts.

In the (Wang 2015) report, the author summarized recent advances in both DL and semantic data mining. In addition, he explained how DL can construct better data representations for the machine. One of the essential subjects discussed in this report is knowledge bases and ontology building, and how DL techniques could be applied to bridge the semantic gaps (knowledge gaps) between the data, applications, data mining algorithms, and data mining results. At the end of the report, the author presented his intentions and thoughts on addressing deep data representation with ontologies. He called his approach Deep Learning Ontology (DLO), which is based on DBNs. The DLO approach is an ontology-based DL framework that seeks to formally encode the concepts and three types of relations between these concepts (subclass, disjoint, and coexists) in the domain of the data labels.

There is much other research showing how DL can improve text analysis and knowledge representation, such as the (Collobert and Weston 2008) and (Neelakantan 2017) studies. In (Neelakantan 2017), the authors developed a new DL model called Neural Programmer (based on an RNN) for knowledge representation and reasoning, while in (Collobert and Weston 2008), the authors proposed a general DL architecture for NLP tasks (e.g. POS tagging, named entity recognition, language modelling and semantic role labelling). In the (Hassan and Mahmood 2018) study, the authors designed a joint CNN and RNN framework for sentence classification, using the recurrent layer as a substitute for the pooling layer; this framework takes advantage of the encoded local features extracted by the CNN model and the long-term dependencies captured by the RNN model. In addition, in the (Chicco et al. 2014) study, the authors used a deep autoencoder model to predict novel gene functions and annotations for creating and enriching a gene database; the results showed that the autoencoder performed better than traditional gene function prediction systems.

Based on state-of-the-art studies on DL and NLP, we can conclude that applying DL to OL and other NLP tasks yields promising results.

The ontology construction process combines NLP and data mining techniques. Table 4 summarizes the most relevant studies mentioned in this paper that use DL for any main task of ontology construction. For this summary, the ontology construction process is divided into three main tasks: Term Extraction, which includes term and concept extraction and/or their similarities; Relation Discovery, which includes attribute or relation extraction and/or classification; and Axiom Learning, which includes axiom and rule extraction and/or prediction.

Table 4 The summary of deep learning studies for ontology construction tasks

7 Automatic ontology construction by deep learning

Owing to DL's hierarchical structure and its ability to represent data and learn features directly from the input, the features can be distributed across different layers, and the level of abstraction increases as the input data are processed at each layer. For example, in object recognition the lowest layer is pixels, the higher layers are edges, patterns and parts, and the highest layer is the recognized object.

Based on the ontology construction challenges discussed in Sect. 2.2, decreasing human intervention in the ontology construction process is one of the major challenges for ontology construction systems; most of these systems are cooperative (see Table 2). One of the advantages of DL is that it can learn features directly from the input data (no manual feature engineering is needed), which naturally reduces human intervention. In addition, the results of concept and relation classification in most existing ontology construction systems are still less than satisfactory. Concepts and relations differ from one domain to another; there is no fixed set of concepts and relations that can be expected for a domain, especially a new domain area. Shallow learning systems are less effective and efficient for this type of problem, whereas DL promises better performance.

7.1 Deep learning to ontology learning process

Ontology construction techniques are a compound of NLP techniques and data mining techniques. Based on our knowledge and the related literature, we can give a general vision of where DL can be used to enhance the OL process. The problem of existing OL systems can be considered a problem of language understanding by machines. Text is unstructured data, and the sentences are sometimes complicated; current shallow learning systems cannot handle these complicated sentences efficiently, while the hierarchical technique of DL promises to process unstructured data efficiently by learning new features and representing the data for better language understanding by the machine.

For example, in the pre-processing and concept extraction phases, an appropriate DL model can be used for POS tagging, semantic role labelling, and semantic-syntactic parsing. Most OL approaches are founded on the use of syntactic analysis to extract relevant structures (e.g. concept extraction depends on noun phrases), so using DL to build a deeper analysis of sentence structure will enhance sentence understanding and relevant structure extraction. The relevant concepts and relations, which depend on the syntactic analysis, will then be extracted with better precision and recall.

In addition, as mentioned previously, there is no fixed set of concepts and semantic relations because they differ from one domain to another, and axiom learning depends on the concepts and relations. For the relation discovery and axiom learning phases, an appropriate DL model, which can learn new features of concepts and relations for classifying and representing them, can be used to build a trained model using the pre-processed corpus as input. This trained model can then be used to classify the concepts and to extract the semantic relations, and then to infer rules based on the classified concepts and extracted relations. DL can predict and learn prior knowledge as well as extract features without needing, for example, static templates to engineer these features. For further explanation, the following example and figures (Figs. 2 and 3) give a general vision of a DL network and how concepts can be classified by DL.

Fig. 2 An example of a DL network

Fig. 3 An example of how to use DL for building the concept classifier

Figure 2 presents an example of a DL network that has an input layer (\( V \)), three hidden layers (\( h0, h1, h2 \)), and an output layer (\( O \)); \( X \) refers to the input data. The dotted arrows represent the parameters for inferring samples from the posterior distribution at each hidden layer of the network. Figure 3 presents an example of how to build the concept classifier (trained model) through DL. This example gives a general vision of what a DL classifier does, regardless of the details of the DL model and its algorithms.

Now suppose we have the following concepts from the fish domain: hagfish, lampreys, sharks, sawfishes, and pelagic fish correspond to the labels, while fishes, jawless fishes, cartilaginous fishes, and bony fishes are the other concepts defined through the domain knowledge. Note that the example in Fig. 3 uses a top-down method to build the concept classifier (trained model); this trained model can then be used to classify concepts in a bottom-up fashion (with jawless fishes, cartilaginous fishes, and bony fishes considered as the labels). A small code sketch of such a classifier is given below.
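
The following minimal PyTorch sketch illustrates such a concept classifier with an input layer, three hidden layers (h0, h1, h2) and an output layer over the three higher-level fish classes; the random feature vectors stand in for term embeddings, and the layer sizes and training loop are purely illustrative assumptions.

```python
# A minimal PyTorch sketch of the concept classifier of Figs. 2 and 3.
# Random vectors stand in for embeddings of lower-level concepts (e.g. hagfish, sharks).
import torch
import torch.nn as nn

labels = ["jawless fishes", "cartilaginous fishes", "bony fishes"]

classifier = nn.Sequential(
    nn.Linear(100, 64), nn.ReLU(),   # h0
    nn.Linear(64, 32), nn.ReLU(),    # h1
    nn.Linear(32, 16), nn.ReLU(),    # h2
    nn.Linear(16, len(labels)),      # output layer O
)

x = torch.randn(5, 100)              # X: feature vectors of 5 lower-level concepts
y = torch.tensor([0, 1, 2, 1, 2])    # their higher-level classes (indices into labels)

optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(100):                 # a short illustrative training loop
    optimizer.zero_grad()
    loss = loss_fn(classifier(x), y)
    loss.backward()
    optimizer.step()

predicted = classifier(x).argmax(dim=1)
print([labels[i] for i in predicted.tolist()])
```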

Approximating an arbitrary function with a single hidden layer is, in principle, still possible. However, learning through multiple layers (DL) is easier and faster (Basegmez 2014; LeCun et al. 2015).

7.2 Deep automatic ontology construction system

Based on all of the above, building a deep automatic ontology construction system promises gains in effectiveness and efficiency and a reduction in time and effort. Different DL models can be used in different stages of the ontology construction process; in other words, it is possible to build a different DL model for each task in the ontology construction process. However, the main challenge of using DL for ontology construction is to build an appropriate deep network and pick the right method for the particular task.

It is worth mentioning that, despite all the promising results of using DL for ontology construction in the research discussed in this paper, there is no study that applies different DL models to the same ontology construction task in order to compare them and show which DL model is most appropriate for that task. This should be addressed in future research on deep automatic ontology construction.

8 Conclusion

This paper reviews, discusses, compares and criticizes different approaches to and systems for ontology construction from text. From the outcomes of reviewing existing ontology construction systems and approaches, this paper reveals a consensus on the mentioned aspects of automatic ontology construction that remain challenging and require further effort. In brief, relation discovery, axiom learning, reduction of human intervention, transformation of small- and large-scale input data, and the building of a standard platform for evaluating ontology construction systems are the major challenges for an ontology construction system. Following that, this paper presents the rationale for moving towards DL rather than traditional methods for OL. Then, research regarding OL and DL is presented. Finally, this paper gives a vision of how and where DL can be applied in the ontology construction process.

We can summarise the key issues that will likely define the near-future research directions for using DL in OL as follows: using DL to build deeper analysis of sentence structure (e.g. a deep semantic-syntactic parser); using DL to classify concepts and relations; and using DL to resolve the issues of learning and inferring relations and axioms.