
1 Introduction

In computer science, an ontology can be viewed as a formal representation of knowledge pertaining to a particular domain [18]. In simpler terms, an ontology provides the concepts of a domain and the relationships among those concepts. Machines perceive the contents of documents (blogs, articles, web pages, forums, scientific research papers, e-books, etc.) as sequences of characters. Much of the semantic information is already encoded in some form or other in these documents, and there is an increasing demand to convert this unstructured information into structured information. Ontology plays a key role in representing the knowledge hidden in these texts and making it understandable to both humans and computers.

Construction of a Domain Ontology provides various semantic solutions, including: (1) Knowledge Management, (2) Knowledge Sharing, (3) Knowledge Organization and (4) Knowledge Enrichment.

It can be used effectively in semantic computing applications ranging from Expert Systems [16] and Search Engines [22] to Question Answering Systems [7], solving day-to-day problems. For example, if a search engine is aware that “prokaryote” is a type of organism, better search results can be obtained and the recall of the system improves accordingly.

Ontologies are generally built under the supervision of domain experts, which makes their construction a time-intensive process. The corpora required for building ontologies are not always readily available; therefore, it is important to build a corpus from the web through crawling. Very little work is available that has incorporated crawling as a phase for collecting the corpus when building an ontology. Since general crawling does not always yield domain-related pages, many irrelevant pages are downloaded and filtering is required. Terms extracted using statistical measures or linguistic patterns are prone to noise and require an additional level of filtering using machine learning techniques. Moreover, most systems rely on manually annotated resources both for obtaining terms and for relation discovery. These resources, however, mostly contain domain-generic concepts and lack domain-specific concepts and relations [18]. Ontologies extracted using lexico-syntactic patterns are limited to certain patterns and require enrichment.

In this work we propose a framework that crawls websites relevant to the domain of interest and builds a Domain Ontology in an unsupervised manner, without the use of any annotated resource. The crawling framework uses a novel weighting measure to rank the domain terms. The proposed framework consists of five phases: Corpus Collection, Term Extraction, Taxonomic Relation Extraction, Non-Taxonomic Relation Extraction and Domain Ontology Building. The corpus is crawled using an iterative focused web crawler that downloads content pertinent to the domain by selectively rejecting URLs based on the link, anchor text and link context. Terms are extracted by feeding the graph-based HITS algorithm with Shallow Semantic Relations, and we propose the use of adjective modifiers to obtain fine-grained domain terms. Hearst patterns and morpho-syntactic patterns are extracted to build taxonomies, and non-taxonomic relations are obtained through Association Rule Mining on triples.

The organization of the paper is as follows: Sect. 2 describes Related Work, Sect. 3 the System Design, Sect. 4 the Results and Evaluation, and Sect. 5 the Conclusion and Future Work.

2 Related Work

In this section, we survey the literature on Corpus Collection, Term Extraction, Taxonomic Relation Extraction and Non-Taxonomic Relation Extraction.

2.1 Domain Corpus Collection

A Domain Corpus is a coherent collection of domain text. Building one requires an iterative focused (topical) web crawler to fetch pages pertinent to the domain of interest. In the work proposed by [6], a heuristic-based approach is used to locate anchor text by using the DOM tree instead of the entire HTML page. A statistical term weighting measure based on TF-IDF, called TFIPNDF (Term Frequency Inverse Positive Negative Document Frequency), was proposed for weighting anchor text and link context. Pages are classified as relevant or not relevant by a trained classifier, so the approach is entirely supervised. The work, however, lacks iterative learning of terms to classify pages [15].

2.2 Domain Term Extraction

Domain Terms are the elementary components used to represent the concepts of a domain. Examples of domain terms pertaining to the agricultural domain are “farming”, “crops”, “plants”, “fertilizers”, etc. Term Extraction is generally performed on a collection of domain documents using any of the following methods: statistical measures, linguistic measures, machine learning and graph-based measures.

Statistical Measure. The most common statistical measures make use of TF (Term Frequency) and IDF (Inverse Document Frequency). Meijer et al. [9] proposed four measures, namely Domain Pertinence, Lexical Cohesion, Domain Consensus and Structural Relevance, to compute the importance of terms in a domain. Drymonas et al. [3] used C/NC values to calculate the relevance of multiword terms in a corpus. These measures, however, fail to consider the context of terms and fail to capture the importance of infrequent domain terms.

Linguistic Measure. Linguistic measures traditionally acquire terms by using syntactic patterns such as Noun-Noun, Adjective-Noun, etc. For example, POS tagging of the sentence “Western Rajasthan and northern Gujarat are included in this region” tags “Western” as an adjective and “Rajasthan” as a noun. Lexico-syntactic patterns make use of predefined cues such as “including”, “like”, “such as”, etc., to extract terms. It is, however, tedious and time consuming to pre-define patterns.

Machine Learning. Machine learning is either supervised or unsupervised. Supervised learning requires the algorithm to be trained before usage and the target variable is known; commonly used supervised algorithms include Naive Bayes, Support Vector Machines and Decision Trees. In unsupervised learning, training is not required and hidden patterns are found using unlabeled data. The work of Uzun [21] assumes that training features are independent and therefore uses TF-IDF, the distance of a word to the beginning of the paragraph, word position with respect to the whole text and sentence, and probability features from a Naive Bayes classifier to decide whether a term is relevant. The drawback of using machine learning is that training incurs overhead and training data may not be available in abundance.

Graph Based Measure. Graph-based measures model the importance of a term and the relationships between terms in an effective way. The survey of graph methods by Beliga et al. [1] suggests that graphs can represent co-occurrence relations, semantic relations, syntactic relations and other relations (intersecting words from sentences, paragraphs, etc.). Ventura et al. [8] used a novel graph-based ranking method called “Terminology Ranking Based on Graph Information” to rank terms, with the Dice coefficient measuring the co-occurrence between two terms. Mukherjee et al. [10] used the HITS index with Shallow Semantic Relations as hubs and nouns as authorities; terms are filtered based on hub and authority scores.

2.3 Taxonomic Relation Extraction

Taxonomy construction involves building a concept hierarchy in which broader-narrower relations are stored and can be visualized as a hierarchy of concepts. For example, “rice”, “wheat” and “maize” come under “crop”. Taxonomies are commonly built using predefined patterns, as in the work by Hearst [4] and Ochoa et al. [12]. Meijer et al. [9] proposed constructing a taxonomy using the subsumption method, which calculates co-occurrence relations between different concepts. Knijff et al. [2] compared two methods, subsumption and hierarchical agglomerative clustering, for constructing taxonomies and concluded that the subsumption method is suitable for shallow taxonomies while hierarchical agglomerative clustering is suitable for deep taxonomies.

2.4 Non Taxonomic Relation Extraction

Non-Taxonomic Relations describe the non-hierarchical attributes of a concept. For example, in the non-taxonomic relation “predators eat plants”, “eat” is a feature of “predator”. Nabila et al. [11] proposed automatic non-taxonomic relation extraction that finds non-taxonomic relations between concepts in the same sentence as well as between concepts in different sentences. Serra and Girardi [14] proposed semi-automatic construction of non-taxonomic relations from a text corpus, where the association between two concepts is found by calculating the support and confidence scores between them.

To build a Domain Ontology from text, existing methods for Domain Term Extraction fail to identify low-frequency terms, cannot cover all syntactic patterns, and require annotated resources when machine learning approaches are used. Graph-based methods can address these problems, as they represent both the meaning and the composition of text and, unlike machine learning approaches, do not require manually annotated data. General non-taxonomic relation extraction methods are based on extracting predicates between two concepts; since not all predicates are domain specific, data mining techniques can help identify the domain relations effectively.

3 System Design

In this section we discuss the design of our system. Figure 1 shows the overall architecture of the proposed framework. The system consists of five major phases: (1) Domain Corpus Collection, (2) Domain Term Extraction, (3) Taxonomic Relation Extraction, (4) Non-Taxonomic Relation Extraction and (5) Domain Ontology Building.

3.1 Domain Corpus Collection

The corpus required for constructing an ontology may not be readily available for every domain. Since the quality of the corpus plays a vital role in deciding the quality of the ontology, iterative focused crawling is performed to download web pages relevant to the domain. A list of seed URLs is given as input to the iterative focused crawler. Web pages whose URL, anchor text or link context satisfy the relevance threshold are added to the URL queue. The depth of the pages to be crawled is specified. The output of the focused crawler is used as the corpus for ontology construction. Crawling is terminated when the relevance of the URLs to the context vector decreases drastically. The architecture of the crawler is depicted in Fig. 2.

Fig. 1. Architecture of proposed framework: unsupervised domain ontology construction from text

Fig. 2. Flow diagram of iterative focused crawler

Nouns are considered candidate terms for finding keywords in the domain. Therefore, nouns are extracted from the corpus using the Stanford part-of-speech tagger. The context vector of a noun is computed using the proposed weighted co-occurrence score. The weighted co-occurrence \(WCO(w_i, w_j)\) of two words \(w_i\) and \(w_j\) is given by:

$$\begin{aligned} WCO(w_i, w_j) = CO(w_i, w_j) \times idf(w_i) \times idf(w_j) \end{aligned}$$
(1)

In Eq. 1, \(idf(w_i)\) and \(idf(w_j)\) are the inverse document frequencies of the words \(w_i\) and \(w_j\), and \(CO(w_i, w_j)\) is their co-occurrence frequency. The proposed equation incorporates the inverse document frequencies so that terms which occur rarely but may be important to the domain are still weighted appropriately. Unit normalization of the context vector is performed to keep scores in the range 0 to 1. The normalized context vectors are summed along the columns and sorted in descending order, and the top-ranked terms are extracted as concepts based on a percentage cutoff.
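A minimal sketch of this weighting scheme is given below, assuming the corpus is already tokenized into per-document lists of nouns; the window size, the normalization by the maximum score and the variable names are illustrative simplifications of the procedure described above.

```python
import math
from collections import Counter

def weighted_cooccurrence(docs, window=5):
    """docs: list of token lists (nouns only). Returns ranked noun scores."""
    df = Counter()                      # document frequency of each noun
    for doc in docs:
        df.update(set(doc))
    n_docs = len(docs)
    idf = {w: math.log(n_docs / df[w]) for w in df}

    co = Counter()                      # co-occurrence counts within a window
    for doc in docs:
        for i, w in enumerate(doc):
            for v in doc[i + 1:i + 1 + window]:
                if v != w:
                    co[frozenset((w, v))] += 1

    # WCO(wi, wj) = CO(wi, wj) * idf(wi) * idf(wj)   (Eq. 1)
    wco = {}
    for pair, c in co.items():
        a, b = tuple(pair)
        wco[pair] = c * idf[a] * idf[b]

    # Normalise scores to [0, 1], sum per noun and rank in descending order
    norm = max(wco.values(), default=1.0)
    score = Counter()
    for pair, s in wco.items():
        for w in pair:
            score[w] += s / norm
    return score.most_common()
```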

The relevance of a web page is calculated by computing the average cosine similarity between the test domain vector and each of the domain vectors of the training documents. The relevance of a URL is checked without scanning the page itself, by computing the relevance of the HREF, the anchor text and/or the link context. Appropriate thresholds are set for HREF, anchor text and link context. If the HREF is not relevant (i.e., its relevance score falls below the threshold), the anchor text is checked for relevance; if the anchor text is not relevant either, the link context is finally checked.
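The cascaded relevance check can be sketched as follows; the threshold values and the assumption that the HREF text, anchor text and link context have already been converted into term vectors are illustrative.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two term vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def is_relevant(href_vec, anchor_vec, context_vec, domain_vecs,
                t_href=0.5, t_anchor=0.4, t_context=0.3):
    """Check the HREF first, then the anchor text, then the link context."""
    def avg_sim(vec):
        # average cosine similarity against the training domain vectors
        return float(np.mean([cosine(vec, d) for d in domain_vecs]))

    if avg_sim(href_vec) >= t_href:            # 1. the URL string itself
        return True
    if avg_sim(anchor_vec) >= t_anchor:        # 2. the anchor text
        return True
    return avg_sim(context_vec) >= t_context   # 3. the surrounding link context
```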

3.2 Domain Term Extraction

The domain corpus, which contains a rich collection of text documents, is pre-processed to identify the domain terms. Numbers, special characters, etc., which do not play a significant role in ontology construction, are removed.

Shallow Semantic Relation Extraction. Domain text documents are tokenized into sentences. These sentences are parsed using the Stanford Dependency Parser to identify the Shallow Semantic Relations between words. Shallow Semantic Relations represent the syntactic contextual relations within the sentences. In addition to the Shallow Semantic Relations extracted in [10], we also extract and use adjective modifiers obtained through dependency parsing. Since a significant share of domain terms are composed with adjective modifiers, it is important to consider these dependencies. For example, in the sentence “Biological research into soil and soil organisms has proven beneficial to organic farming.”, “organic farming” and “biological research” are tagged as adjective modifiers.
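Adjective-modifier extraction can be illustrated with the following sketch, which uses spaCy purely as a stand-in for the Stanford Dependency Parser employed in our pipeline.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def adjective_modifier_terms(text):
    """Return candidate multi-word terms formed by amod dependencies."""
    doc = nlp(text)
    return [f"{tok.text.lower()} {tok.head.text.lower()}"
            for tok in doc
            if tok.dep_ == "amod" and tok.head.pos_ == "NOUN"]

print(adjective_modifier_terms(
    "Biological research into soil and soil organisms has proven "
    "beneficial to organic farming."))
# e.g. ['biological research', 'organic farming']
```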

Domain Term Induction Using HITS. The HITS algorithm [5, 10] is applied to identify the most important domain terms. It is composed of two major components, hubs and authorities: hubs are represented by Shallow Semantic Relations and authorities by nouns. The hub score is calculated as the sum of the authority scores of the nouns a relation points to, and the authority score as the sum of the hub scores of the relations pointing to a noun. Hub and authority scores are computed recursively until they converge. Shallow Semantic Relations with high hub scores are selected as multi-grams and nouns with high authority scores are selected as unigrams; together, these unigrams and multi-grams constitute the domain terms.
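A minimal sketch of this bipartite HITS computation is shown below; the edge representation and the convergence criterion (a fixed number of iterations) are simplifying assumptions.

```python
import math

def hits(edges, hubs, auths, iters=50):
    """edges: (relation, noun) pairs linking hub nodes to authority nodes."""
    h = {r: 1.0 for r in hubs}
    a = {n: 1.0 for n in auths}
    for _ in range(iters):
        # authority score = sum of hub scores of relations pointing to the noun
        a = {n: sum(h[r] for r, m in edges if m == n) for n in auths}
        # hub score = sum of authority scores of nouns the relation points to
        h = {r: sum(a[n] for s, n in edges if s == r) for r in hubs}
        # L2-normalise both score vectors so the iteration converges
        an = math.sqrt(sum(v * v for v in a.values())) or 1.0
        hn = math.sqrt(sum(v * v for v in h.values())) or 1.0
        a = {k: v / an for k, v in a.items()}
        h = {k: v / hn for k, v in h.items()}
    return h, a   # high-scoring hubs -> multi-grams, authorities -> unigrams
```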

3.3 Taxonomic Relation Extraction

Taxonomic Relations represent hypernym-hyponym relations. A hypernym denotes the generic semantic field that subsumes a hyponym, and a hyponym denotes a more specific instance of its hypernym. Building the taxonomy involves two steps: (i) Hearst Pattern Extraction and (ii) Morpho-Syntactic Pattern Extraction.

Hearst Pattern Extraction. Hearst patterns [4] are commonly used to extract taxonomic relations from text. In our work we leverage the rule-based technique presented in that paper to induce the taxonomy. Sentences containing the domain terms are selected for identification of Hearst patterns and are tagged with a part-of-speech tagger to find taxonomic relations.
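The sketch below illustrates one such pattern (“NP such as NP, NP and NP”) with a naive regular expression over raw sentences; in the actual pipeline the sentences are POS-tagged first, and the full rule set of [4] covers several more patterns than shown here.

```python
import re

SUCH_AS = re.compile(r"(?P<hyper>\w+(?: \w+)?) such as (?P<hypos>[^.;]+)",
                     re.IGNORECASE)

def hearst_such_as(sentence):
    """Return (hypernym, hyponym) pairs matched by the 'such as' pattern."""
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        hyper = m.group("hyper").strip()
        for hypo in re.split(r",| and ", m.group("hypos")):
            if hypo.strip():
                pairs.append((hyper, hypo.strip()))
    return pairs

print(hearst_such_as("Cereal crops such as rice, wheat and maize."))
# [('Cereal crops', 'rice'), ('Cereal crops', 'wheat'), ('Cereal crops', 'maize')]
```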

Morpho Syntactic Pattern Extraction. In our work we also extract morpho-syntactic patterns [12] to obtain additional hypernym-hyponym relations. Two rules are followed to extract morpho-syntactic patterns, as sketched after the rules below.

Rule 1: If the term \({t_1}\) contains a suffix string \({t_0}\), then the term \({t_0}\) is the hypernym of the term \({t_1}\), provided the term \({t_0}\) or \({t_1}\) is a domain term. For example, “polysaccharide” is considered the hypernym of the term “homopolysaccharide”.

Rule 2: If the term \({t_0}\) is the head term of the term \({t_1}\), then \({t_0}\) is considered the hypernym of the term \({t_1}\), provided the term \({t_0}\) or \({t_1}\) is a domain term. Example: “sweet corn” is the hyponym of the word “corn”.
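The two rules can be sketched as follows; `domain_terms` is an assumed set of terms produced by the term-extraction phase, and the check is simplified to require the candidate hypernym itself to be a domain term.

```python
def suffix_hypernym(t1, domain_terms):
    """Rule 1: a term ending in another known term takes that term as hypernym."""
    for t0 in domain_terms:
        if t1 != t0 and t1.endswith(t0):
            return t0        # e.g. 'homopolysaccharide' -> 'polysaccharide'
    return None

def head_term_hypernym(t1, domain_terms):
    """Rule 2: the head (last word) of a multi-word term is its hypernym."""
    head = t1.split()[-1]
    return head if head != t1 and head in domain_terms else None
    # e.g. 'sweet corn' -> 'corn'

terms = {"polysaccharide", "corn"}
print(suffix_hypernym("homopolysaccharide", terms))   # polysaccharide
print(head_term_hypernym("sweet corn", terms))        # corn
```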

3.4 Non-Taxonomic Relation Extraction

Non-Taxonomic Relations represent the properties of an object; they carry no class-subclass relationship.

Triplet Extraction. A sentence is composed of three components: subject, predicate and object. A triplet in a sentence is defined as the relation between the subject and the object, with the relation being the predicate. Documents parsed using the Stanford Parser are input to the triplet extraction process, and the subject, predicate and object of each sentence are extracted using Rusu's triplet extraction algorithm [13].
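The sketch below shows a simplified subject-predicate-object extraction from dependency parses; spaCy is used here as a stand-in for the Stanford parse trees, and the heuristics are far simpler than the full triplet algorithm of [13].

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(text):
    """Return (subject, predicate, object) triples from each sentence."""
    triples = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ == "VERB":
                subj = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
                obj = [c for c in tok.children if c.dep_ in ("dobj", "obj", "attr")]
                if subj and obj:
                    triples.append((subj[0].text, tok.lemma_, obj[0].text))
    return triples

print(extract_triples("Predators eat plants. Farmers grow rice in South India."))
# e.g. [('Predators', 'eat', 'plants'), ('Farmers', 'grow', 'rice')]
```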

Association Rule Mining. Association Rule Mining [17] is performed to find the non-taxonomic relations between domain terms. The Apriori algorithm is used for frequent itemset generation and rule mining: frequent itemsets whose support crosses a suitable threshold are retained, and from these, association rules that satisfy a suitable confidence score are selected.
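A minimal sketch of the support/confidence computation is given below; it only mines rules between pairs of terms (a simplification of the full Apriori algorithm), and the thresholds and transaction format are illustrative assumptions.

```python
from itertools import combinations
from collections import Counter

def mine_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) rules over term pairs."""
    n = len(transactions)
    item_counts = Counter(i for t in transactions for i in set(t))
    pair_counts = Counter(p for t in transactions
                          for p in combinations(sorted(set(t)), 2))

    rules = []
    for (a, b), c in pair_counts.items():
        support = c / n
        if support < min_support:
            continue                      # prune infrequent itemsets
        for x, y in ((a, b), (b, a)):
            confidence = c / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules

# Each transaction holds the domain terms co-occurring in one extracted triple
txns = [{"predator", "plant"}, {"predator", "plant"}, {"rice", "south india"}]
print(mine_rules(txns))
```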

3.5 Domain Ontology Building

The concepts with their taxonomic and non-taxonomic relations are represented in Resource Description Framework (RDF) format. Each concept consists of a concept id, broader relations, narrower relations and any non-taxonomic relations associated with it. The broader/narrower relations are represented by class/subclass relations. A non-taxonomic relation consists of a property, a domain and a range: the domain of a property represents the subject of which that property is predicated, and the range represents the object. Example: “rice” is a concept with concept id “12143”, broader relation “crops”, narrower relations “long-grain rice”, “medium-grain rice” and “short-grain rice”, property “grows in”, domain “rice” and range “South India”.
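The running “rice” example could be encoded with rdflib roughly as follows; the namespace URI, the identifiers and the use of SKOS broader/narrower properties are illustrative assumptions rather than the exact serialization used in our system.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, SKOS

ONTO = Namespace("http://example.org/agri#")   # hypothetical namespace
g = Graph()
g.bind("onto", ONTO)
g.bind("skos", SKOS)

rice = ONTO["concept_12143"]
g.add((rice, RDF.type, SKOS.Concept))
g.add((rice, SKOS.prefLabel, Literal("rice")))
g.add((rice, SKOS.broader, ONTO["crops"]))                 # broader relation
for narrower in ("long-grain rice", "medium-grain rice", "short-grain rice"):
    g.add((rice, SKOS.narrower, ONTO[narrower.replace(" ", "_")]))

# Non-taxonomic relation: property "grows in", domain "rice", range "South India"
grows_in = ONTO["grows_in"]
g.add((grows_in, RDF.type, RDF.Property))
g.add((grows_in, RDFS.domain, rice))
g.add((grows_in, RDFS.range, ONTO["South_India"]))
g.add((rice, grows_in, ONTO["South_India"]))

print(g.serialize(format="turtle"))
```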

4 Results and Evaluation

4.1 Domain Corpus Collection

Domain Corpus Collection consists of implementing an iterative focused web crawler that crawls pages relevant to the domain. 22 seed URLs pertaining to the agriculture domain were given as input to the focused crawler, and 20,632 documents were obtained after crawling to a depth of 3: 22 relevant links were crawled at depth 0, 134 at depth 1, 816 at depth 2 and 19,732 at depth 3.

Table 1. Number of links crawled through HREF, anchor text and link context
Table 2. Number of documents in different similarity range

Table 1 shows the number of links crawled through HREF, anchor text and link context. It is observed that most of the links were found to be relevant through the HREF and the link context. The HREF usually contains the text present in the anchor text, so if the relevance check fails on the HREF there is a high probability that the link context will be checked. Table 2 shows the number of documents in different similarity ranges compared to the seed URL pages. It can be seen that most page similarities were in the range 0.5 to 0.6. It was also observed that the median relevance score follows a decreasing trend and that the number of irrelevant links crawled increased after a depth of 3. In our work, the Convergence Score [20] was used to evaluate the iterative focused crawler. It is defined as the ratio of the number of concepts present in the final crawl to the number of concepts present in the initial seed page set, and ranges between 0 and 1. The convergence score was 0.2 for baseline crawling and 0.43 for the proposed focused crawling, so the proposed crawling mechanism was roughly twice as effective as traditional baseline crawling approaches.

4.2 Domain Term Extraction

The precision of graph-based domain term extraction using the HITS algorithm is evaluated against a statistical approach combining Linguistic Patterns, Inverse Document Frequency and C-value (the LIDF score), the graph-based algorithm Terminology Ranking Based on Graph Information (TeRGraph) proposed by [8], and the sum of the statistical scores Domain Pertinence (DP), Domain Consensus (DC), Lexical Cohesion (LC) and Structural Relevance (SR) proposed in [9]; the results are shown in Table 3. The GENIA corpus used in [8] was used for evaluation. The results show that the graph-based HITS algorithm achieves better precision than the statistical measures and the graph-based TeRGraph algorithm.

Table 3. Precision scores of term extraction using HITS, LIDF, TeRGraph and DP+DC+LC+SR

4.3 Domain Ontology

Hearst patterns and morpho-syntactic patterns were used to induce the taxonomy. A total of 6539 Hearst patterns and 2149 morpho-syntactic patterns were extracted to construct the taxonomy. 5216 triples were extracted and 357 non-taxonomic relations were identified using Association Rule Mining. In our work, the Domain Ontology was evaluated using the metric-based evaluation techniques Inheritance Richness and Class Richness [19].

Class Richness. This metric relates to how instances are distributed across classes. The number of classes that have instances in the KB is compared with the total number of classes, giving a general idea of how well the KB utilizes the knowledge modeled by the schema classes. A low Class Richness implies that the KB does not have data exemplifying all the class knowledge that exists in the schema, while a high CR indicates that the data in the KB represents most of the knowledge in the schema. Table 4 shows the Class Richness scores for the taxonomy and non-taxonomy learning methods.

Inheritance Richness. The Inheritance Richness measure describes the distribution of information across the different levels of the ontology's inheritance tree, i.e., the fan-out of parent classes. This is a good indication of how well knowledge is grouped into different categories and subcategories in the ontology. The measure can distinguish a horizontal ontology (where classes have a large number of direct subclasses) from a vertical ontology (where classes have a small number of direct subclasses). Table 4 shows the Inheritance Richness scores for the taxonomy and non-taxonomy learning methods. From the results of the evaluation metrics (class richness and inheritance richness), it is evident that the constructed ontology has good density, indicating that the extracted concepts represent wide knowledge of the domain.
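A minimal sketch of how these two metrics are typically computed in metric-based ontology evaluation [19] is shown below; the dict-based inputs and example data are illustrative assumptions.

```python
def class_richness(instances_per_class):
    """Fraction of schema classes that have at least one instance in the KB."""
    populated = sum(1 for insts in instances_per_class.values() if insts)
    return populated / len(instances_per_class)

def inheritance_richness(subclasses_per_class):
    """Average number of direct subclasses per class (fan-out)."""
    total_sub = sum(len(subs) for subs in subclasses_per_class.values())
    return total_sub / len(subclasses_per_class)

classes = {"Crop": ["basmati_rice"], "Fertilizer": [], "Soil": ["red_loam"]}
hierarchy = {"crop": ["cereal", "pulse"], "cereal": ["rice", "wheat"], "pulse": []}
print(class_richness(classes))          # 2/3
print(inheritance_richness(hierarchy))  # (2 + 2 + 0) / 3
```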

Table 4. Inheritance and class richness scores

5 Conclusion and Future Work

In our work, we developed an iterative focused crawler for collecting domain corpora, with each element in the co-occurrence matrix weighted as the product of the co-occurrence frequency and the IDF of the row and column terms. Domain terms were extracted in an unsupervised manner, without any manually annotated resource, using the HITS algorithm with Shallow Semantic Relations as hubs and nouns as authorities. Noise was removed from the ranked terms using Domain Pertinence. The taxonomy was induced using Hearst patterns and morpho-syntactic patterns, and the ontology was built automatically from scratch without supervision. In the future, we intend to exploit deep learning methods for building Domain Ontologies to make them more meaningful and useful.