
1 Introduction

The popularity of RDF resources has grown steadily since 2001; for example, Linked Data grew to 32 billion RDF triples by 2011. The growing demand for RDF resources pushes organizations to publish their own data in RDF format, and they create RDF data by transforming their legacy data, such as relational databases (RDB) or Web pages, with transformation programs [5, 11, 18, 30]. However, the use of RDF data sets generated automatically by such programs is hindered by the lack of an ontology describing the relationships resident in the metadata. For example, Linked Life Data realizes mappings at both the instance and predicate levels, but not at the class level, because it lacks an ontology or concept hierarchy.

There are three ways to solve this problem: (1) generate an ontology from the data sources before publishing RDF data [1, 26, 35, 36]; (2) map entities to existing ontologies during RDF data generation [3, 28]; and (3) generate an ontology directly from RDF resources or Linked Data sets [24, 40], a task similar to generating a concept hierarchy from retrieved documents or social tags in Information Retrieval (IR), known as hierarchical document clustering [31, 33]. Linked Data sets and RDF resources store all the triples of an entity as an RDF document [4], which makes hierarchical document clustering methods potentially applicable to ontology generation in the Semantic Web.

In this paper, we introduce a framework for generating an ontology from Linked Data or other RDF resources based on hierarchical document clustering algorithms. We adapted the three most popular methods, UPGMA [20], Subsumption [31], and EXT [19], to generate ontologies in our framework. We evaluated the three algorithms in a preliminary experiment, which shows that EXT is the best algorithm for building an ontology from RDF resources. We also observed that the quality of the generated ontology is affected by the number of concepts used to represent the entities and to formalize the classes in the ontology, and we suggest improving the quality by changing the parameters of the algorithms to adjust the number of layers in the generated ontology.

The rest of the paper is organized as follows: Sect. 2 reviews related work; Sect. 3 introduces the framework designed for ontology generation; Sect. 4 systematically describes the methods used in the framework; Sect. 5 presents our experimental results; and finally, Sect. 6 concludes.

2 Related Work

Ontology generation, or concept hierarchy generation, is considered a branch of Knowledge Discovery and can be used to describe the relationships among metadata [16, 23]. Several earlier studies used clustering and classification algorithms to build taxonomical relationships from inter-correlated attributes of databases [13, 17]. Later, as the popularity of the Semantic Web grew, more studies applied other machine learning approaches to build ontologies, especially taxonomical relationships [7, 32], to help transform databases into RDF data or to boost applications of the Semantic Web.

Taxonomical relationships, or concept hierarchies, extracted from text data have been used to present search results in a hierarchy. The most traditional methods for building such a hierarchy are hierarchical clustering algorithms, namely agglomerative UPGMA and bisecting k-means [20]; bisecting k-means is considered to outperform UPGMA [34]. Probabilistic models have also been used to build hierarchies and are reported to perform better than the traditional algorithms. Subsumption [31], a co-occurrence relationship between two concepts extracted from documents, is used to organize a concept hierarchy for retrieved documents; it estimates the probability of an is_A relationship between two concepts and is considered one of the classical methods for concept hierarchy generation. Studies such as [9, 33] improved subsumption-based approaches for different purposes, and others advanced the precision of the subsumption-based method with probabilistic models. DSP [22] computes the importance of every concept by topicality and predictiveness, and the most important concepts are placed at the top level of the hierarchy using a greedy approximation of the Dominating Set Problem (DSP). DisCover [21] maximizes hierarchy coverage while maintaining the distinctiveness of concepts to identify the concepts in the hierarchy, and achieves better precision than DSP. FIHC [14] measures the cohesiveness of a cluster using frequent item sets calculated from the global support and the cluster support, and is reported to outperform agglomerative UPGMA and bisecting k-means [34]. Other well-known studies apply Formal Concept Analysis (FCA) to build a hierarchy [8, 12]; they build concept-feature lattices and remove features from the formalized graph to produce a human-readable hierarchy. Reference [25] uses a Self-Organizing Map (SOM) to build the hierarchy with three different feature extraction approaches.

Concept hierarchies and taxonomical relationships have also been studied for social tags. Reference [33] uses Subsumption to induce a hierarchy from Flickr tags. Reference [37] builds the hierarchy using heuristic rules and deep syntactic analysis. EXT [19] induces a similarity graph to build the hierarchical taxonomy: an extensible greedy algorithm places the concepts that are central in the similarity graph into the hierarchy. Reference [19] also replaces the extensible greedy algorithm with a Directed Acyclic Graph (DAG) allocation algorithm, which allows classes to keep multiple superclasses in the hierarchy.

In this research, we chose the three most representative algorithms from the existing studies: a classic hierarchical clustering algorithm, UPGMA [20, 34]; a popular probabilistic algorithm, Subsumption [9, 31]; and the most recent approach, EXT [19].

3 Framework for Building Taxonomical Relationships of Entities

Fig. 1. The framework for building taxonomical relationships of entities

We separated the procedure of taxonomical relationship generation into four parts as shown in Fig. 1: Data Preparation, Pre-processing, Taxonomical Relationship Induction and Post-processing.

Data Preparation is designed to collect every piece of information about an entity to form an RDF document. We partitioned the RDF data graphs into small graphs called RDF documents, each of which contains the description of one entity; the triples or quads are extracted to form a star-shaped document [4], as sketched below. For example, an RDF document named “Acetylsalicylic acid” formalized by Data Preparation contains all the triples with the entity “Drugbank:drugs/DB00945” as the subject.
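As an illustration of this partitioning, the following is a minimal sketch (not the authors' implementation) that groups triples by subject into star-shaped documents; the triples and identifiers shown are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; a real run would parse N-Triples/N-Quads.
triples = [
    ("Drugbank:drugs/DB00945", "rdfs:label", "Acetylsalicylic acid"),
    ("Drugbank:drugs/DB00945", "drugbank:category", "Anti-Inflammatory Agents"),
    ("Drugbank:drugs/DB01050", "rdfs:label", "Ibuprofen"),
]

def partition_into_star_documents(triples):
    """Group all triples sharing a subject into one star-shaped RDF document."""
    documents = defaultdict(list)
    for s, p, o in triples:
        documents[s].append((s, p, o))
    return dict(documents)

docs = partition_into_star_documents(triples)
print(docs["Drugbank:drugs/DB00945"])
```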

Pre-processing is designed to extract important nouns and noun phrases as candidate concepts of the RDF documents. First, nouns and noun phrases are extracted with a Part-Of-Speech (POS) tagger. Second, after lemmatizing and stemming, the nouns and noun phrases are treated as candidate concepts for building the concept-entity matrix. Third, the concept-entity matrix is used to extract important concepts based on the Vector Space Model (VSM) and Latent Semantic Analysis (LSA) [15]: the components of the matrix are weighted by Term Frequency and Inverse Document Frequency (TF-IDF) [29], and the matrix is decomposed using Singular Value Decomposition (SVD) [2]. The important concepts are extracted by reducing the context dimension with LSA, and they are used to generate the new concept-entity matrix produced by pre-processing.

Taxonomical Relationship Induction is designed to construct the hierarchy based on the concept-entity matrix from the Pre-processing step. We implemented UPGMA, Subsumption and EXT introduced in Sect. 2 to build the taxonomy.

Post-processing is designed to instantiate the concepts of the hierarchy created in Taxonomical Relationship Induction. The concepts in the hierarchy are treated as classes in the ontology, and the RDF documents of entities that contain those concepts are treated as instances of the corresponding classes.

Fig. 2. An example of data preparation

4 Methodology

4.1 Data Preparation and Pre-processing

Linked Data and other RDF resources are stored as either triples or RDF documents [6]. To adapt the existing hierarchical clustering algorithms, we need to process each entity as a document, which in effect partitions the whole data graph into small sub-RDF documents. For example, in Fig. 2, a data graph containing four entities is partitioned into four sub-RDF documents. We extracted nouns and noun phrases from each RDF document using a POS tagger after removing punctuation marks and stop-words, as described in [27, 39]. Tokens tagged NN (singular noun), NNP (proper noun), NNS (plural noun), and NNPS (plural proper noun) are collected into noun sequences. We generated candidate concepts from the noun sequences with three types of n-gram: unigrams, bigrams, and trigrams [38], as sketched below. The candidate concepts are lemmatized and stemmed before building the concept-entity matrix, which is used to find the most important concepts that represent each entity.
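The sketch below illustrates the candidate-concept step under simplifying assumptions: it starts from tokens that have already been POS-tagged (a real run would also apply stop-word removal and lemmatization/stemming) and emits unigram, bigram, and trigram candidates from maximal noun sequences. The tagged tokens are hypothetical.

```python
NOUN_TAGS = {"NN", "NNP", "NNS", "NNPS"}

def noun_sequences(tagged_tokens):
    """Split a tagged token stream into maximal runs of nouns and proper nouns."""
    run, runs = [], []
    for token, tag in tagged_tokens:
        if tag in NOUN_TAGS:
            run.append(token.lower())
        elif run:
            runs.append(run)
            run = []
    if run:
        runs.append(run)
    return runs

def candidate_concepts(tagged_tokens, max_n=3):
    """Generate unigram, bigram, and trigram candidates from each noun sequence."""
    candidates = set()
    for seq in noun_sequences(tagged_tokens):
        for n in range(1, max_n + 1):
            for i in range(len(seq) - n + 1):
                candidates.add(" ".join(seq[i:i + n]))
    return candidates

# Hypothetical tagged tokens for one RDF document.
tagged = [("Acetylsalicylic", "NNP"), ("acid", "NN"), ("is", "VBZ"),
          ("an", "DT"), ("anti-inflammatory", "JJ"), ("agent", "NN")]
print(candidate_concepts(tagged))
```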

We adopted the VSM to represent documents. Each RDF document in the collection is defined as a single multidimensional vector \(\varvec{d}=[w_{1},w_{2},\ldots ,w_{n}]\), where each component \(w_{i}\) is the weight of dimension \(i\) and each dimension corresponds to a candidate concept. We used TF-IDF to weight each component of \(\varvec{d}\), calculated as follows:

$$\begin{aligned} w_{i,d}=tf_{i,d}*\log \frac{\#documents}{\#(documents \ containing \ concept \ i)} \end{aligned}$$
(1)

where \(tf_{i,d}\) is the normalized frequency of concept \(i\) in document \(d\).
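A minimal sketch of Eq. 1 over a toy collection; the choice of normalizing \(tf_{i,d}\) by the most frequent concept in the document, the natural logarithm, and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf(documents):
    """documents: dict mapping document id -> list of concept occurrences."""
    n_docs = len(documents)
    df = Counter()                      # document frequency per concept
    for concepts in documents.values():
        df.update(set(concepts))
    weights = {}
    for doc_id, concepts in documents.items():
        counts = Counter(concepts)
        max_count = max(counts.values())
        weights[doc_id] = {
            c: (counts[c] / max_count) * math.log(n_docs / df[c])
            for c in counts
        }
    return weights

docs = {"E1": ["anticoagulant", "anticoagulant", "synthase"],
        "E2": ["synthase", "inhibitor"]}
print(tfidf(docs)["E1"]["anticoagulant"])
```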

The collection of document vectors \(\varvec{d}\) is used to build an entity-concept matrix, which is transposed into a concept-entity matrix. We extracted important concepts based on LSA after decomposing the matrix with SVD. SVD [2] is a data reduction method that approximates the original concept-entity matrix in a lower dimension. With this dimension reduction, a concept vector can be represented by its values in a new, lower-dimensional context instead of the original dimension given by the number of entities. The concept-entity matrix \(A\) is decomposed into three matrices: an orthogonal matrix \(U\), a diagonal matrix \(S\), and the transpose of an orthogonal matrix \(V\). SVD can be written as follows:

$$\begin{aligned} A=U*S*V^{T} \end{aligned}$$
(2)

where the columns of \(U\) are orthogonal eigenvectors of \(AA^{T}\), the columns of \(V\) are orthogonal eigenvectors of \(A^{T}A\), and \(S\) is a diagonal matrix containing the singular values, i.e., the square roots of the eigenvalues of \(AA^{T}\) (equivalently \(A^{T}A\)), in descending order. \(U\) can be regarded as the representation of the concepts in the context \(S\). LSA [15] finds a rank-\(k\) approximation \(A_{k}\) of the concept-entity matrix \(A\) by keeping the \(k\) largest singular values of \(S\) together with the first \(k\) columns of \(U\) and the first \(k\) rows of \(V^{T}\). Hence \(U_{k}\), formed by the first \(k\) columns of \(U\), can be regarded as the representation of the top \(k\) important concepts [39]. In practice, we extracted, for each column of \(U_{k}\), the concepts with the largest weights; the entity can then be represented with the extracted important concepts. The top \(k\) important concepts and the entities form a new concept-entity matrix \(M\). A simple example of the pre-processing procedure is shown in Fig. 3.
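The following sketch illustrates the LSA step with numpy: the concept-entity matrix is decomposed with SVD, truncated to the top \(k\) singular values, and the most heavily weighted concept in each column of \(U_{k}\) is kept as an important concept. The toy matrix and concept names are assumptions.

```python
import numpy as np

def important_concepts(A, concepts, k):
    """A: concept-by-entity TF-IDF matrix (rows = concepts). Returns the concept
    with the largest absolute weight in each of the first k columns of U."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]                       # concepts represented in the reduced context
    picked = {concepts[np.argmax(np.abs(U_k[:, j]))] for j in range(k)}
    return picked, U_k, S[:k], Vt[:k, :]

concepts = ["anticoagulant", "synthase", "inhibitor", "agent"]
A = np.array([[1.2, 0.0, 0.7],
              [0.0, 0.9, 0.4],
              [0.3, 0.0, 0.0],
              [0.0, 0.5, 0.6]])          # 4 concepts x 3 entities
picked, U_k, S_k, Vt_k = important_concepts(A, concepts, k=2)
print(picked)
```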

Fig. 3. An example of the procedure of pre-processing

4.2 Taxonomical Relationship Induction

In this section, we present how each of the three methods constructs a concept hierarchy from the new concept-entity matrix \(M\).

UPGMA. UPGMA [20] constructs a concept hierarchy based on the similarity of two concepts. The computation of the similarity of two concepts is based on VSM as shown below:

$$\begin{aligned} Sim(c_{1},c_{2})=\frac{\sum _{i=1}^{n}(w_{(e_{i},c_{1})}*w_{(e_{i},c_{2})})}{\sqrt{\sum _{i=1}^{n}(w_{(e_{i},c_{1})})^{2}}*\sqrt{\sum _{i=1}^{n}(w_{(e_{i},c_{2})})^{2}}} \end{aligned}$$
(3)

where \(w_{(e_{i},c_{j})}\) is the TF-IDF value of entity \(e_{i}\) and concept \(c_{j}\) computed by Eq. 1. The new node, the superclass of the two most similar concepts, is created as a virtual class that holds the common entities shared by the two concepts. We generate the concept hierarchy using Algorithm 1, shown below:

Algorithm 1 (pseudocode figure not shown)
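Since the pseudocode figure is not reproduced here, the following is a minimal sketch of the UPGMA-style agglomerative step over rows of the concept-entity matrix \(M\), using the cosine similarity of Eq. 3. It is an illustration under simplifying assumptions (the merged virtual class is represented by the average of its members' vectors), not the authors' Algorithm 1.

```python
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def upgma(M, names):
    """M: concept-by-entity matrix. Repeatedly merge the two most similar clusters.
    The merged (virtual) class is represented here by the average of its members'
    vectors; in the paper it keeps the entities shared by the two concepts."""
    clusters = {name: M[i] for i, name in enumerate(names)}
    merges = []                                   # (virtual class, child a, child b)
    step = 0
    while len(clusters) > 1:
        keys = list(clusters)
        pairs = [(cosine(clusters[a], clusters[b]), a, b)
                 for i, a in enumerate(keys) for b in keys[i + 1:]]
        _, a, b = max(pairs)
        step += 1
        parent = "virtual_%d" % step
        merges.append((parent, a, b))             # the virtual class is the superclass of a and b
        clusters[parent] = (clusters.pop(a) + clusters.pop(b)) / 2.0
    return merges

M = np.array([[1.0, 0.8, 0.0],
              [0.9, 0.7, 0.1],
              [0.0, 0.1, 1.0]])
print(upgma(M, ["anticoagulant", "coagulation factor", "synthase"]))
```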

Subsumption. Subsumption [31] defines the is_A relationship based on the co-occurrence of concepts across the RDF documents of entities. Given two concepts \(c_{1}\) and \(c_{2}\), \(c_{1}\) is said to subsume \(c_{2}\) if the following conditions are satisfied:

$$\begin{aligned} P(c_{1}|c_{2})>0.8, \ P(c_{2}|c_{1})<1 \end{aligned}$$
(4)

We generate the concept hierarchy using Algorithm 2, shown below:

Algorithm 2 (pseudocode figure not shown)
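A minimal sketch of the subsumption test of Eq. 4 over a concept-to-documents occurrence table; the thresholds follow the equation, while the toy data are assumptions.

```python
def subsumption_pairs(occurrences, upper=0.8):
    """occurrences: dict mapping concept -> set of entity documents containing it.
    Returns (parent, child) pairs where P(parent|child) > upper and P(child|parent) < 1."""
    pairs = []
    for c1, docs1 in occurrences.items():
        for c2, docs2 in occurrences.items():
            if c1 == c2 or not docs1 or not docs2:
                continue
            p_c1_given_c2 = len(docs1 & docs2) / len(docs2)
            p_c2_given_c1 = len(docs1 & docs2) / len(docs1)
            if p_c1_given_c2 > upper and p_c2_given_c1 < 1:
                pairs.append((c1, c2))       # c1 subsumes (is a superclass of) c2
    return pairs

occurrences = {"agent": {"E1", "E2", "E3", "E4"},
               "anticoagulant": {"E1", "E3"},
               "synthase": {"E2"}}
print(subsumption_pairs(occurrences))
```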

EXT. EXT [19] uses an extensible greedy algorithm to place the most important concept as a subclass of the most similar concept at the current level of the concept hierarchy. The order in which concepts are chosen is determined by their centrality in the similarity graph created from the concept-entity matrix. The centrality of a concept in the similarity graph is computed as follows:

$$\begin{aligned} Cen(c_{x})=\sum _{c_{x}\ne c_{y} \ne c_{z} \in C} \frac{ \delta _{c_{y}c_{z}}(c_{x}) }{ \delta _{c_{y}c_{z}} } \end{aligned}$$
(5)

where \(\delta _{c_{y}c_{z}}\) is the total number of shortest paths from concept \(c_{y}\) to concept \(c_{z}\), and \(\delta _{c_{y}c_{z}}(c_{x})\) is the number of those shortest paths that pass through \(c_{x}\). The similarity graph consists of vertices that are concepts; an edge is added if the similarity of two concepts is above the threshold \(\alpha \). The similarity of two concepts is calculated with Eq. 3, and whether a concept \(c_{i}\) is placed as a subclass of a concept \(c_{j}\) depends on the threshold \(\beta \). The concept hierarchy is generated using Algorithm 3 (pseudocode figure not shown).
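A minimal sketch of an EXT-style placement, assuming the networkx library for the betweenness centrality of Eq. 5: edges are added to the similarity graph when the cosine similarity exceeds \(\alpha \), concepts are processed in decreasing centrality, and each concept is attached under the most similar already-placed concept if that similarity exceeds \(\beta \), otherwise under the root. It is an illustration, not the authors' Algorithm 3.

```python
import numpy as np
import networkx as nx

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def ext_hierarchy(M, names, alpha=0.5, beta=0.5):
    """M: concept-by-entity matrix. Returns a dict child -> parent ('ROOT' for top level)."""
    sims = {(a, b): cosine(M[i], M[j])
            for i, a in enumerate(names) for j, b in enumerate(names) if i < j}
    G = nx.Graph()
    G.add_nodes_from(names)
    G.add_edges_from((a, b) for (a, b), s in sims.items() if s > alpha)
    centrality = nx.betweenness_centrality(G)            # Eq. 5
    order = sorted(names, key=lambda c: centrality[c], reverse=True)
    parent, placed = {}, []
    for c in order:
        best = max(placed, key=lambda p: sims.get((p, c), sims.get((c, p), 0.0)), default=None)
        best_sim = sims.get((best, c), sims.get((c, best), 0.0)) if best is not None else 0.0
        parent[c] = best if best is not None and best_sim > beta else "ROOT"
        placed.append(c)
    return parent

M = np.array([[1.0, 0.8, 0.2],
              [0.9, 0.7, 0.1],
              [0.1, 0.2, 1.0]])
print(ext_hierarchy(M, ["agent", "anticoagulant", "synthase"]))
```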

4.3 Post-processing

We put all the entities into the generated concept hierarchy based on the concept-entity matrix, as sketched below. For example, the class named “anticoagulant” in Fig. 3 is assigned three instances: “E1:anisindione”, “E3:prostaglandin G/H synthase 1”, and “antipyrine”. Note that, since UPGMA and Subsumption support multiple inheritance [31, 34], we also allow multiple inheritance when assigning instances during post-processing.
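A minimal sketch of this instance assignment, assuming a binary concept-entity matrix; an entity becomes an instance of every class whose concept occurs in its RDF document, so multiple parents are preserved.

```python
def assign_instances(M, concepts, entities):
    """M: concept-by-entity 0/1 matrix. Returns class -> list of instance entities."""
    instances = {}
    for i, concept in enumerate(concepts):
        instances[concept] = [entities[j] for j in range(len(entities)) if M[i][j]]
    return instances

M = [[1, 0, 1],
     [0, 1, 0]]
print(assign_instances(M, ["anticoagulant", "synthase"], ["E1", "E2", "E3"]))
```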

5 Preliminary Experiment

We implemented our framework on JDK 1.6 and ran it on an Intel Core i5 CPU with 4 GB RAM and a 1 TB hard disk under Ubuntu.

5.1 Data Set and Gold Standard

We used Diseasome from Linked Life Data as the source data for our experiment. Diseasome supplies entities about diseases and also describes schema relationships. We separated the data set into two parts: data, which contains the triples of entities, and a gold standard, which contains the triples about the schema. For example, given the two triples “\(<Diseasome:diseases/272> <Diseasome:class> <Diseasome:diseaseClass/Endocrine>\)” and “\(<Diseasome:diseases/2013> <Diseasome:subtypeOf> <Diseasome:diseases/272>\)”, we obtain an is_A relationship in which a class named “Endocrine” has a subclass named “diseases:272” with an instance named “diseases:2013”. Diseasome contains 1308 disease instances, which we used to create a disease ontology in our experiment.
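A minimal sketch of this split, assuming that the schema-level predicates are exactly `Diseasome:class` and `Diseasome:subtypeOf` as in the example above; the triple list is a hypothetical fragment.

```python
SCHEMA_PREDICATES = {"Diseasome:class", "Diseasome:subtypeOf"}

def split_dataset(triples):
    """Separate schema (gold standard) triples from instance (data) triples."""
    gold = [t for t in triples if t[1] in SCHEMA_PREDICATES]
    data = [t for t in triples if t[1] not in SCHEMA_PREDICATES]
    return data, gold

triples = [
    ("Diseasome:diseases/272", "Diseasome:class", "Diseasome:diseaseClass/Endocrine"),
    ("Diseasome:diseases/2013", "Diseasome:subtypeOf", "Diseasome:diseases/272"),
    ("Diseasome:diseases/2013", "rdfs:label", "a disease label (hypothetical)"),
]
data, gold = split_dataset(triples)
print(len(data), len(gold))
```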

5.2 Data Preparation and Pre-Processing

We partitioned the data into star-shaped RDF documents and parsed the documents using the POS tagger. The extracted nouns and noun phrases were stemmed and lemmatized with the Lucene String Parser to form a concept-entity matrix. We used JAMA to run the SVDs for choosing important concepts for the entities. The SVDs were computed with different values of k (100, 300, and 500), each of which is the number of concepts extracted to build an ontology. The three dimensionally reduced concept-entity matrices cover different subsets of the entities in the original data set, which affects the recall of the constructed ontology. Figure 4 shows the coverage of the entities for the different values of k. The coverage is computed as \(coverage(k)=\frac{\#entities(k)}{\#entities}\), where \(entities(k)\) is the set of entities remaining after reducing the dimension to \(k\). Using 100 concepts covers only 13% of the entities, while using 500 concepts covers 51%. We observed that coverage increases as the number of concepts increases.
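A minimal sketch of the coverage computation, assuming that an entity counts toward \(entities(k)\) when at least one of the \(k\) important concepts occurs in its RDF document; this reading of \(entities(k)\) is an assumption of the sketch.

```python
def coverage(entity_concepts, important_concepts):
    """entity_concepts: entity -> set of its candidate concepts."""
    covered = [e for e, cs in entity_concepts.items() if cs & important_concepts]
    return len(covered) / len(entity_concepts)

entity_concepts = {"E1": {"anticoagulant"}, "E2": {"synthase"}, "E3": {"agent"}}
print(coverage(entity_concepts, {"anticoagulant", "synthase"}))  # 2/3
```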

Fig. 4. The coverage of the entities for different numbers of concepts used to build an ontology

5.3 Criteria for Evaluation

To evaluate the taxonomies generated by the three approaches, we compared the generated ontologies with the gold standard. Since class names vary across the taxonomies produced by the three methods, lexical comparison of class names is inappropriate. We therefore adopted Taxonomic Precision (TP) and Taxonomic Recall (TR) [10, 25] to measure the quality of the generated ontologies.

TP and TR are based on the Semantic Cotopy (SC), the set of super- and sub-concepts (is_A relations) of a concept in an ontology:

$$\begin{aligned} SC(c,O)=\{c_{i}|c_{i}\in C\wedge (c_{i} \leqslant c \vee c \leqslant c_{i})\} \end{aligned}$$
(6)

The Common Semantic Cotopy (CSC), which avoids the influence of lexical precision on the taxonomic measurement, is defined as:

$$\begin{aligned} CSC(c_{i},O_{1},O_{2})=\{c_{j}|c_{j}\in C_{1} \cap C_{2} \wedge (c_{j} \leqslant _{C_{1}} c_{i} \vee c_{i} \leqslant _{C_{1}} c_{j})\} \end{aligned}$$
(7)

The ontology-level \(TP_{CSC}\) and \(TR_{CSC}\) average the concept-level precision \(TP_{CSC}(c,c,O_{1},O_{2})\), i.e., the fraction of the common semantic cotopy of \(c\) in \(O_{1}\) that also appears in its common semantic cotopy in \(O_{2}\) [10]:

$$\begin{aligned} TP_{CSC}(O_{1},O_{2})=\frac{1}{|C_{O_{1}} \cap C_{O_{2}}|} \sum _{c \in C_{O_{1}} \cap C_{O_{2}}}TP_{CSC}(c,c,O_{1},O_{2}) \end{aligned}$$
(8)
$$\begin{aligned} TR_{CSC}(O_{1},O_{2})=TP_{CSC}(O_{2},O_{1}) \end{aligned}$$
(9)

Taxonomic F-measure \((TF)\) calculates the harmonic mean of \(TP_{CSC}\) and \(TR_{CSC}\):

$$\begin{aligned} TF(O_{1},O_{2})=\frac{2*TR_{CSC}(O_{1},O_{2})*TP_{CSC}(O_{1},O_{2})}{TR_{CSC}(O_{1},O_{2})+TP_{CSC}(O_{1},O_{2})} \end{aligned}$$
(10)
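A minimal sketch of the CSC-based measures, assuming each ontology is given as an acyclic child-to-parent mapping and that the concept-level \(TP_{CSC}(c,c,O_{1},O_{2})\) is the fraction of the common semantic cotopy of \(c\) in \(O_{1}\) that is also in its common semantic cotopy in \(O_{2}\) [10].

```python
def ancestors(c, parent):
    """Strict super-concepts of c in an acyclic child -> parent map."""
    out = set()
    while c in parent:
        c = parent[c]
        out.add(c)
    return out

def descendants(c, parent):
    """Strict sub-concepts of c."""
    return {x for x in parent if c in ancestors(x, parent)}

def csc(c, o1, o2):
    """Common semantic cotopy of c in o1, restricted to concepts shared with o2."""
    shared = (set(o1) | set(o1.values())) & (set(o2) | set(o2.values()))
    return {x for x in ancestors(c, o1) | descendants(c, o1) if x in shared}

def tp_csc(o1, o2):
    shared = (set(o1) | set(o1.values())) & (set(o2) | set(o2.values()))
    scores = []
    for c in shared:
        cot1 = csc(c, o1, o2)
        if cot1:
            scores.append(len(cot1 & csc(c, o2, o1)) / len(cot1))
    return sum(scores) / len(scores) if scores else 0.0

def tf(o1, o2):
    p, r = tp_csc(o1, o2), tp_csc(o2, o1)
    return 2 * p * r / (p + r) if p + r else 0.0

gold = {"diseases:272": "Endocrine", "diseases:2013": "diseases:272"}
generated = {"diseases:2013": "Endocrine", "diseases:272": "Endocrine"}
print(tp_csc(generated, gold), tp_csc(gold, generated), tf(generated, gold))
```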

5.4 Results

Fig. 5. The running time of the three algorithms

We used UPGMA, Subsumption, and EXT to generate ontologies in our experiment. For EXT, the \(\alpha \) and the \(\beta \) are both set to \(0.5\).

Figure 5 shows the running times of the three algorithms. Subsumption is the fastest algorithm: 116 ms for 100 concepts, 391 ms for 300, and 2683 ms for 500. The running times of the three methods increase as the number of concepts used increases.

Fig. 6. The quality of the ontologies generated by the three algorithms

Figure 6 shows how ontology quality changes with the number of concepts involved. For TR, UPGMA achieves the highest score among the three algorithms: 11% with 500 concepts. The low TR values are partly due to the decomposition of the concept-entity matrix, which limits the coverage of entities available for building the ontologies. For TP, Subsumption and EXT show almost the same precision, reaching 24% with 500 concepts. The F-measure shows that quality increases as more concepts are used.

Fig. 7. The average TP, TR, and F-measure of the ontologies generated by the three algorithms, evaluated against the sub-structure of the gold standard. The sub-structure contains only the entities retained after reducing the dimensions of the RDF documents

To obtain the F-measure shown in Fig. 7, we used only those entities of the gold standard that were covered by the new, decomposed concept-entity matrix. The figure shows that UPGMA performs worse than the other two algorithms on F-measure but best on recall, that EXT outperforms both Subsumption and UPGMA on F-measure, and that Subsumption achieves the best precision. Considering the running times of the three algorithms as well, we regard EXT as the best algorithm for our purpose.

Each hierarchical document clustering method adapted in this paper builds an ontology with more layers than the gold-standard ontology, which lowers the measured quality of the generated ontology. However, the number of layers can be reduced by adjusting the parameters of the algorithms, an improvement we leave for future work.

6 Conclusion

In this paper, we introduced a framework for generating an ontology from Linked Data or other RDF resources based on hierarchical document clustering algorithms, which are commonly used to generate concept hierarchies from retrieved documents or social tags. We implemented the three most classic and popular hierarchical document clustering methods to generate ontologies in our framework and evaluated them in a preliminary experiment, which shows that EXT is the best algorithm for building an ontology from RDF resources. We also learned that the quality of the generated ontology is affected by the number of concepts used to represent the entities and to formalize the classes in the ontology, and we suggested improving ontology quality by reducing the number of layers in the ontology.