
1 Introduction

The Web has evolved into a huge global information space consisting of interlinked datasets expressed in Semantic Web standard languages such as RDF, RDFS or OWL. An unprecedented amount of data is made available, enabling the design of new applications in many domains. But although access to these datasets is provided, their meaningful usage remains a challenge, because it requires some knowledge about their content. For example, a user willing to query a dataset needs to know some of the properties and types it contains. This information could be obtained by randomly browsing the data, but this would be a tedious task; with a schematic description of the dataset, writing the query would be straightforward. Such a schema would also be useful for other purposes, such as creating links between datasets: interlinking tools have been proposed, such as Silk, which was used to link Yago [14] to DBpedia [1], but they require type and property information about the datasets to generate the appropriate owl:sameAs links.

An RDF(S)/OWL dataset may provide some information about its schema, such as rdf:type declarations specifying the type of a given entity, but this information is not always complete. Even when datasets are automatically extracted, such as DBpedia, which is extracted from Wikipedia, type information can be missing. The experiments presented in [12] show that at most 63.7 % of the data have complete type declarations in DBpedia, and at most 53.3 % in YAGO. We argue that providing a schematic description of a dataset is essential for its meaningful exploitation. In the context of Web data, the notion of schema is understood as a guide, not as a strict representation to which the data must conform. Indeed, the languages used to format data on the Web do not impose any constraint on its structure: two entities having the same type may have different properties. In addition, an entity may have several types.

The goal of our work is to extract the schema describing an RDF(S)/OWL dataset, and we propose a deterministic and automatic approach for schema discovery. We consider a schema as a set of type and link definitions. Our contribution is twofold: (i) generating the types describing a dataset, along with a description of each of them called a type profile, where each property is associated with a probability; (ii) generating the semantic links between types as well as the hierarchical links through the analysis of type profiles. Our approach relies on a density-based clustering algorithm; it does not require any schema-related information and it can detect noisy instances. We have implemented the algorithms underlying our approach and we present some evaluation results on different datasets to show the quality of the resulting schema.

The paper is organized as follows. We introduce a formal definition of the problem in Sect. 2 and give some definitions related to entity description in Sect. 3. In Sect. 4, we present the core algorithms of our schema discovery approach. Section 5 presents the generation of overlapping types and Sect. 6 is devoted to link generation. Our evaluation results are presented in Sect. 7. We discuss related work in Sect. 8 and, finally, Sect. 9 concludes the paper.

2 Problem Statement

Consider the sets R, B, P and L representing resources, blank nodes (anonymous resources), properties and literals respectively. A dataset described in RDF(S)/OWL is defined as a set of triples \(D \subseteq (R \cup B) \times P \times (R \cup B \cup L)\). Graphically, a dataset is represented by a labeled directed graph G, where each node is a resource, a blank node or a literal, and where each edge from a node e to another node \(e'\) labeled with the property p represents the triple (e, p, \(e'\)) of the dataset D. In such an RDF(S)/OWL graph, we define an entity as a node corresponding to either a resource or a blank node, that is, any node apart from the ones corresponding to literals.

Figure 1(a) shows an example of such a dataset, related to conferences. We can see that some entities are described by the property rdf:type, defining the classes to which they belong, as is the case for “UVSQ”, which is defined as a university. For other entities, such as “WWW”, this information is missing. Two entities having the same type are not necessarily described by the same properties, as we can see for “UVSQ” and “MIT” in our example: both are associated with the “University” type but, unlike “UVSQ”, “MIT” has a “website” property.

Our problem can be stated as follows: given an RDF(S)/OWL dataset with incomplete schema information, how to infer type definitions and links between these types? As a result, we provide an extracted schema defined as follows:

Definition. The extracted schema S of a dataset D is composed of:

  • A set of possibly overlapping classes C = {\(C_{1},..., C_{n}\)}, where each class \(C_{i}\) corresponds to a set of entities in D and defines their type;

  • A set of links {\(p_{1},..., p_{m}\)}, such that each \(p_{i}\) is a property for which both the range and the domain are two types corresponding to two classes in C;

  • A set of hierarchical links involving two types corresponding to classes in C, expressing that one is the generic type of the other.

Fig. 1. Example of dataset and generated schema.

Table 1. Entity types in the conference dataset (Fig. 1(a)).

In order to define the schema describing a dataset, we first need to evaluate the similarity between entities and then group the similar ones into types, knowing that entities having the same type may be described by heterogeneous properties and that a given entity may have several types. In the example given in Fig. 1(a), seven type definitions would be inferred (see Fig. 1(b)); the respective sets of entities corresponding to each type are given in Table 1. As we can see in the results, the classes are not necessarily disjoint: the entity “P1” has three types, namely “Author”, “Organizer” and “Person”.

Schema discovery also requires the identification of links between types on the basis of the existing properties between entities, given the heterogeneity of the corresponding property sets. In our example, as entities “P4” and “ISWC” are related through the “made” property, there is a link labeled “made” between “Organizer” and “Conference” in the resulting schema. Finally, another problem we face is identifying types that generalize other types, which can be represented by rdfs:subClassOf properties, such as the ones linking the generic type “Person” to its specific types “Author” and “Organizer”.

3 Entity Description

We consider that an entity is described by different kinds of properties. Some of them are part of the RDF(S)/OWL vocabularies, and we will refer to them as primitive properties; the others are user-defined. We distinguish between these two kinds because not all properties should be used for the same purpose during type discovery. Some primitive properties may provide information about the schema of the dataset and could be applied to any entity; therefore they should not be considered when evaluating the similarity between entities. Some of these properties can be used to validate the resulting schema: for example, if two entities e and \(e'\) are linked with the owl:sameAs primitive property, then we can check that the types inferred for e and \(e'\) are the same. We define the description of an entity as follows.

Fig. 2. Example of entity description.

Definition. Given the set of primitive properties \(P_{P}\) and the set of user-defined properties \(P_{U}\) in the dataset D, an entity e is described by:

1. A user-defined property set \(e.P_{U}\) composed of properties \(p_{u}\) from \(P_{U}\), each one annotated by an arrow indicating its direction, and such that:

  • If \(\exists (e, p_{u}, e') \in D\) then \(\overrightarrow{p_{u}} \in e.P_{U}\);

  • If \(\exists (e', p_{u}, e) \in D\) then \(\overleftarrow{p_{u}} \in e.P_{U}\).

2. A primitive property set \(e.P_{P}\), composed of properties \(p_{p}\) from \(P_{P}\) with their values, such that:

  • If \(\exists (e, p_{p}, e^\prime ) \in D\) then \(p_{p} \in e.P_{P}\).

Figure 2 shows an example of entity description. In order to infer type definitions, entities are compared according to their structure. Our type discovery algorithm takes as input the set \(D_{U} = \{e_{i}.P_{U} : i = 1, ..., n\}\), where \(e_{i}.P_{U}\) represents the set of user-defined properties describing the entity \(e_{i}\). The set \(D_{P} = \{e_{i}.P_{P} : i = 1, ..., n\}\), where \(e_{i}.P_{P}\) represents the set of primitive properties describing the entity \(e_{i}\), can be used to validate the results of the algorithm and to specify type checking rules. This validation process is outside the scope of this paper.
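For concreteness, the construction of \(D_{U}\) and \(D_{P}\) from the triples can be sketched in Python as follows; this is a minimal illustration, with the primitive-property set reduced to a few examples and all names (entity_descriptions, is_literal) being illustrative.

```python
from collections import defaultdict

# Illustrative subset of the primitive properties P_P (RDF(S)/OWL vocabularies).
PRIMITIVE = {"rdf:type", "owl:sameAs", "rdfs:label"}

def is_literal(node):
    # Placeholder test; a real RDF library exposes the kind of a node directly.
    return isinstance(node, str) and node.startswith('"')

def entity_descriptions(triples):
    """Build e.P_U (directed user-defined properties) and e.P_P for each entity."""
    P_U, P_P = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        if p in PRIMITIVE:
            P_P[s].add((p, o))      # primitive properties are kept with their values
            continue
        P_U[s].add("->" + p)        # outgoing user-defined property
        if not is_literal(o):       # literals are not entities
            P_U[o].add("<-" + p)    # incoming user-defined property
    return P_U, P_P
```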

4 Generating Types

Our requirements for type discovery are as follows: (i) the number of types in the dataset is not known, (ii) an entity can have several types, and (iii) the dataset may contain noise. The most suitable grouping approach is density-based clustering [3] because it is robust to noise, deterministic, and finds classes of arbitrary shape, which is useful for datasets where entities are described with heterogeneous property sets. In addition, unlike algorithms based on k-means or k-medoids, it does not require the number of classes.

Algorithm 1. Density-based type discovery.

Our density-based algorithm has two parameters: \(\varepsilon \), representing the minimum similarity value for two entities to be considered as neighbors, and MinPts, representing the minimum number of entities in the neighborhood required for an entity to be a core entity [3] and to generate a type; MinPts is used to exclude outliers and noise. We use the Jaccard similarity to compute the closeness between two property sets \(e.P_{U}\) and \(e'.P_{U}\) describing respectively two entities e and \(e'\).
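Over these directed property sets, the Jaccard similarity is simply the size of their intersection over the size of their union; a direct transcription:

```python
def jaccard(pu1, pu2):
    """Jaccard similarity between two directed property sets e.P_U and e'.P_U."""
    if not pu1 and not pu2:
        return 0.0
    return len(pu1 & pu2) / len(pu1 | pu2)

# e.g. jaccard({"->name", "->made"}, {"->name", "<-heldBy"}) == 1/3
```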

Besides the set of types, we provide a description of each of them called a type profile. The profiles will be used to find links between types and to generate overlapping types, i.e. multiple types for an entity. A profile consists of a property vector where each property is associated with a probability. The profile corresponding to a type \(T_{i}\) is denoted \(TP_{i}\) = ((\(p_{i1}\), \(\alpha _{i1}\)), ..., (\(p_{in}\), \(\alpha _{in}\))), where each \(p_{ij}\) represents a property and each \(\alpha _{ij}\) represents the probability for an entity of \(T_{i}\) to have the property \(p_{ij}\). It is evaluated as the number of entities in \(T_{i}\) having the property \(p_{ij}\) over the total number of entities in \(T_{i}\). In Fig. 1, the profile of the “University” type is: (\((\overrightarrow{name},1),(\overrightarrow{website},0.5),(\overleftarrow{affiliation},1)\)). The probability of “website” is 0.5 because this property is defined for “MIT” but not for “UVSQ”.
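A type profile can thus be computed from a class by counting property occurrences. The sketch below recomputes it from scratch, whereas Algorithm 1 maintains it incrementally through UpdateTypeProfile:

```python
from collections import Counter

def type_profile(cluster, P_U):
    """Profile of a class: each property of its entities, mapped to the
    fraction of entities in the class that carry it."""
    counts = Counter()
    for e in cluster:
        counts.update(P_U[e])
    n = len(cluster)
    return {p: c / n for p, c in counts.items()}

# "University" class of Fig. 1:
# {"->name": 1.0, "->website": 0.5, "<-affiliation": 1.0}
```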

In order to group similar entities and build the profile of each class, we have adapted a density-based clustering algorithm (Algorithm 1); it uses the FindNeighbors function, which returns for a given entity all the entities whose similarity to it is at least \(\varepsilon \). For each new entity e such that the number of its neighbors is at least MinPts, we expand the class of e using the ExpandCluster function, defined in Algorithm 2, which finds all the entities belonging to the same class as e. Each time a new entity belonging to the current class is found, the type profile of this class is updated using the UpdateTypeProfile function, which adds the properties of this entity if they are not already in the profile and recomputes the probabilities.

Algorithm 2. ExpandCluster.
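A simplified sketch of the overall flow of Algorithms 1 and 2 is given below, reusing the jaccard helper above; it follows the DBSCAN scheme described in the text and omits the incremental profile update, which can be replayed afterwards with type_profile:

```python
NOISE = -1

def find_neighbors(e, entities, P_U, eps):
    """Entities other than e whose similarity to e is at least eps."""
    return [x for x in entities if x != e and jaccard(P_U[e], P_U[x]) >= eps]

def discover_types(entities, P_U, eps, min_pts):
    """Density-based type discovery: returns entity -> class id (NOISE = -1)."""
    label, cid = {}, 0
    for e in entities:
        if e in label:
            continue
        neighbors = find_neighbors(e, entities, P_U, eps)
        if len(neighbors) < min_pts:
            label[e] = NOISE                 # may later join a class as border
            continue
        label[e] = cid
        seeds = list(neighbors)
        while seeds:                         # ExpandCluster
            q = seeds.pop()
            if label.get(q) == NOISE:
                label[q] = cid               # former noise becomes a border entity
            if q in label:
                continue
            label[q] = cid
            q_neighbors = find_neighbors(q, entities, P_U, eps)
            if len(q_neighbors) >= min_pts:  # q is a core entity: keep growing
                seeds.extend(q_neighbors)
        cid += 1
    return label
```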

5 Generating Overlapping Types

An important aspect of RDF(S)/OWL datasets is that an entity may have several types [12]. A fuzzy clustering algorithm such as FCM or EM could be used to assign several types to an entity. However, such algorithms require the number of classes, as their grouping criterion is the similarity between an entity and the center of each class. This parameter cannot be defined in our context; we therefore propose to derive a set of disjoint classes first, and then to form overlapping or fuzzy classes by analyzing the type profiles.

Fig. 3. Generating fuzzy classes.

Recall that the importance of a property for a given type is captured by the associated probability. Figure 3(a) shows three classes generated by our algorithm: \(C_{1}\) = {P1}, \(C_{2}\) = {P4} and \(C_{3}\) = {P2, P3}, described respectively by the following type profiles:

  • \(TP_{1}\) = (\((\overrightarrow{name},1),(\overrightarrow{made},1),(\overrightarrow{affiliation},1),(\overleftarrow{heldBy},1),(\overrightarrow{holdsRole},1),(\overleftarrow{\#\_},1)\))

  • \(TP_{2}\) = (\((\overrightarrow{name},1),(\overrightarrow{made},1)\))

  • \(TP_{3}\) = (\((\overrightarrow{affiliation},1),(\overleftarrow{heldBy},1),(\overrightarrow{holdsRole},1),(\overleftarrow{\#\_},1)\)).

The entity “P1” has all the probability-1 properties of the two profiles \(TP_{2}\) and \(TP_{3}\); we can therefore assign the corresponding types to “P1”. Fuzzy classes are generated from the type profiles as follows: consider the type \(T_{i}\) described by the profile \(TP_{i}\); if all the properties p of \(TP_{i}\) having a probability \(\alpha = 1\) belong to another type profile \(TP_{k}\), then \(T_{i}\) is added to the types of the entities of class k. This is expressed by the following rule:

  • If \(\forall (p, \alpha )\) in \(TP_{i}\) with \(\alpha = 1\): \((p, \alpha )\) in \(TP_{k}\), then add \(T_{i}\) to the types of the entities of class k.

In our example (Fig. 3), the class containing “P1” (class k here) lies in the intersection of the classes \(C_{2}\) and \(C_{3}\). Note that instead of comparing only the properties having a probability of 1, we could extend the comparison to those having a probability greater than a given threshold.
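In code, this rule reads as a simple inclusion test between profiles; the sketch below uses plain membership in \(TP_{k}\) (the stricter reading, which also requires probability 1 in \(TP_{k}\), only changes the inner test):

```python
def overlapping_types(profiles):
    """profiles: class id -> {property: probability}.
    Returns class id -> set of additional type ids assigned by the rule."""
    extra = {k: set() for k in profiles}
    for i, tp_i in profiles.items():
        core = {p for p, a in tp_i.items() if a == 1.0}
        for k, tp_k in profiles.items():
            if k != i and core and core <= tp_k.keys():
                extra[k].add(i)   # entities of class k also receive type T_i
    return extra

# In Fig. 3, the probability-1 properties of TP_2 and TP_3 all appear in TP_1,
# so the entity "P1" (class 1) also receives the types T_2 and T_3.
```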

6 Generating Links

Besides type definitions, the links between these types are also important to understand the content of the dataset at hand, as shown in Fig. 1(b). We are interested in two kinds of links: semantic links, corresponding to user-defined properties, and hierarchical links, corresponding to the rdfs:subClassOf property.

Semantic links. Two types \(T_{i}\), \(T_{j}\) are linked by a property p if \(\overrightarrow{p}\) belongs to the properties of the type profile \(TP_{i}\) and \(\overleftarrow{p}\) belongs to the properties of the type profile \(TP_{j}\), as follows:

  • If \(\exists \) p: \((\overrightarrow{p}, \alpha _{i})\) in \(TP_{i}\) \(\wedge \) \((\overleftarrow{p}, \alpha _{j})\) in \(TP_{j}\), then add \(\overrightarrow{p}\) from \(T_{i}\) to \(T_{j}\).

These generated links are checked by finding two entities \(e \in T_{i}\) and \(e' \in T_{j}\) such that \((e, p, e') \in D\).
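A sketch of this generation over the profiles (the verification against D described above is omitted):

```python
def semantic_links(profiles):
    """profiles: type id -> {property: probability}, properties carrying
    their '->'/'<-' direction prefix. Returns triples (T_i, p, T_j) such
    that '->p' is in TP_i and '<-p' is in TP_j."""
    links = []
    for i, tp_i in profiles.items():
        for j, tp_j in profiles.items():
            for prop in tp_i:
                if prop.startswith("->") and "<-" + prop[2:] in tp_j:
                    links.append((i, prop[2:], j))
    return links

# In Fig. 1: "->made" in the "Organizer" profile and "<-made" in the
# "Conference" profile yield the link (Organizer, made, Conference).
```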

Algorithm 3. Hierarchical links generation.

Hierarchical links. Two types \(T_{i}\), \(T_{j}\) can be linked by a hierarchical property (rdfs:subClassOf), such as “Organizer” and “Person” in Fig. 1(b). We could use a hierarchical clustering algorithm on the entire dataset to generate these links, but this would be very costly: to find the best partition, the whole generated hierarchy has to be explored. In addition, the result would consist of many hierarchical links, not all of them meaningful. We instead generate the hierarchy from the type profiles, which is less expensive because the number of profiles is small compared to the size of the entire dataset. Our procedure is given in Algorithm 3. We have adapted an ascending hierarchical clustering algorithm; the profile of the generic type is built at each step of the hierarchy. We define a similarity measure between two type profiles \(TP_{i}\), \(TP_{j}\), inspired by the Jaccard similarity and based on the probability of a property \(p_{k}\) for two classes \(C_{i}\) and \(C_{j}\). It is defined as follows:

$$\begin{aligned} ProfileSim(TP_{i}, TP_{j}) = \frac{\sum _{p_{k} \in TP_{i} \cap TP_{j}} Prob_{i,j}(p_{k})}{\sum _{p_{k} \in TP_{i} \cup TP_{j}} Prob_{i,j}(p_{k})} \end{aligned}$$
(1)

where:

$$\begin{aligned} Prob_{i, j}(p_{k}) = \frac{ \alpha _{ik} \times |C_{i}| + \alpha _{jk} \times |C_{j}|}{ |C_{i}| + |C_{j}|} \end{aligned}$$
(2)

A generic type is defined from the two most similar type profiles at each level of the hierarchy. The corresponding profile is composed of all the properties of the two types; the probability of each property is calculated as in (2).
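Equations (1) and (2) translate directly into code; profiles are again maps from properties to probabilities, and n_i, n_j stand for the class sizes \(|C_{i}|\) and \(|C_{j}|\):

```python
def merged_prob(p, tp_i, tp_j, n_i, n_j):
    """Eq. (2): probability of p over the union of the two classes."""
    return (tp_i.get(p, 0.0) * n_i + tp_j.get(p, 0.0) * n_j) / (n_i + n_j)

def profile_sim(tp_i, tp_j, n_i, n_j):
    """Eq. (1): probability-weighted Jaccard similarity of two profiles."""
    union = tp_i.keys() | tp_j.keys()
    inter = tp_i.keys() & tp_j.keys()
    denom = sum(merged_prob(p, tp_i, tp_j, n_i, n_j) for p in union)
    num = sum(merged_prob(p, tp_i, tp_j, n_i, n_j) for p in inter)
    return num / denom if denom else 0.0

def generic_profile(tp_i, tp_j, n_i, n_j):
    """Profile of the generic type: all the properties of both sub-types,
    with probabilities computed as in Eq. (2)."""
    return {p: merged_prob(p, tp_i, tp_j, n_i, n_j)
            for p in tp_i.keys() | tp_j.keys()}
```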

Fig. 4. Eliminating generic type redundancy in a hierarchy.

This approach generates a hierarchy by grouping types in pairs to find their generic type. However, some generic types can be composed of more than two sub-types, as in Fig. 4(a). In this case, the user can easily detect redundant intermediate generic types and eliminate them (as in Fig. 4(b)), since the number of types in a dataset is generally not very high.

7 Evaluation

This section presents some experimental results for our approach. We have evaluated the quality of the generated schema using well-known information retrieval metrics on different real datasets, described in the following section.

7.1 Datasets

For our experiments, we have used three datasets: the Conference dataset, which exposes data about several Semantic Web conferences and workshops (1430 triples); the BNF dataset, which contains data about the French National Library (Bibliothèque Nationale de France; 381 triples); and a dataset extracted from DBpedia (19696 triples), considering the following types: Politician, SoccerPlayer, Museum, Movie, Book and Country.

7.2 Metrics and Experimental Methodology

In order to evaluate the quality of the results provided by our algorithms, we have extracted the existing type definitions from our datasets and considered them as a gold standard. We have then run our algorithm on the datasets without the type definitions and evaluated the precision and recall of the inferred types. We have annotated each inferred class \(C_{i}\) with the most frequent type label associated to its entities. For each type label \(L_{i}\) corresponding to type \(T_{i}\) in the dataset and each class \(C_{i}\) inferred by our algorithm, such that \(L_{i}\) is the label of \(C_{i}\), we have evaluated the precision \(P_{i}(T_{i}, C_{i}) = |T_{i} \cap C_{i}| / |C_{i}|\) and the recall \(R_{i}(T_{i}, C_{i}) = |T_{i} \cap C_{i}| / |T_{i}|\). We have set \(MinPts = 1\), so that an entity is considered as noise if it has no neighbors. For the Conference dataset, we have empirically set \(\varepsilon = 0.75\), which leads to very homogeneous classes. We have empirically determined that \(\varepsilon = 0.72\) provides a number of classes equal to the number of types initially declared in the BNF dataset; the value is \(\varepsilon = 0.5\) for the DBpedia dataset. Note that the determination of the similarity threshold \(\varepsilon \) is an open issue for clustering algorithms [3, 13].

To provide the overall quality of types, semantic and hierarchical links, we have used the precision and recall metrics. The number of generated classes is denoted k, and the number of entities in the dataset, n. To assess the overall type quality, each type is weighted according to its number of entities as follows:

$$P = \sum _{i=1}^{k} \frac{|C_{i}|}{n} \times P_{i}(T_{i}, C_{i}) \qquad \qquad R = \sum _{i=1}^{k} \frac{|C_{i}|}{n} \times R_{i}(T_{i}, C_{i})$$
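Assuming each inferred class \(C_{i}\) has been matched to the gold type \(T_{i}\) carrying the same label, these weighted scores can be computed as in the following sketch (names are illustrative):

```python
def weighted_quality(classes, gold):
    """classes, gold: matching ids -> sets of entities (C_i paired with T_i).
    Returns the size-weighted overall precision and recall."""
    n = sum(len(c) for c in classes.values())
    P = sum(len(c) / n * len(c & gold[i]) / len(c) for i, c in classes.items())
    R = sum(len(c) / n * len(c & gold[i]) / len(gold[i]) for i, c in classes.items())
    return P, R
```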

Precision and recall of the generated links are evaluated considering the true/false positives and the false negatives. We have compared the inferred schema to the one provided with the dataset when available, as for the BNF dataset. If no schema was provided, we have manually designed it based on the information provided in the dataset; to this end, we have built the set \(D_{P}\) as defined in Sect. 3.

7.3 Results

The quality of each discovered type in the Conference dataset is shown in Fig. 5(a). Our approach gives good precision and recall and detects types which were not declared in the dataset: classes 6, 10, 11 and 12 are manually labeled “AuthorList”, “PublicationPage”, “HomePage” and “City” respectively.

Fig. 5. Quality of the generated types (a) and overlapping types (b) in the Conference dataset.

In some cases, types have been inferred relying on incoming properties only. Indeed, for containers such as “AuthorList”, it is necessary to consider these properties, as such entities do not have any outgoing property. Classes 1 and 7 do not have a good precision because they contain entities with different types in the dataset; however, these types have the same structure, and it is therefore impossible to distinguish between them. The recall for “Person” is not good because it is split into three classes: class 8 represents persons who have both published and played a role in the conference; class 2 represents persons who have only played a role (e.g. Chair, Committee Member); class 5 represents persons who have only published. Overlapping types are generated based on the analysis of type profiles, which has led to the results shown in Fig. 5(b). Class 8 is associated with two types: the one of class 2 (manually labeled “Organizer”) and the one of class 5 (manually labeled “Author”), which indeed conforms to the entities of the dataset. Note that finding the labels of classes is an open problem that we will address in future work.

Fig. 6. Evaluation of schema discovery in Conference (a), BNF (b) and DBpedia (c).

We can see in Fig. 6 that the approach gives good precision and recall for the generated schema, composed of types, semantic links and hierarchical links. For the Conference dataset (see Fig. 6(a)), the precision is not maximal because of classes 1 and 7, and the recall is not maximal because the “Person” type is split in three, as discussed above. The results for the BNF (see Fig. 6(b)) and DBpedia (see Fig. 6(c)) datasets show that the assignment of types to entities achieves good precision and recall; the recall is not maximal because noisy instances were detected. For the BNF dataset, some of the semantic links were not declared in the provided schema, which is why the recall is not maximal; however, after checking the entities of the dataset, we found out that these semantic links were valid. For the DBpedia dataset, our algorithm was able to differentiate between entities of the two types “Politician” and “SoccerPlayer” even though they have similar property sets, as shown by the corresponding type profiles generated by our algorithm and presented below. The generated type profiles reflect the heterogeneity of the dataset: for example, only 6 % of the entities of “SoccerPlayer” have a “deathDate” outgoing property, and yet the generated grouping was good. The results achieved by our approach are good even when the dataset is heterogeneous.

  • Politician: \(<(\overrightarrow{name}, 1), (\overrightarrow{party}, 0.73), (\overrightarrow{children}, 0.21), (\overrightarrow{birthDate}, 0.94), (\overrightarrow{nationality}, 0.15), (\overleftarrow{successor}, 0.78), (\overrightarrow{deathDate}, 0.68), ...>\).

  • SoccerPlayer: \(<(\overrightarrow{name}, 1), (\overrightarrow{height}, 0.46), (\overrightarrow{surname}, 0.93), (\overrightarrow{birthDate}, 1), (\overrightarrow{nationalteam}, 0.86), (\overleftarrow{currentMember}, 0.8), (\overrightarrow{deathDate}, 0.06), ...>\).

Fig. 7. Hierarchical links generation for Conference (a), BNF (b) and DBpedia (c).

The hierarchical links generated for DBpedia are correct, as they conform to the existing rdfs:subClassOf declarations (see Fig. 7(c)). The BNF dataset has no hierarchical links; the values of both precision and recall are therefore null. When the similarity between two profiles is low, the semantics of the generic type is unclear; this is represented by a question mark in Fig. 7(b). The same holds for some of the hierarchical links generated for the Conference dataset (see Fig. 7(a)): a generic type has been generated for “Person” and “Organization”, but the similarity between their type profiles is low. Our algorithm could not identify a generic type for “HomePage” and “PublicationPage” because their type profiles do not share any property, which explains the lower recall.

8 Related Works

Schema discovery from semi-structured data has been addressed by several research works. In [16], an approximate DataGuide based on the COBWEB hierarchical clustering algorithm is proposed. Incoming and outgoing edges are considered in the same way, which can be a problem for RDF datasets as it does not differentiate between the domain and the range of properties. The resulting classes are disjoint, and the approach is not deterministic as it is based on COBWEB. In [8], several types are inferred for an entity, but only when a type is more general than another, such as “Employee” and “Person”. The approach distinguishes between incoming and outgoing edges: incoming edges are considered as roles and as potential labels for the inferred types. This is suitable for the OEM [11] model used in that approach, but not for RDF, where incoming edges do not necessarily reflect the type of an entity. The proposed algorithm requires a jump threshold, which is not easy to define as it depends on the regularity of the data; this parameter is not known in our context. The approach presented in [9] uses bottom-up grouping providing a set of disjoint classes; unlike in our approach, the number of classes is required. In [2], standard ascending hierarchical clustering is used to build structural summaries of linked data. Each instance is represented by its outgoing properties, and the property set of a class is the union of the properties of its entities, while in our approach the probability of each property is computed for a type. The algorithm provides disjoint classes; the hierarchical clustering tree is explored to assess the best cutoff level, which can be costly. SchemEX [6] finds the relevant data sources for a query by building an RDF triple index; unlike in our approach, rdf:type declarations are required to find classes. The approach in [5] adds structural information to a database from a set of available databases: it searches for a similar database among the existing ones in order to make design decisions.

Some works have addressed the problem of enriching an existing schema by adding more structure through RDF(S)/OWL primitives. SDType [12] enriches an entity with several types using inference rules, provided that these types exist in the dataset, and computes the confidence of each type for an entity. The focus of the approach is therefore on evaluating the relevance of the assigned types. In addition, rdfs:domain, rdfs:range and rdfs:subClassOf properties are required. The works in [4, 10] infer types for DBpedia only: [10] uses K-NN, and [4] finds the most appropriate type for an entity in DBpedia based on descriptions from Wikipedia and links with WordNet and the Dolce ontology. The statistical schema induction approach [15] enriches an RDF dataset with some RDFS/OWL primitives; however, type information must be provided. The approach in [17] uses ascending hierarchical clustering to form classes by exploiting the existing rdf:type declarations; it finds hierarchical links but is specific to the bio-medical ontologies DrugBank and Diseasome. In [7], a reverse engineering method dealing with the derivation of inheritance links embedded in a relational database is presented; decision rules for detecting existence dependencies and translating them into hierarchies among entities are defined.

9 Conclusion

We have proposed an approach for schema discovery in RDF(S)/OWL datasets. In order to generate several types for an entity, we have adapted a density-based clustering algorithm. Each generated type is described by a profile, consisting of a property vector where each property is associated with a probability. These type profiles are used to generate overlapping types as well as semantic and hierarchical links. Our experiments show that our approach achieves good quality results regarding both types and links in the generated schema, even when the entities are very heterogeneous, as in DBpedia. One important problem that we plan to address is the annotation of the inferred types: in addition to identifying a cluster of entities having the same type, it is also useful to find the labels which best capture the semantics of this cluster.