S-RDF: A New RDF Serialization Format for Better Storage Without Losing Human Readability

Dongo, Irvin; Chbeir, Richard

doi:10.1007/978-3-030-33246-4_16


Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 11877)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"


Abstract

Nowadays, RDF data is becoming increasingly popular on the Web due to advances in the Semantic Web and the Linked Open Data initiatives. Several works focus on transforming relational databases to RDF by storing the related data in the N-Triples serialization format. However, these approaches do not take into account the existing normalization of their databases, since the N-Triples format allows data redundancy and does not enforce any normalization by itself. Moreover, the most widely used and recommended serialization formats, such as RDF/XML, Turtle, and HDT, either offer high human readability but waste storage capacity, or focus on storage while providing low human readability. To overcome these limitations, we propose a new serialization format called S-RDF. By considering the structure (graph) and the values of the RDF data separately, S-RDF reduces the duplication of values by using unique identifiers. Results show an important improvement over the existing serialization formats in terms of storage (up to 71.66% w.r.t. N-Triples) and human readability.




1 Introduction

For the Semantic Web, RDF is the common format to describe resources, which are abstractions of entities (documents, abstract concepts, persons, companies, etc.) of the real world. It was developed by Ora Lassila and Ralph Swick in 1998 [15]. RDF uses triples in the form of $\langle \texttt{subject}, \texttt{predicate}, \texttt{object} \rangle$ expressions, also named statements, to provide relationships among resources.

Currently, the amount of RDF data available on the Web is increasing rapidly due to the promotion of the Semantic Web and the Linked Open Data (LOD) initiatives [20]. Governments, organizations, and research communities take part in the LOD initiatives, publishing their data to enable more flexible data integration, increase data quality, and provide new services [10]. Since RDF does not restrict how data is represented, several RDF serializations are available in the literature [11]. For instance, RDF/XML is historically the first W3C standard, which serializes the RDF graph ($\langle \texttt{subject}, \texttt{predicate}, \texttt{object} \rangle$) into XML. Other serializations, such as Turtle and N3, are also highly recommended [18]. In the literature, several works have been proposed to convert different datasets to RDF/OWL. The works in [12, 14, 17, 23, 26] propose to convert XML data into RDF using XPath expressions, XSD Schemas, DTD^{Footnote 1}, etc. Other works [3, 4, 9, 13, 19, 21, 22, 24] address the RDF conversion of relational database models in order to publish large quantities of information and link them to the Web. However, the currently adopted serialization formats mainly take a document-centric view to increase human readability, at the cost of significant storage space and bandwidth [11]. In essence, these formats do not control data redundancy by definition, which also affects the conceptual model. The authors in [25] address the syntactic redundancy of the data by applying a normalization methodology. Other authors, as in [11], propose a binary representation format called HDT, which reduces data redundancy but decreases the human readability of the information.

To overcome these limitations, we propose a new serialization format called S-RDF, which represents the RDF graph structure and the values separately for better human readability. This serialization can manage medium to large datasets by reusing identifiers (keys) extracted from them. Moreover, storage is reduced and some graph properties (e.g., degree centrality measure^{Footnote 2}) can be easily analyzed. We validated our serialization format through several experiments. Results show an improvement over the existing serialization formats in terms of storage (up to 71.66% with respect to N-Triples) and human readability.

The rest of this paper is organized as follows. In Sect. 2, we present a motivating scenario to better illustrate the needs. Section 3 surveys the related literature. Terminologies and definitions are presented in Sect. 4. Section 5 describes our serialization format. In Sect. 6, we present the experiments conducted to evaluate the compression rate and the human readability. Finally, we present conclusions in Sect. 7.

2 Motivating Scenario

As mentioned previously, RDF data can be represented in different ways (serializations), i.e., stored in a file system using several formats. In order to illustrate the limitations of existing serialization formats, we consider a scenario in which the information of Listing 1 is shared on the Web. This listing shows four School entities: S0991, S0992, S0993, and S0994, which have information such as rdf:type, ins:name, ins:postalCode, and ins:established.

Table 1 shows the serialization formats defined by the W3C (RDF/XML, Turtle, N-Triples, and N3). These formats take a document-centric view, since their data can be read and understood by humans; however, for data that generates a graph of considerable depth (more than three), readability is reduced. For instance, in our motivating scenario, one can easily observe the properties of the entity S0991 (ins:name, ins:postalCode, ins:established) and their respective values in the RDF/XML, Turtle, N-Triples, and N3 serialization formats, since the depth of the generated graph is 2. If blank nodes are added between the entity and the properties, readability decreases because the properties appear in another part of the document, and the reader must use the entity and the blank nodes as references to find the values.

Table 1. Serialization formats defined by the W3C

The RDFa, microdata, and JSON-LD serialization formats have been adopted as recommendations by the W3C. Table 2 shows and describes these three formats. Like the previous ones, they take a document-centric view; therefore, the same limitation applies. Moreover, since all of these serialization formats are document-centric, none of them takes storage into account. For small datasets this is not a concern, but for medium and large datasets, especially those obtained from relational databases, storage is a critical issue and has an impact on data exchange.

In general, the first RDF serialization formats (RDF/XML, Turtle) were proposed with a document-centric view, since RDF data mainly describes Web pages as resources (e.g., DBpedia from Wikipedia) and the number of properties used to describe them is limited (e.g., About: Eiffel Tower is described by 156 triples); however, as resources can be linked on the Web, the number of triples increases exponentially when considering datasets that use several resources. Therefore, a format able to describe a resource or a set of resources is needed, with storage as a main requirement for medium to large datasets.

Given the limitations of existing serialization formats, we have identified three main requirements according to the challenges and objectives of this work:

A high human readability for easy understanding of data;
A high compression ratio for minimizing the storage space and reducing exchange delays; and
A format oriented to describing medium to large datasets.

The following section describes and compares the related work by using the identified requirements.

Table 2. Serialization formats recommended by the W3C

3 Related Work

To the best of our knowledge, several serialization formats other than the ones adopted or recommended by the W3C have also been proposed in the literature. The authors in [8] present a binary RDF representation for large datasets. They represent the RDF graph in three logical components: (i) Header, (ii) Dictionary, and (iii) Triples. The size of the datasets is reduced, improving data sharing as well as querying and indexing performance. In [11], the authors improve on their previous work by up to 2 times for more structured datasets, with a significant improvement for semi-structured datasets such as DBpedia. Other works, as in [5], have focused on compressed representations for RDF querying. The authors highlight that the improvement is around 50% to 60% with respect to the original HDT. This format is designed for use on GPUs.

Table 3. Related work classification

Table 3 shows our related work classification. RDF/XML, Turtle, N3 and JSON-LD focus on human readability since their formats can be easily read by humans. HDT, HDT++ and TripleID-C have been designed to improve the storage, affecting the human readability. Note that none of the works satisfies all the defined requirements; thus, a new RDF serialization format is required.

Before describing our serialization format, the following section introduces some common terminologies and definitions in the context of RDF.

4 RDF Terminologies and Definitions

RDF commonly uses triples in the form of $\langle \texttt{subject}, \texttt{predicate}, \texttt{object} \rangle$ expressions/statements, to provide relationships among resources. The RDF triples can be composed of the following elements:

An IRI, which is an extension of the Uniform Resource Identifier (URI) scheme to a much wider repertoire of characters from the Universal Character Set (Unicode/ISO 10646), including Chinese, Japanese, and Korean character sets [7].
A Blank Node, representing a local identifier used in some concrete RDF syntaxes or RDF store implementations. A blank node can be associated with an identifier (rdf:nodeID) to be referenced in the local document, which is generated manually or automatically.
A Literal Node, representing values as strings, numbers, and dates. According to the definition in [6], it consists of two or three parts:
- A lexical form, being a Unicode string, which should be in Normal Form C^{Footnote 3} to assure that equivalent strings have a unique binary representation;
- A datatype IRI, being an IRI identifying a datatype that determines how the lexical form maps to an object value;
- A non-empty language tag as defined by “Tags for Identifying Languages” [2], if and only if the datatype IRI is http://www.w3.org/1999/02/22-rdf-syntax-ns#langString.

Table 4 shows the sets of RDF elements that we use in our formal approach description.

Table 4. Description of sets

After defining the sets of RDF elements, we formally describe a triple in Definition 1.

Definition 1

Triple (t): A Triple, denoted as t, is defined as an atomic structure consisting of a 3-tuple with a Subject (s), a Predicate (p), and an Object (o), denoted as $t:<s,p,o>$, where:

$s \in I \cup BN$ represents the subject to be described;
p is a predicate defined as an IRI in the form $namespace\_prefix{:}predicate\_name$, where $namespace\_prefix$ is a local identifier of the IRI, in which the predicate ($predicate\_name$) is defined. The predicate (p) is also known as the property of the triple;
$o \in I \cup BN \cup L$ describes the object. $\blacklozenge $

From Listing 1, one can observe the following triples with different RDF resources, properties, and literals:

$t_3$:
$t_4$:
$t_5$:

In this study, we also consider two types of properties (predicates):

Entity Property ( ep ): A predicate is an entity property when it is related to an IRI or a blank node. It is also known as Object property. For example, the property eni:locates is an entity property since it is related to a blank node.
Value Property ( vp ): A predicate is a value property when it is related to a literal node. It is also known as Datatype property. For example, the property ins:established is a value property since it is related to a literal node.

An RDF document is defined as an encoding of a set of triples, using a predefined serialization format complying with a W3C RDF standard, such as RDF/XML, Turtle, N3, etc. Additionally, we use the term entity, formally described in Definition 2, to identify an RDF resource (blank node or IRI).

Definition 2

Entity ( e ): An entity in an RDF document, denoted as e, is represented as an IRI or a blank node (e.g., School, Power Plant). $\blacklozenge $

For example, from Listing 1, the triple has the entity S0991.

In Definitions 3, 4, 5, and 6, we formally describe the respective sets of entities, entity properties, value properties, and literal values of an RDF document.

Definition 3

Entity Set ( E ): Given a set of triples $T=\{t_i \mid t_i:<s,p,o>\}$, the entities of each $t_i$ define the set of all entities, denoted as $E=\bigcup _{i=1}^{n}t_i.s \cup t_i.o \iff t_i.o \in I \cup BN$, where n is the number of triples. $\blacklozenge $

The entity set of Listing 1 according to Definition 3 is: E = {http://institutions.com/0.2/S0991, http://institutions.com/0.2/S0992, http://institutions.com/0.2/S0993, http://institutions.com/0.2/S0994, http://www.w3.org/2002/07/owl#Thing}.

Definition 4

Entity Properties ( EP ): Given a set of triples $T=\{t_i \mid t_i:<s,p,o>\}$, the predicates of all $t_i$ that are entity properties, define the set of entity properties, denoted as: $EP = \bigcup _{i=1}^{n}t_i.p \iff t_i.o \in I \cup BN$, where n is the number of triples. $\blacklozenge $

The entity properties from Listing 1 are: EP = $\{$rdf:type$\}$ or EP = {http://www.w3.org/1999/02/22-rdf-syntax-ns#type}.

Definition 5

Value Properties ( VP ): Given a set of triples $T=\{t_i \mid t_i:<s,p,o>\}$, the predicate of all $t_i$ that are value properties, define the set of value properties, denoted as: $VP= \bigcup _{i=1}^{n}t_i.p \iff t_i.o \in L$, where n is the number of triples. $\blacklozenge $

According to Definition 5, the value properties obtained from the triples of Listing 1 are: VP = {http://institutions.com/0.2/name, http://institutions.com/0.2/postalCode, http://institutions.com/0.2/established}.

Definition 6

Literal Values ( LV ): Given a set of triples $T=\{t_i \mid t_i:<s,p,o>\}$, the literals of all $t_i$ define the set of literal values, denoted as: $LV=\bigcup _{i=1}^{n}t_i.o \iff t_i.o \in L$, where n is the number of triples. $\blacklozenge $

According to Definition 6, the literal values from Listing 1 are: LV = {“Lycée de la Plage”, “64600”, “1985-05-19”, “Napoleon Business”, “64100”, “1986-12-19”, “École National de l’energie”, “64500”, “1984-11-21”, “Grande Ville School”, “64200”, “1977-08-22”}.
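Definitions 3 to 6 can be illustrated with a minimal Python sketch. Listing 1 is not reproduced in this text, so the triples for S0991 below are reconstructed from the sets given above; pairing “64600” and “1985-05-19” with S0991 is an assumption based on their order in LV. The `Triple` type, its `o_is_literal` flag, and the `extract_sets` function are hypothetical helpers for illustration, not part of S-RDF.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    s: str               # subject (IRI or blank node)
    p: str               # predicate (IRI)
    o: str               # object (IRI, blank node, or literal lexical form)
    o_is_literal: bool   # hypothetical flag: True when the object is a literal


INS = "http://institutions.com/0.2/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_THING = "http://www.w3.org/2002/07/owl#Thing"

# Assumed reconstruction of the S0991 triples of Listing 1.
triples = [
    Triple(INS + "S0991", RDF_TYPE, OWL_THING, False),
    Triple(INS + "S0991", INS + "name", "Lycée de la Plage", True),
    Triple(INS + "S0991", INS + "postalCode", "64600", True),
    Triple(INS + "S0991", INS + "established", "1985-05-19", True),
]


def extract_sets(triples):
    """Return (E, EP, VP, LV) following Definitions 3 to 6."""
    E, EP, VP, LV = set(), set(), set(), set()
    for t in triples:
        E.add(t.s)              # every subject is an entity
        if t.o_is_literal:
            VP.add(t.p)         # predicate pointing to a literal: value property
            LV.add(t.o)
        else:
            E.add(t.o)          # object is an IRI or blank node: entity
            EP.add(t.p)         # predicate pointing to an entity: entity property
    return E, EP, VP, LV


E, EP, VP, LV = extract_sets(triples)
```

With these four triples, E holds S0991 and owl:Thing, EP holds only rdf:type, and VP and LV hold three elements each.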

Table 5 summarizes the sets of entities, entity and value properties, and literal values of an RDF document.

Table 5. Description of sets of data of an RDF document

The following section presents and describes our new serialization format S-RDF.

5 S-RDF: Our Proposal

Our proposal relies mainly on a three-step process: (i) Extraction of RDF elements, where the input, an RDF document in any format, is analyzed in order to extract the set of entities (E), entity properties (EP), value properties (VP), and literal values (LV); (ii) RDF Sequence-Value generation, where entities, properties, and literal values are represented by unique identifiers (e.g., primary keys); and (iii) RDF Sequence-Structure generation, where relations among entities, which define the RDF graph structure, are expressed using the Sequence-Value representation. Thus, our serialization format (S-RDF) consists of two parts: (i) Value Representation and (ii) Structure. Figure 1 shows the framework of our proposal, composed of three modules that implement the three respective phases.

In Definition 7, we formally describe the Value Representation part of our RDF sequence, called RDF Sequence–Value. This representation associates a unique identifier with each entity, entity property, value property, and literal value, to be used in the structure representation of the sequence. We propose four different kinds of identifiers so that the type of data can be easily recognized in the second part of our sequence. Entities are represented by numbers of the decimal numeral system (base 10), starting from 1. For entity and value properties, the identifiers correspond to the hexavigesimal numeral system (base 26), with a domain of uppercase and lowercase alphabet letters, respectively. For literal values, the identifiers belong to the decimal numeral system, like those of entities, but the symbol “_” is added as a prefix. For instance, the 28th element is represented as “28” for entities, “AB” for entity properties, “ab” for value properties, and “_28” for literal values.
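The four identifier schemes can be sketched in Python as follows. The function names are hypothetical. The letter keys are generated with a bijective base-26 numbering (A=1, ..., Z=26, AA=27), which is an inference from the example in which the 28th element maps to “AB”; the text itself only states that the keys are hexavigesimal.

```python
import string


def to_letters(n: int, alphabet: str) -> str:
    """Bijective base-26 key for the n-th element (n starts at 1)."""
    if n < 1:
        raise ValueError("keys start at 1")
    digits = []
    while n > 0:
        n, rem = divmod(n - 1, 26)   # shift by one so that 26 maps to 'Z', not 'BA'
        digits.append(alphabet[rem])
    return "".join(reversed(digits))


def entity_key(n: int) -> str:
    return str(n)                                    # decimal, e.g. "28"


def entity_property_key(n: int) -> str:
    return to_letters(n, string.ascii_uppercase)     # uppercase letters, e.g. "AB"


def value_property_key(n: int) -> str:
    return to_letters(n, string.ascii_lowercase)     # lowercase letters, e.g. "ab"


def literal_key(n: int) -> str:
    return f"_{n}"                                   # decimal with "_" prefix, e.g. "_28"
```

Because the four key families use disjoint alphabets or prefixes, the type of an element can be recognized from its key alone, which is the property the structure part relies on.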

Definition 7

RDF Sequence–Value ( S-RDF-V ): Given a set of triples $T=\{t_i \mid t_i:<s,p,o>\}$, its RDF Sequence-Value is defined as a 4-tuple of:

$$\begin{aligned}&S-RDF-V_T =<\\&\qquad \qquad Entities=\{{\bigcup }_{i=1}^{m}<pk_i,e_i,type_i>\},\\&\qquad \qquad Entity\_properties=\{{\bigcup }_{j=1}^{n}<pk_j,ep_j>\},\\&\qquad \qquad Value\_properties=\{{\bigcup }_{k=1}^{o}<pk_k,vp_k,datatype_k>\},\\&\qquad \qquad Literal\_values=\{{\bigcup }_{l=1}^{p}<\_pk_l,lv_l>\}> \end{aligned}$$

where:

Entities is a set of 3-tuples, where:
- $m \in Z^+$ is the size of E;
- $pk_i \in Z^+$ is a key that represents $e_i$;
- $e_i \in E$ is an entity;
- $type_i\in \{1,2\}$ is the type of the entity $e_i$ (1=IRI, 2=blank node).
Entity_properties is a set of 2-tuples, where:
- $n \in Z^+$ is the size of EP;
- $pk_j \in \{A...Z\}$ is a key that represents $ep_{j}$;
- $ep_j \in EP$ is an entity property.
Value_properties is a set of 3-tuples, where:
- $o \in Z^+$ is the size of VP;
- $pk_k\in \{a...z\}$ is a key that represents $vp_k$;
- $vp_k \in VP$ is a value property;
- $datatype_k$ is the datatype of the property.
Literal_values is a set of 2-tuples, where:
- $p \in Z^+$ is the size of LV;
- $\_pk_l$ is a key that represents $lv_l$ and $pk_l\in Z^+$;
- $lv_l \in LV$ is a literal value.

$\blacklozenge $
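A minimal Python sketch of how the 4-tuple of Definition 7 could be assembled is given below. It is an illustration under stated assumptions, not the authors' implementation: the input tuple layout `(s, p, o, datatype)`, the function names, the key order (all subjects first, then entity objects), the `xsd:string` datatype for ins:name, and the detection of blank nodes by a `_:` prefix are all assumptions.

```python
import string

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_THING = "http://www.w3.org/2002/07/owl#Thing"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"  # assumed datatype
INS = "http://institutions.com/0.2/"


def alpha_key(n, alphabet):
    """Letter key for the n-th element (1 -> first letter, 27 -> two letters)."""
    key = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        key = alphabet[rem] + key
    return key


def build_s_rdf_v(triples):
    """Build (Entities, Entity_properties, Value_properties, Literal_values).

    Each input triple is (s, p, o, datatype); datatype is None when the
    object is an entity (IRI or blank node).
    """
    entities, entity_props, value_props, literals = {}, {}, {}, {}
    # Assumed key order: all subjects first, then entity objects.
    for s, _, _, _ in triples:
        entities.setdefault(s, len(entities) + 1)
    for _, p, o, datatype in triples:
        if datatype is None:
            entities.setdefault(o, len(entities) + 1)
            if p not in entity_props:
                entity_props[p] = alpha_key(len(entity_props) + 1, string.ascii_uppercase)
        else:
            if p not in value_props:
                key = alpha_key(len(value_props) + 1, string.ascii_lowercase)
                value_props[p] = (key, datatype)
            literals.setdefault(o, f"_{len(literals) + 1}")

    def entity_type(e):
        # Assumption: blank nodes carry the "_:" label prefix; 1=IRI, 2=blank node.
        return 2 if e.startswith("_:") else 1

    return (
        [(pk, e, entity_type(e)) for e, pk in entities.items()],
        [(pk, ep) for ep, pk in entity_props.items()],
        [(pk, vp, dt) for vp, (pk, dt) in value_props.items()],
        [(pk, lv) for lv, pk in literals.items()],
    )


triples = [
    (INS + "S0991", RDF_TYPE, OWL_THING, None),
    (INS + "S0991", INS + "name", "Lycée de la Plage", XSD_STRING),
    (INS + "S0992", RDF_TYPE, OWL_THING, None),
    (INS + "S0993", RDF_TYPE, OWL_THING, None),
    (INS + "S0994", RDF_TYPE, OWL_THING, None),
]
entities, entity_properties, value_properties, literal_values = build_s_rdf_v(triples)
```

Under the assumed ordering, the four schools receive keys 1 to 4 and owl:Thing receives key 5, rdf:type receives “A”, ins:name receives “a”, and “Lycée de la Plage” receives “_1”.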

Tables 6, 7, 8 and 9 represent the Entities, Entity_properties, Value_properties, and Literal_values of Listing 1. The first element of the S-RDF-V is composed of the entities of Table 6. As only one relation among entities is shown in Listing 1, the second element (entity properties) of the 4-tuple is: $\{<$A, http://www.w3.org/1999/02/22-rdf-syntax-ns#type$>\}$. The third and fourth elements are composed of the information in Tables 8 and 9, respectively. The set of triples (T), obtained from Listing 1, has the following RDF Sequence–Value:

Table 6. Entities

Table 7. Entity properties

Table 8. Value properties

Table 9. Literal values

The S-RDF-V represents the entities, properties, and values of an RDF document, but a document also has information about the relations among entities and literal values (node-edge-node); thus, the second part of our serialization, called RDF Sequence–Structure, is dedicated to representing the RDF graph structure. It consists of a 3-tuple, where the first element is an entity; the second element contains all entities related to the first element, each preceded by its respective entity property; and the last element represents value properties and their respective literal values. The RDF Sequence–Structure is defined in Definition 8.

Definition 8

RDF Sequence-Structure ( S-RDF-S ): Given a set of triples $T=\{t_i \mid t_i:<s,p,o>\}$, its RDF Sequence-Structure is defined as a set of 3-tuples:

For example, the set of triples (T) obtained from Listing 1, has the following RDF Sequence–Structure:

representing: entity “1” (http://institutions.com/0.2/S0991 according to Table 6) has an entity property “A” (http://www.w3.org/1999/02/22-rdf-syntax-ns#type according to Table 7), related to the entity “5” (http://www.w3.org/2002/07/owl#Thing according to Table 6). It also has a value property “a” (http://institutions.com/0.2/name according to Table 8) with a literal value “_1” ("Lycée de la Plage" according to Table 9), and so on.
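The reading just described can be sketched in Python. The concrete syntax of the RDF Sequence–Structure is not reproduced in this text, so the nested-tuple layout below is a hypothetical stand-in that follows the 3-tuple description (entity, related entities with their entity properties, value properties with their literal values). The key tables contain only the entries named above, and the function names are illustrative.

```python
from collections import Counter

# Key tables restricted to the entries named in the text (Tables 6 to 9).
ENTITIES = {
    1: "http://institutions.com/0.2/S0991",
    5: "http://www.w3.org/2002/07/owl#Thing",
}
ENTITY_PROPERTIES = {"A": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"}
VALUE_PROPERTIES = {"a": "http://institutions.com/0.2/name"}
LITERAL_VALUES = {"_1": "Lycée de la Plage"}

# One 3-tuple per entity:
# (entity, [(entity property, related entity)], [(value property, literal value)])
structure = [
    (1, [("A", 5)], [("a", "_1")]),
]


def expand(structure):
    """Translate the key-based structure back into full triples."""
    triples = []
    for entity, relations, values in structure:
        subject = ENTITIES[entity]
        for ep, target in relations:
            triples.append((subject, ENTITY_PROPERTIES[ep], ENTITIES[target]))
        for vp, literal in values:
            triples.append((subject, VALUE_PROPERTIES[vp], LITERAL_VALUES[literal]))
    return triples


def degree(structure):
    """Count entity-to-entity relations per entity key (incoming plus outgoing)."""
    counts = Counter()
    for entity, relations, _ in structure:
        for _, target in relations:
            counts[entity] += 1
            counts[target] += 1
    return counts
```

Since relations between entities sit in a dedicated position of each tuple, a degree count only needs to scan that position and never has to touch the IRIs or literal values.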

Once values and structure of the RDF data are defined, we formalize the whole RDF Sequence in Definition 9.

Definition 9

RDF Sequence ( S-RDF ): Given a set of triples $T=\{t_i \mid t_i:<s,p,o>\}$, its RDF Sequence is a 2-tuple consisting of two parts, defined as:

$$S-RDF(T) = \,<\!\!S-RDF-V(T), S-RDF-S(T)\!\!>$$

where:

S-RDF-V(T) is the set of values of T defined in Definition 7.
S-RDF-S(T) is the structure of T defined in Definition 8. $\blacklozenge $

The S-RDF is built to represent triples considering the structure and values separately. Thus, an analysis of either the data or the structure can be easily performed. Another benefit of this serialization format, with respect to other serialization formats, is the easy detection of some graph properties, such as the number of relationships (e.g., degree centrality measure). Moreover, the storage space is reduced, since an IRI that appears several times in an RDF document as a resource or property is represented by a unique short key (e.g., key:1 represents value: http://institutions.com/0.2/S0991 and key:A represents value: http://www.w3.org/1999/02/22-rdf-syntax-ns#type, respectively). This new serialization format can be considered part of the RDF partition strategies, where the models improve storage and querying; however, when the repository is exported/outsourced with those strategies, the format is still the same (e.g., RDF/XML, Turtle). Our serialization is a new way to represent data to be shared on the Web, improving storage without losing readability.

In the following section, we evaluate our S-RDF with respect to the current serialization formats.

6 Experimental Evaluation

6.1 Experimental Environment and Datasets

In order to evaluate and validate our serialization format, we developed a desktop and online^{Footnote 4} prototype system based on Java and Jena^{Footnote 5} to manage the RDF data. Experiments were undertaken on a MacBook Pro with a 2.2 GHz Intel Core(TM) i7 and 16 GB of memory, running macOS Mojave and using a Sun JDK 1.7 programming environment.

Our prototype was used to perform several experiments to evaluate the viability and the compression rate of our approach in comparison with the works proposed in the literature. To do so, we considered two datasets:

Data 1: the DBpedia person data^{Footnote 6} with 16,842,176 triples; and
Data 2: the DBpedia geo coordinates^{Footnote 7} with 151,205 triples.

Note that some of the serialization formats described in the related work section (e.g., RDFa, HDT++) were not evaluated, since no available tools can manage a huge quantity of triples; the existing tools are mainly document-oriented converters (e.g., Easy-RDF^{Footnote 8}, RDF-Translator^{Footnote 9}). For our readability test, the HDT and HDT++ formats were not analyzed, since they have a binary representation and cannot be read by humans.

The tests performed to evaluate our proposal are described below.

Table 10. Related work comparison for Data 1

6.2 Evaluation

Test 1: We randomly chose 50,000 triples from Data 1 in order to measure the compression rate of the data with respect to the size of the input (6,102,029 bytes). Table 10 shows the results obtained for this test. The HDT serialization format clearly outperforms the others (82.3936%), since it was created to minimize storage. However, our serialization also achieves a good result (71.6564%) without sacrificing human readability, as the binary representation of HDT does. JSON-LD has the highest compression rate (39.0276%) among the W3C recommendation formats.

For Data 2, we also chose 50,000 triples, with a size of 7,356,637 bytes. Table 11 shows results similar to those of Data 1. HDT obtained the best result with 75.6508%, while our serialization format obtained 70.7767%. The JSON-LD serialization format has a compression rate of 59.6130% with respect to the input size.
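To make the reported percentages concrete, the following Python sketch assumes that the compression rate is the relative size reduction with respect to the input, i.e., (1 - output size / input size) × 100. This definition is an assumption, since the text only says that the rate is measured with respect to the size of the input; the function names are hypothetical.

```python
def compression_rate(input_bytes: int, output_bytes: int) -> float:
    """Percentage of the input size saved by the serialization (assumed definition)."""
    return (1 - output_bytes / input_bytes) * 100


def implied_output_size(input_bytes: int, rate_percent: float) -> int:
    """Output size in bytes implied by a reported compression rate."""
    return round(input_bytes * (1 - rate_percent / 100))


# Data 1 sample: 6,102,029 bytes of input, 71.6564% reported for S-RDF.
s_rdf_size = implied_output_size(6_102_029, 71.6564)
```

Under this assumed definition, the S-RDF output for the Data 1 sample would occupy roughly 1.73 MB.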

Table 11. Related work comparison for Data 2

Test 2: Since no benchmark model for readability is available in the literature to compare the existing serialization formats, we propose three questions related to several aspects of the RDF structure: (i) the first question is about relations, which can help the end user recognize important nodes according to the context; (ii) the second one is related to terminal nodes; and (iii) the third one to literal values. The questions are as follows:

1. Is the resource X the most related one of the data?
2. Is the resource Y a terminal node in the data?
3. How many literal values has the resource Z?

where X, Y, and Z are resources that belong to the set of triples used to evaluate this test (see Listing 2).

In this test, we evaluated our human readability criterion by surveying 40 people who hold undergraduate or postgraduate degrees in computer science^{Footnote 10}. The participants evaluated the serialization formats through the three previous questions, choosing one option to answer each: Yes, No, or “I do not know” for the first two questions, and a value from 1 to 5 or “I do not know” for the third one.

Table 12. Number of correct, incorrect, and ambiguous values of each question per serialization format

To evaluate the results, we calculated the F-measure, based on the Recall (R) and Precision (PR). These criteria are commonly adopted in information retrieval and are calculated as follows:

$$\begin{aligned} \mathbf{PR} = \dfrac{A}{A+B} \in \left[ 0,1 \right] \qquad \mathbf{R} = \dfrac{A}{A+C} \in \left[ 0,1 \right] \qquad \mathbf{F\text{-}measure} = \dfrac{2 \times PR \times R}{PR+R} \in \left[ 0,1 \right] \end{aligned}$$

where A is the number of correct answers; B is the number of wrong answers; and C is the number of “I do not know” options selected by the participants.
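The three measures can be computed as in the Python sketch below. The counts used in the example (30 correct, 6 wrong, 4 “I do not know”) are hypothetical and are not taken from Table 12; the function names are also illustrative.

```python
def precision(a: int, b: int) -> float:
    """PR = A / (A + B): correct answers over all definite answers."""
    return a / (a + b) if a + b else 0.0


def recall(a: int, c: int) -> float:
    """R = A / (A + C): correct answers over correct plus "I do not know"."""
    return a / (a + c) if a + c else 0.0


def f_measure(a: int, b: int, c: int) -> float:
    """Harmonic mean of Precision and Recall."""
    pr, r = precision(a, b), recall(a, c)
    return 2 * pr * r / (pr + r) if pr + r else 0.0


# Hypothetical counts for 40 participants: 30 correct, 6 wrong, 4 "I do not know".
pr = precision(30, 6)
r = recall(30, 4)
f = f_measure(30, 6, 4)
```

Note that a format with many “I do not know” answers is penalized through the Recall, while wrong answers lower the Precision, so the F-measure drops in both cases.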

Table 12 shows the results obtained for this evaluation. For Question 1, the N3 serialization format obtained the best Precision (84.62%), while our serialization format obtained 84.00%. RDF/XML and JSON-LD obtained the lowest Precision (22.73% and 36.36%, respectively). Regarding the F-measure, we observe that Turtle, N3, and our proposal (S-RDF) help users identify graph properties such as the centrality measure, since they obtained a high result (over 68.00%). For Question 2, which concerns identifying terminal nodes, most of the serialization formats obtained a similar F-measure ($\approx $61.00%), but for the RDF/XML format the F-measure was 43.14% due to its low Recall (35.48%). A low Recall can be interpreted as the serialization format not being easily readable for the user. For Question 3, Turtle obtained the best F-measure (73.02%), while S-RDF obtained 68.85%. By analyzing the answers, we noticed that some people mistook an entity property and its respective value for a literal value, since they only counted the number of elements associated with the entity.

Table 13 shows the global results of this test. In this table, we can identify two groups: G1 (RDF/XML, N-Triples, and JSON-LD), with an F-measure around 43.00%, and G2 (Turtle, N3, and S-RDF), with a value around 68.00%. One reason for the low F-measure obtained by G1 is that these formats were created to maintain interoperability among systems, using XML and JSON formats, for example. The results demonstrate that our serialization format (S-RDF) can improve storage without losing the human-readability criterion.

Table 13. Total number of correct, incorrect and ambiguous values per serialization format

7 Conclusion

In this paper, we propose a new serialization format, called S-RDF, which represents the RDF graph structure and values separately. This format focuses on human readability, storage, and data redundancy to represent medium and large datasets. We evaluated our serialization format in terms of compression rate and human readability with respect to the state of the art. Results show a high compression rate without losing human readability, which is an advantage over the serialization formats created to minimize storage. According to the survey evaluation, S-RDF makes it easy to identify the resources with the most relations in the RDF graph (degree centrality measure) by identifying the entity with the largest number of entity properties.

We are currently working on normalization methods over the S-RDF in order to provide a unique and deterministic output for similar inputs.

Notes

1. Document Type Definition (DTD) defines the structure and the legal elements and attributes of an XML document.
2. Centrality identifies the most related nodes within a graph, i.e., those with a high number of relations.
3. It is one of the four Unicode normalization forms, which consists of a Canonical Decomposition followed by a Canonical Composition - http://www.unicode.org/reports/tr15/.
4. S-RDF: http://rdf-sequence.sigappfr.org.
5. Jena is a Java framework for building Semantic Web applications. It provides extensive Java libraries that help developers write code that handles RDF, RDFS, RDFa, OWL and SPARQL in line with published W3C recommendations - https://jena.apache.org/about_jena/about.html.
6. Information about persons extracted from the English and German Wikipedia, represented by the FOAF vocabulary - http://wiki.dbpedia.org/Downloads2015-10.
7. Geographic coordinates extracted from Wikipedia - https://wiki.dbpedia.org/downloads-2016-10.
8. Easy-RDF converter: http://www.easyrdf.org/converter.
9. RDF-Translator: https://rdf-translator.appspot.com.
10. The form is available here: https://forms.gle/DNMfsp5LL3nw1hW9A.

References

Microdata to RDF - Second Edition - Transformation from HTML+Microdata to RDF. https://www.w3.org/TR/microdata-rdf/ (2014). Accessed 01 July 2019
Phillips, M.D.A.: Tags for identifying languages. https://tools.ietf.org/html/bcp47. Accessed 01 July 2019
Bornea, M.A., et al.: Building an efficient RDF store over a relational database. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, pp. 121–132. ACM, New York (2013)
Google Scholar
Būmans, G., Čerāns, K.: RDB2OWL: A practical approach for transforming RDB data into RDF/OWL. In: Proceedings of the 6th International Conference on Semantic Systems, I-SEMANTICS 2010, pp. 25:1–25:3. ACM, New York (2010)
Google Scholar
Chantrapornchai, C., Makpaisit, P.: TripleiD-C: low cost compressed representation for RDF query processing in GPUs. In: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2018, pp. 261–270. ACM, New York (2018)
Google Scholar
Cyganiak, R., Wood, D., Lanthaler, M.: RDF 1.1 concepts and abstract syntax. Technical report (2014). Accessed 06 Dec 2016
Google Scholar
Duerst, M., Suignard, M.: Internationalized resource identifiers (IRIs). Technical report, Microsoft Corporation (2004)
Google Scholar
Fernández, J.D.: Binary RDF for scalable publishing, exchanging and consumption in the web of data. In: Proceedings of the 21st International Conference on World Wide Web, WWW 2012 Companion, pp. 133–138. ACM, New York (2012)
Google Scholar
Goasdoué, F., Manolescu, I., Roatiş, A.: Getting more RDF support from relational databases. In: Proceedings of the 21st International Conference on World Wide Web, WWW 2012 Companion, pp. 515–516. ACM, New York (2012)
Google Scholar
Hausenblas, M., Ding, L., Peristeras, V.: Linked open government data. IEEE Intell. Syst. 27, 11–15 (2012)
Hernández-Illera, A., Martínez-Prieto, M.A., Fernández, J.D.: Serializing RDF in compressed space. In: 2015 Data Compression Conference, pp. 363–372, April 2015
Huang, J.-Y., Lange, C., Auer, S.: Streaming transformation of XML to RDF using XPath-based mappings. In: Proceedings of the 11th International Conference on Semantic Systems, SEMANTICS 2015, pp. 129–136. ACM, New York (2015)
Konstantinou, N., Kouis, D., Mitrou, N.: Incremental export of relational database contents into RDF graphs. In: Proceedings of the 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14), WIMS 2014, pp. 33:1–33:8. ACM, New York (2014)
Lacoste, D., Sawant, K.P., Roy, S.: An efficient XML to OWL converter. In: Proceedings of the 4th India Software Engineering Conference, ISEC 2011, pp. 145–154. ACM, New York (2011)
Lassila, O., Swick, R.R.: Resource description framework (RDF) model and syntax specification. World Wide Web Consortium (1998)
Kellogg, G., Lanthaler, M., Lindström, N., Sporny, M., Longley, D.: JSON-LD 1.0, A JSON-based Serialization for Linked Data, W3C Recommendation 16 January 2014 (2014). https://www.w3.org/TR/json-ld/. Accessed 27 Oct 2017
O’Connor, M.J., Das, A.: Acquiring OWL ontologies from XML documents. In: Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP 2011, pp. 17–24. ACM, New York (2011)
Patel-Schneider, P.F., Hayes, P.J.: RDF 1.1 Semantics, W3C Recommendation 25 February 2014 (2014). https://www.w3.org/TR/rdf11-mt/#literals-and-datatypes. Accessed 01 July 2019
Salas, P.E., Marx, E., Mera, A., Viterbo, J.: RDB2RDF plugin: relational databases to RDF plugin for eclipse. In: Proceedings of the 1st Workshop on Developing Tools As Plug-ins, TOPI 2011, pp. 28–31. ACM, New York (2011)
Hawke, S., Archer, P., Herman, I.: W3C semantic web activity (2001). https://www.w3c.org/2001/sw/. Accessed 06 Dec 2018
Sequeda, J.F., Arenas, M., Miranker, D.P.: On directly mapping relational databases to RDF and OWL. In: Proceedings of the 21st International Conference on World Wide Web, WWW 2012, pp. 649–658. ACM, New York (2012)
Stefanova, S., Risch, T.: Scalable reconstruction of RDF-archived relational databases. In: Proceedings of the Fifth Workshop on Semantic Web Information Management, SWIM 2013, pp. 5:1–5:4. ACM, New York (2013)
Thuy, P.T.T., Lee, Y.-K., Lee, S.: DTD2OWL: automatic transforming XML documents into OWL ontology. In: Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human, ICIS 2009, pp. 125–131. ACM, New York (2009)
Thuy, P.T.T., Thuan, N.D., Han, Y., Park, K., Lee, Y.-K.: RDB2RDF: completed transformation from relational database into RDF ontology. In: Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication, ICUIMC 2014, pp. 88:1–88:7. ACM, New York (2014)
Ticona-Herrera, R., Tekli, J., Chbeir, R., Laborie, S., Dongo, I., Guzman, R.: Toward RDF normalization. In: Johannesson, P., Lee, M.L., Liddle, S.W., Opdahl, A.L., López, Ó.P. (eds.) ER 2015. LNCS, vol. 9381, pp. 261–275. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25264-3_19
Vion-Dury, J.-Y.: Using RDFS/OWL to ease semantic integration of structured documents. In: Proceedings of the 2013 ACM Symposium on Document Engineering, DocEng 2013, pp. 189–192. ACM, New York (2013)

Author information

Authors and Affiliations

Univ. Bordeaux, ESTIA, Bidart, France
Irvin Dongo
Electrical and Electronics Engineering Department, Universidad Católica San Pablo, Arequipa, Peru
Irvin Dongo
Univ. Pau & Pays Adour, E2S/UPPA, LIUPPA, EA3000, Anglet, France
Richard Chbeir

Authors

Irvin Dongo
Richard Chbeir

Corresponding author

Correspondence to Irvin Dongo.

Editor information

Editors and Affiliations

University of Lorraine, Vandoeuvre Les Nancy Cedex, France
Hervé Panetto
Trinity College Dublin, Dublin, Ireland
Christophe Debruyne
Universität der Bundeswehr München, Munich, Germany
Martin Hepp
Trinity College Dublin, Dublin, Ireland
Dave Lewis
Università degli Studi di Milano Crema, Crema, Italy
Claudio Agostino Ardagna
TU Graz, Graz, Austria
Robert Meersman

About this paper

Cite this paper

Dongo, I., Chbeir, R. (2019). S-RDF: A New RDF Serialization Format for Better Storage Without Losing Human Readability. In: Panetto, H., Debruyne, C., Hepp, M., Lewis, D., Ardagna, C., Meersman, R. (eds) On the Move to Meaningful Internet Systems: OTM 2019 Conferences. OTM 2019. Lecture Notes in Computer Science, vol 11877. Springer, Cham. https://doi.org/10.1007/978-3-030-33246-4_16

DOI: https://doi.org/10.1007/978-3-030-33246-4_16
Published: 11 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-33245-7
Online ISBN: 978-3-030-33246-4
eBook Packages: Computer Science, Computer Science (R0)
