1 Introduction

For the Semantic Web, RDF is the common format to describe resources, which are abstractions of entities (documents, abstract concepts, persons, companies, etc.) of the real world. It was developed by Ora Lassila and Ralph Swick in 1998 [15]. RDF uses triples in the form of \(\langle \texttt {subject},\texttt {predicate},\) \(\texttt {object}\rangle \) expressions, also named statements, to provide relationships among resources.

Currently, the amount of RDF data available on the Web is increasing rapidly due to the promotion of the Semantic Web and the Linked Open Data (LOD) initiatives [20]. Governments, organizations, and research communities take part in the LOD initiatives, publishing their data to enable more flexible data integration, increase data quality, and provide new services [10]. Since RDF does not restrict how data is converted, several RDF serializations are available in the literature [11]. For instance, RDF/XML is historically the first W3C standard, which serializes the RDF graph (\(\langle \texttt {subject},\texttt {predicate},\) \(\texttt {object}\rangle \)) into XML. Other serializations, such as Turtle and N3, are also highly recommended [18]. In the literature, several works have been proposed to convert different datasets to RDF/OWL. The works in [12, 14, 17, 23, 26] propose to convert XML data into RDF using XPath expressions, XSD Schemas, DTD, etc. Other works [3, 4, 9, 13, 19, 21, 22, 24] address the RDF conversion of relational database models in order to publish large quantities of information and link it to the Web. However, the currently adopted serialization formats mainly take a document-centric view to increase human readability, while requiring considerable storage space and bandwidth [11]. In essence, these formats do not control data redundancy by definition, which also affects the conceptual model. The authors in [25] address the syntactic redundancy of the data by applying a normalization methodology. Other authors, as in [11], propose a binary representation format called HDT, which reduces data redundancy but decreases the human readability of the information.

To overcome these limitations, we propose here a new serialization format called S-RDF, which represents the RDF graph structure and the values separately for better human readability. This serialization is suited to managing medium-large datasets by reusing identifiers (keys) extracted from several ones. Moreover, the storage is reduced and some graph properties (e.g., the degree centrality measure) can be easily analyzed. We validated our serialization format through several experiments. Results show an improvement over the existing serialization formats in terms of storage (up to 71.66% with respect to N-Triples) and human readability.

The rest of this paper is organized as follows. In Sect. 2, we present a motivating scenario to better illustrate the needs. Section 3 surveys the related literature. Terminologies and definitions are presented in Sect. 4. Section 5 describes our serialization format. In Sect. 6, we present the experiments conducted to evaluate the compression rate and the human readability. Finally, we present conclusions in Sect. 7.

2 Motivating Scenario

As mentioned previously, RDF data can be represented in different ways (serializations), i.e., stored in a file system through several formats. In order to illustrate the limitations of existing serialization formats, we consider a scenario in which the information of Listing 1 is shared on the Web. This listing shows four School entities: S0991, S0992, S0993, and S0994, which have information such as rdf:type, ins:name, ins:postalCode, and ins:established.

[Listing 1]

Table 1 shows the serialization formats defined by the W3C (RDF/XML, Turtle, N-Triples, and N3). These formats take a document-centric view since their data can be read and understood by humans; however, for data that generates a graph with a considerable depth (more than three), the readability is reduced. For instance, in our motivating scenario, one can easily observe the properties of the entity S0991 (ins:name, ins:postalCode, ins:established) and their respective values in the RDF/XML, Turtle, N-Triples, and N3 serialization formats, since the depth of the generated graph is 2. If some blank nodes are added between the entity and the properties, the readability decreases, because the properties are found in another part of the document and the entity and blank nodes must be used as references to search for the values.

Table 1. Serialization formats defined by the W3C

The RDFa, microdata, and JSON-LD serialization formats are adopted as recommendations by the W3C. Table 2 shows and describes these three formats. Like the previous ones, they take a document-centric view; therefore, the same limitation is found. Moreover, since all these serialization formats are document-centric, none of them takes storage into account. For small datasets this is not a concern, but for medium and large datasets, especially the ones obtained from relational databases, storage is a critical issue and has an impact on data exchange.

In general, the first RDF serialization formats (RDF/XML, Turtle) were proposed with a document-centric view, since RDF data mainly describes Web pages as resources (e.g., DBpedia from Wikipedia) and the number of properties needed to describe them is limited (e.g., About: Eiffel Tower is described by 156 triples); however, as resources can be linked on the Web, the number of triples increases exponentially when considering datasets that use several resources. Therefore, a format able to describe a resource or a set of resources is needed, with storage as a main requirement for medium-large datasets.

Considering the limitations of existing serialization formats, we have identified three main requirements according to the challenges and objectives of this work:

  • A high-human readability for easy understanding of data;

  • A high compression ratio for minimizing the storage space and reducing exchange delays; and

  • A format oriented to describe medium-large datasets.

The following section describes and compares the related work by using the identified requirements.

Table 2. Serialization formats recommended by the W3C

3 Related Work

To the best of our knowledge, several serialization formats have also been proposed in the literature besides the ones adopted or recommended by the W3C. The authors in [8] present a binary RDF representation for large datasets. They represent the RDF graph in three logical components: (i) Header, (ii) Dictionary, and (iii) Triples. The size of the datasets is reduced, improving data sharing as well as querying and indexing performance. In [11], the authors improve their previous work by up to 2 times for more structured datasets, with a significant improvement for semi-structured datasets such as DBpedia. Other works, as in [5], have focused on a compressed representation for RDF querying. The authors highlight that the improvement is around 50% to 60% with respect to the original HDT. This format is designed to be used on GPUs.

Table 3. Related work classification

Table 3 shows our related work classification. RDF/XML, Turtle, N3 and JSON-LD focus on human readability since their formats can be easily read by humans. HDT, HDT++ and TripleID-C have been designed to improve the storage, affecting the human readability. Note that none of the works satisfies all the defined requirements; thus, a new RDF serialization format is required.

Before describing our serialization format, the following section introduces some common terminologies and definitions in the context of RDF.

4 RDF Terminologies and Definitions

RDF commonly uses triples in the form of \(\langle \texttt {subject},\texttt {predicate},\) \(\texttt {object}\rangle \) expressions/statements, to provide relationships among resources. The RDF triples can be composed of the following elements:

  • An IRI, which is an extension of the Uniform Resource Identifier (URI) scheme to a much wider repertoire of characters from the Universal Character Set (Unicode/ISO 10646), including Chinese, Japanese, and Korean character sets [7].

  • A Blank Node, representing a local identifier used in some concrete RDF syntaxes or RDF store implementations. A blank node can be associated with an identifier (rdf:nodeID), generated manually or automatically, to be referenced in the local document.

  • A Literal Node, representing values as strings, numbers, and dates. According to the definition in [6], it consists of two or three parts:

    • A lexical form, being a Unicode string, which should be in Normal Form C to ensure that equivalent strings have a unique binary representation;

    • A datatype IRI, being an IRI identifying a datatype that determines how the lexical form maps to an object value;

    • A non-empty language tag as defined by “Tags for Identifying Languages” [2], if and only if the datatype IRI is http://www.w3.org/1999/02/22-rdf-syntax-ns#langString.

Table 4 shows the sets of RDF’s elements that we use in our formal approach description.

Table 4. Description of sets

After defining the sets of RDF elements, we formally describe a triple in Definition 1.

Definition 1

Triple ( t ): A Triple, denoted as t, is defined as an atomic structure consisting of a 3-tuple with a Subject (s), a Predicate (p), and an Object (o), denoted as \(t:<s,p,o>\), where:

  • \(s \in I \cup BN\) represents the subject to be described;

  • p is a predicate defined as an IRI in the form \(namespace\_prefix{:}predicate\_name\), where \(namespace\_prefix\) is a local identifier of the IRI in which the predicate (\(predicate\_name\)) is defined. The predicate (p) is also known as the property of the triple;

  • \(o \in I \cup BN \cup L\) describes the object.    \(\blacklozenge \)
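The membership constraints of Definition 1 can be illustrated with a minimal Python sketch. The class names (`IRI`, `BlankNode`, `Literal`, `Triple`) and the default `xsd:string` datatype are assumptions introduced for illustration only; they are not part of the S-RDF prototype.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    lexical_form: str
    # Assumed default datatype; a real literal carries its own datatype IRI.
    datatype: str = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Triple:
    s: Union[IRI, BlankNode]
    p: IRI
    o: Union[IRI, BlankNode, Literal]

    def __post_init__(self):
        # Enforce s in I ∪ BN, p in I, and o in I ∪ BN ∪ L.
        if not isinstance(self.s, (IRI, BlankNode)):
            raise TypeError("subject must be an IRI or a blank node")
        if not isinstance(self.p, IRI):
            raise TypeError("predicate must be an IRI")
        if not isinstance(self.o, (IRI, BlankNode, Literal)):
            raise TypeError("object must be an IRI, a blank node, or a literal")


INS = "http://institutions.com/0.2/"
t = Triple(IRI(INS + "S0991"), IRI(INS + "name"), Literal("Lycée de la Plage"))
```

A literal in the subject position, such as `Triple(Literal("x"), IRI(INS + "name"), Literal("y"))`, is rejected with a `TypeError`.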

From Listing 1, one can observe the following triples with different RDF resources, properties, and literals:

  • \(t_3\):

  • \(t_4\):

  • \(t_5\):

In this study, we also consider two types of properties (predicates):

  • Entity Property ( ep ): A predicate is an entity property when it is related to an IRI or a blank node. It is also known as Object property. For example, the property eni:locates is an entity property since it is related to a blank node.

  • Value Property ( vp ): A predicate is a value property when it is related to a literal node. It is also known as Datatype property. For example, the property ins:established is a value property since it is related to a literal node.

An RDF document is defined as an encoding of a set of triples, using a predefined serialization format complying with a W3C RDF standard, such as RDF/XML, Turtle, N3, etc. Additionally, we use the term entity, formally described in Definition 2, to identify an RDF resource (blank node or IRI).

Definition 2

Entity ( e ): An entity in an RDF document, denoted as e, is represented as an IRI or a blank node (e.g., School, Power Plant).    \(\blacklozenge \)

For example, from Listing 1, the triple has the entity S0991.

In Definitions 3, 4, 5, and 6, we formally describe the respective sets of entities, entity properties, value properties, and literal values of an RDF document.

Definition 3

Entity Set ( E ): Given a set of triples \(T=\{t_i \mid t_i:<s,p,o>\}\), the entities of each \(t_i\) define the set of all entities, denoted as \(E=\bigcup _{i=1}^{n}t_i.s \cup t_i.o \iff t_i.o \in I \cup BN\), where n is the number of triples.    \(\blacklozenge \)

According to Definition 3, the entity set of Listing 1 is: E = {http://institutions.com/0.2/S0991, http://institutions.com/0.2/S0992, http://institutions.com/0.2/S0993, http://institutions.com/0.2/S0994, http://www.w3.org/2002/07/owl#Thing}.

Definition 4

Entity Properties ( EP ): Given a set of triples \(T=\{t_i \mid t_i:<s,p,o>\}\), the predicates of all \(t_i\) that are entity properties, define the set of entity properties, denoted as: \(EP = \bigcup _{i=1}^{n}t_i.p \iff t_i.o \in I \cup BN\), where n is the number of triples.    \(\blacklozenge \)

The entity properties from Listing 1 are: EP = \(\{\)rdf:type\(\}\) or EP = {http://www.w3.org/1999/02/22-rdf-syntax-ns#type}.

Definition 5

Value Properties ( VP ): Given a set of triples \(T=\{t_i \mid t_i:<s,p,o>\}\), the predicate of all \(t_i\) that are value properties, define the set of value properties, denoted as: \(VP= \bigcup _{i=1}^{n}t_i.p \iff t_i.o \in L\), where n is the number of triples.    \(\blacklozenge \)

According to Definition 5, the value properties obtained from the triples of Listing 1 are: VP = {http://institutions.com/0.2/name, http://institutions.com/0.2/postalCode, http://institutions.com/0.2/established}.

Definition 6

Literal Values ( LV ): Given a set of triples \(T=\{t_i \mid t_i:<s,p,o>\}\), the literals of all \(t_i\) define the set of literal values, denoted as: \(LV=\bigcup _{i=1}^{n}t_i.o \iff t_i.o \in L\), where n is the number of triples.    \(\blacklozenge \)

According to Definition 6, the literal values from Listing 1 are: LV = {“Lycée de la Plage”, “64600”, “1985-05-19”, “Napoleon Business”, “64100”, “1986-12-19”, “École National de l’energie”, “64500”, “1984-11-21”, “Grande Ville School”, “64200”, “1977-08-22”}.

Table 5 summarizes the sets of entities, entity and value properties, and literal values of an RDF document.
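Definitions 3 to 6 amount to a single pass over the triples. The following Python sketch shows one way to compute the four sets; the tagged-tuple encoding of terms (`("iri", ...)`, `("bnode", ...)`, `("literal", ...)`) and the function name `rdf_sets` are assumptions made for this illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
INS = "http://institutions.com/0.2/"


def rdf_sets(triples):
    """Return (E, EP, VP, LV) for triples of the form (s, p, o).

    Subjects and objects are tagged tuples, e.g. ("iri", "...") or
    ("literal", "..."); predicates are plain IRI strings.
    """
    E, EP, VP, LV = set(), set(), set(), set()
    for s, p, o in triples:
        E.add(s)              # a subject is always an IRI or a blank node
        if o[0] == "literal":
            VP.add(p)         # predicate pointing to a literal: value property
            LV.add(o[1])
        else:
            E.add(o)          # object is an IRI or a blank node: an entity
            EP.add(p)         # predicate pointing to an entity: entity property
    return E, EP, VP, LV


# Two triples describing S0991, taken from the running example.
s0991 = ("iri", INS + "S0991")
triples = [
    (s0991, RDF_TYPE, ("iri", "http://www.w3.org/2002/07/owl#Thing")),
    (s0991, INS + "name", ("literal", "Lycée de la Plage")),
]
E, EP, VP, LV = rdf_sets(triples)
```

Here `EP` contains only rdf:type and `VP` only ins:name, mirroring the split between entity properties and value properties.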

Table 5. Description of sets of data of an RDF document

The following section presents and describes our new serialization format S-RDF.

5 S-RDF: Our Proposal

Our proposal mainly relies on a three-step process: (i) Extraction of RDF elements, where the input, an RDF document in any format, is analyzed in order to extract the set of entities (E), entity properties (EP), value properties (VP), and literal values (LV); (ii) RDF Sequence-Value generation, where entities, properties, and literal values are represented by unique identifiers (e.g., primary keys); and (iii) RDF Sequence-Structure generation, where the relations among entities, which define the RDF graph structure, are expressed using the Sequence-Value representation. Thus, our serialization format (S-RDF) consists of two parts: (i) Value Representation and (ii) Structure. Figure 1 shows the framework of our proposal, composed of three modules that materialize the three respective phases.

Fig. 1. Framework of our serialization format “S-RDF”

In Definition 7, we formally describe the Value Representation part of our RDF sequence, called RDF Sequence–Value. This representation associates a unique identifier with each entity, entity property, value property, and literal value, to be used in the structure representation of the sequence. We propose four different kinds of identifiers so that the type of data can be easily recognized in the second part of our sequence. The entities are represented by numbers of the decimal numeral system (base 10), starting from 1. Entity and value properties are both identified in the hexavigesimal numeral system (base 26), using uppercase and lowercase alphabet letters, respectively. For the literal values, the identifiers belong to the decimal numeral system, like those of entities, but the symbol “_” is added as a prefix. For instance, the 28\(^{th}\) element is represented as “28” for entities, “AB” for entity properties, “ab” for value properties, and “_28” for literal values.
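The four identifier schemes can be sketched in a few lines of Python. The example “AB” for the 28th element is consistent with a bijective base-26 numbering (A = 1, ..., Z = 26, AA = 27), which is what this sketch assumes; the function names are hypothetical.

```python
import string


def base26_key(n, alphabet):
    """Bijective base-26 encoding: 1 -> 'A', 26 -> 'Z', 27 -> 'AA', 28 -> 'AB'."""
    if n < 1:
        raise ValueError("identifiers start at 1")
    letters = []
    while n > 0:
        n, rem = divmod(n - 1, 26)   # shift by one because there is no zero digit
        letters.append(alphabet[rem])
    return "".join(reversed(letters))


def entity_key(n):
    return str(n)                                     # decimal, e.g. "28"


def entity_property_key(n):
    return base26_key(n, string.ascii_uppercase)      # uppercase, e.g. "AB"


def value_property_key(n):
    return base26_key(n, string.ascii_lowercase)      # lowercase, e.g. "ab"


def literal_key(n):
    return f"_{n}"                                    # decimal with "_" prefix


keys_for_28 = (
    entity_key(28),
    entity_property_key(28),
    value_property_key(28),
    literal_key(28),
)
```

Because each scheme uses a distinct alphabet or prefix, the kind of element behind a key can be recognized from the key alone.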

Definition 7

RDF Sequence–Value ( S-RDF-V ): Given a set of triples \(T=\{t_i \mid t_i:<s,p,o>\}\), its RDF Sequence-Value is defined as a 4-tuple of:

$$\begin{aligned}&S-RDF-V_T =<\\&\qquad \qquad Entities=\{{\bigcup }_{i=1}^{m}<pk_i,e_i,type_i>\},\\&\qquad \qquad Entity\_properties=\{{\bigcup }_{j=1}^{n}<pk_j,ep_j>\},\\&\qquad \qquad Value\_properties=\{{\bigcup }_{k=1}^{o}<pk_k,vp_k,datatype_k>\},\\&\qquad \qquad Literal\_values=\{{\bigcup }_{l=1}^{p}<\_pk_l,lv_l>\}> \end{aligned}$$

where:

  • Entities is a set of 3-tuples, where:

    \(*\) \(m \in Z^+\), is the size of E.

    \(*\) \(pk_i \in Z^+\), is a key that represents \(e_i\).

    \(*\) \(e_i \in E\) is an entity.

    \(*\) \(type_i\in \{1,2\}\), is the type of the entity \(e_i\) (1=IRI, 2=blank node).

  • Entity_properties is a set of 2-tuples, where:

    \(*\) \(n \in Z^+\), is the size of EP.

    \(*\) \(pk_j \in \{A...Z\}\), is a key that represents \(ep_{j}\).

    \(*\) \(ep_j \in EP\) is an entity property.

  • Value_properties is a set of 3-tuples, where:

    \(*\) \(o \in Z^+\), is the size of VP.

    \(*\) \(pk_k\in \{a...z\}\), is a key that represents \(vp_k\).

    \(*\) \(vp_k \in VP\) is a value property.

    \(*\) \(datatype_k\) is the datatype of the property.

  • Literal_values is a set of 2-tuples, where:

    \(*\) \(p \in Z^+\), is the size of LV.

    \(*\) \(\_pk_l\) is a key that represents \(lv_l\) and \(pk_l\in Z^+\).

    \(*\) \(lv_l \in LV\) is a literal value.

   \(\blacklozenge \)
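As an illustration of Definition 7, the sketch below builds the four components of an RDF Sequence–Value from a list of triples. It is not the authors' implementation: the tagged-tuple encoding of terms, the assignment of keys in order of first appearance, and taking the datatype of a value property from the first literal seen are all assumptions.

```python
import string

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
INS = "http://institutions.com/0.2/"


def letters_key(n, alphabet):
    """Bijective base-26 key (1 -> first letter, 27 -> two letters)."""
    out = []
    while n > 0:
        n, rem = divmod(n - 1, 26)
        out.append(alphabet[rem])
    return "".join(reversed(out))


def build_sequence_value(triples):
    """Triples are (s, p, o); s and o are ("iri", v), ("bnode", v),
    or ("literal", lexical_form, datatype); p is an IRI string."""
    entities, entity_props, value_props, literals = {}, {}, {}, {}

    def add_entity(term):
        if term not in entities:
            kind = 1 if term[0] == "iri" else 2       # 1 = IRI, 2 = blank node
            entities[term] = (str(len(entities) + 1), kind)

    for s, p, o in triples:
        add_entity(s)
        if o[0] == "literal":
            if p not in value_props:
                key = letters_key(len(value_props) + 1, string.ascii_lowercase)
                value_props[p] = (key, o[2])
            if o[1] not in literals:
                literals[o[1]] = f"_{len(literals) + 1}"
        else:
            add_entity(o)
            if p not in entity_props:
                entity_props[p] = letters_key(len(entity_props) + 1, string.ascii_uppercase)

    return {
        "Entities": [(k, term[1], kind) for term, (k, kind) in entities.items()],
        "Entity_properties": [(k, p) for p, k in entity_props.items()],
        "Value_properties": [(k, p, dt) for p, (k, dt) in value_props.items()],
        "Literal_values": [(k, lv) for lv, k in literals.items()],
    }


s0991 = ("iri", INS + "S0991")
sv = build_sequence_value([
    (s0991, RDF_TYPE, ("iri", "http://www.w3.org/2002/07/owl#Thing")),
    (s0991, INS + "name", ("literal", "Lycée de la Plage", XSD_STRING)),
])
```

With only these two triples, owl:Thing receives key “2” rather than the key used for the full Listing 1, since keys depend on the order in which entities are registered.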

Tables 6, 7, 8 and 9 represent the Entities, Entity_properties, Value_properties, and Literal_values of Listing 1. The first element of the S-RDF-V is composed by the entities of Table 6. As only one relation among entities is shown in Listing 1, the second element (entity properties) of the 4-tuple is: \(\{<\)A, http://www.w3.org/1999/02/22-rdf-syntax-ns#type\(>\}\). The third and fourth elements are composed by the information in Tables 8 and 9, respectively. The set of triples (T), obtained from Listing 1, has the following RDF Sequence–Value:

figure n
Table 6. Entities
Table 7. Entity properties
Table 8. Value properties
Table 9. Literal values

The S-RDF-V represents the entities, properties, and values of an RDF document, but a document also has information about the relations among entities and literal values (node-edge-node); thus, the second part of our serialization, called RDF Sequence–Structure, is dedicated to representing the RDF graph structure. It consists of a set of 3-tuples, where the first element of each tuple is an entity; the second element contains all the entities related to the first element, each preceded by its respective entity property; and the last element represents the value properties and their respective literal values. The RDF Sequence–Structure is defined in Definition 8.

Definition 8

RDF Sequence-Structure ( S-RDF-S ): Given a set of triples \(T=\{t_i \mid t_i:<s,p,o>\}\), its RDF Sequence-Structure is defined as a set of 3-tuples:

figure o

For example, the set of triples (T) obtained from Listing 1, has the following RDF Sequence–Structure:

figure p

representing: entity “1” (http://institutions.com/0.2/S0991 according to Table 6) has an entity property “A” (http://www.w3.org/1999/02/22-rdf-syntax-ns#type according to Table 7), related to the entity “5” (http://www.w3.org/2002/07/owl#Thing according to Table 6). It also has a value property “a” (http://institutions.com/0.2/name according to Table 8) with a literal value “_1” (“Lycée de la Plage” according to Table 9), and so on.
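The grouping described above can be sketched in Python as follows. The in-memory shape of each 3-tuple, `(entity, relations, values)`, is an assumption based on the textual description; it does not reproduce the concrete S-RDF-S syntax. The key tables reuse the keys quoted in the text (“1”, “5”, “A”, “a”, “_1”).

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_THING = "http://www.w3.org/2002/07/owl#Thing"
S0991 = "http://institutions.com/0.2/S0991"
INS_NAME = "http://institutions.com/0.2/name"


def build_sequence_structure(triples, entity_keys, ep_keys, vp_keys, literal_keys):
    """Group triples by subject into (entity, relations, values) 3-tuples.

    Simplification: an object is treated as a literal when it appears in
    literal_keys; a full implementation would rely on typed RDF terms.
    """
    grouped = {}
    for s, p, o in triples:
        relations, values = grouped.setdefault(entity_keys[s], ([], []))
        if o in literal_keys:
            values.append((vp_keys[p], literal_keys[o]))      # value property, literal
        else:
            relations.append((ep_keys[p], entity_keys[o]))    # entity property, entity
    return [(entity, rel, val) for entity, (rel, val) in grouped.items()]


structure = build_sequence_structure(
    triples=[
        (S0991, RDF_TYPE, OWL_THING),
        (S0991, INS_NAME, "Lycée de la Plage"),
    ],
    entity_keys={S0991: "1", OWL_THING: "5"},
    ep_keys={RDF_TYPE: "A"},
    vp_keys={INS_NAME: "a"},
    literal_keys={"Lycée de la Plage": "_1"},
)
```

The resulting tuple for entity “1” lists the relation (“A”, “5”) and the value (“a”, “_1”), which reads exactly like the explanation given for the example.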

Once values and structure of the RDF data are defined, we formalize the whole RDF Sequence in Definition 9.

Definition 9

RDF Sequence ( S-RDF ): Given a set of triples \(T=\{t_i \mid t_i:<s,p,o>\}\), its RDF Sequence is a 2-tuple consisting of two parts, defined as:

$$S-RDF(T) = \,<\!\!S-RDF-V(T), S-RDF-S(T)\!\!>$$

where:

  • S-RDF-V(T) is the set of values of T defined in Definition 7.

  • S-RDF-S(T) is the structure of T defined in Definition 8.    \(\blacklozenge \)

The S-RDF is built to represent triples considering the structure and values separately. Thus, an analysis over either the data or the structure can be easily performed. Another benefit of this serialization format is that some graph properties, such as the number of relationships (e.g., the degree centrality measure), are easier to detect than in other serialization formats. Moreover, the storage space is reduced, since an IRI that appears several times in an RDF document as a resource or property is represented by a unique short key (e.g., key:1 represents value: http://institutions.com/0.2/S0991 or key:A represents value: http://www.w3.org/1999/02/22-rdf-syntax-ns#type, respectively). This new serialization format can be considered as part of the RDF partition strategies, whose models improve storage and querying; however, when such a repository is exported/outsourced, the format is still the same (e.g., RDF/XML, Turtle). Our serialization is a new way to represent data to be shared on the Web, improving the storage without losing the readability.
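To illustrate why the separated structure makes degree centrality easy to read off, the following Python sketch counts, for each entity, the entity-property edges it takes part in. The `(entity, relations, values)` tuple shape and the sample data (four schools typed as entity “5”, with illustrative value keys) are assumptions made for this example.

```python
from collections import Counter


def degree_centrality(structure):
    """Count incoming plus outgoing entity-property edges per entity key."""
    degree = Counter()
    for entity, relations, _values in structure:
        degree[entity] += 0            # list entities that have no relations too
        for _prop, target in relations:
            degree[entity] += 1        # outgoing edge
            degree[target] += 1        # incoming edge
    return degree


# Illustrative structure: literal values do not contribute to the degree.
structure = [
    ("1", [("A", "5")], [("a", "_1"), ("b", "_2"), ("c", "_3")]),
    ("2", [("A", "5")], [("a", "_4"), ("b", "_5"), ("c", "_6")]),
    ("3", [("A", "5")], [("a", "_7"), ("b", "_8"), ("c", "_9")]),
    ("4", [("A", "5")], [("a", "_10"), ("b", "_11"), ("c", "_12")]),
]
degree = degree_centrality(structure)
most_related = max(degree, key=degree.get)
```

Only the second element of each tuple has to be scanned, so the values part of the serialization can be ignored entirely for this kind of analysis.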

In the following section, we evaluate our S-RDF with respect to the current serialization formats.

6 Experimental Evaluation

6.1 Experimental Environment and Datasets

In order to evaluate and validate our serialization format, we developed a desktop and online prototype system based on Java and Jena to manage the RDF data. Experiments were undertaken on a MacBook Pro with a 2.2 GHz Intel Core(TM) i7 and 16 GB of memory, running macOS Mojave and using a Sun JDK 1.7 programming environment.

Our prototype was used to perform several experiments to evaluate the viability and the compression rate of our approach in comparison with the works proposed in the literature. To do so, we considered two datasets:

  • Data 1: the DBpedia person data with 16,842,176 triples; and

  • Data 2: the DBpedia geo coordinates with 151,205 triples.

Note that some of the serialization formats (e.g., RDFa, HDT++) described in the related work section were not evaluated since there are no tools available that can manage a huge quantity of triples; the available tools are mainly document-oriented converters (e.g., Easy-RDF, RDF-Translator). For our readability test, the HDT and HDT++ formats were not analyzed since they have a binary representation and cannot be read by humans.

We describe as follows the tests performed to evaluate our proposal.

Table 10. Related work comparison for Data 1

6.2 Evaluation

Test 1: We randomly chose 50,000 triples from Data 1 in order to measure the compression rate of the data with respect to the size of the input (6,102,029 bytes). Table 10 shows the results obtained for this test. The HDT serialization format clearly outperforms the other ones (82.3936%), since it was created to minimize the storage. However, our serialization also achieves a good result (71.6564%) without sacrificing human readability, as the binary representation of HDT does. The JSON-LD serialization has the highest compression rate (39.0276%) among the W3C recommendation formats.

For Data 2, we also chose 50,000 triples, with a size of 7,356,637 bytes. Table 11 shows results similar to those of Data 1. HDT obtained the best result with 75.6508%, while our serialization format obtained 70.7767%. The JSON-LD serialization format has a compression rate of 59.6130% with respect to the input size.
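The text does not spell out how the compression rate is computed. Since higher percentages are reported as better, a plausible reading is the relative space saving with respect to the input size; the Python sketch below makes that assumption explicit. The byte counts in the example are hypothetical and are not taken from Tables 10 or 11.

```python
def compression_rate(input_bytes, output_bytes):
    """Space saving in percent, assuming rate = (1 - output / input) * 100."""
    if input_bytes <= 0:
        raise ValueError("input size must be positive")
    return 100.0 * (1 - output_bytes / input_bytes)


# Hypothetical sizes: a 1,000-byte input serialized into 250 bytes.
example_rate = compression_rate(1000, 250)
```

Under this reading, a rate of 0% means that the output is as large as the input, and a negative rate means that the serialization is larger than the input.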

Table 11. Related work comparison for Data 2

Test 2: Since there is no benchmark model for readability available in the literature to compare the existing serialization formats, we propose three questions related to several aspects of the RDF structure: (i) the first question is about relations, which can help the end-user to recognize some important nodes according to the context, (ii) the second one is related to terminal nodes, and (iii) the third one to literal values. The questions are as follows:

  1. Is the resource X the most related one of the data?

  2. Is the resource Y a terminal node in the data?

  3. How many literal values has the resource Z?

where X, Y, and Z are resources that belong to the set of triples used to evaluate this test (see Listing 2).

[Listing 2]

In this test, we evaluated our human readability criterion by surveying 40 people who have undergraduate and postgraduate degrees in computer science. The participants evaluated the serialization formats through the three previous questions, choosing one option to answer each of them: “Yes”, “No”, or “I do not know” for the first two questions, and a value from 1 to 5 or the “I do not know” option for the third one.

Table 12. Number of correct, incorrect, and ambiguous values of each question per serialization format

To evaluate the results, we calculated the F-measure, based on the Recall (R) and Precision (PR). These criteria are commonly adopted in information retrieval and are calculated as follows:

$$\begin{aligned} \mathbf {PR} = \dfrac{A}{A+B} \in \left[ 0,1 \right] \quad \,\, \mathbf {R} = \dfrac{A}{A+C} \in \left[ 0,1 \right] \quad \,\, \textbf{F-measure} = \dfrac{2 \times PR \times R}{PR+R} \in \left[ 0,1 \right] \end{aligned}$$

where A is the number of correct answers; B is the number of wrong answers; and C is the number of “I do not know” options selected by the participants.
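The three measures can be computed directly from the answer counts, as in the Python sketch below. The function name and the sample counts (20 correct, 5 wrong, and 15 “I do not know” answers out of 40 participants) are hypothetical and are not taken from Table 12.

```python
def survey_scores(correct, wrong, unknown):
    """Return (precision, recall, f_measure) from the answer counts A, B, C."""
    pr = correct / (correct + wrong) if (correct + wrong) else 0.0
    r = correct / (correct + unknown) if (correct + unknown) else 0.0
    f = 2 * pr * r / (pr + r) if (pr + r) else 0.0
    return pr, r, f


# Hypothetical counts for one question and one serialization format.
pr, r, f = survey_scores(correct=20, wrong=5, unknown=15)
```

With these counts, the Precision is 20/25 = 0.80, the Recall is 20/35 ≈ 0.57, and the F-measure is 2/3 ≈ 0.67; many “I do not know” answers therefore lower the Recall even when the Precision is high.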

Table 12 shows the results obtained for this evaluation. For Question 1, the N3 serialization format obtained the best Precision (84.62%), while our serialization format obtained 84.00%. RDF/XML and JSON-LD obtained the lowest Precision (22.73% and 36.36%, respectively). Regarding the F-measure, we can observe that Turtle, N3, and our proposal (S-RDF) help users to identify some graph properties, such as the centrality measure, since they obtained a high result (over 68.00%). For Question 2, which concerns the identification of terminal nodes, most of the serialization formats obtained a similar F-measure (\(\approx \)61.00%), but for the RDF/XML format the F-measure was 43.14% due to the low Recall (35.48%). A low Recall can be interpreted as the serialization format not being easily readable for the user. For Question 3, Turtle obtained the best F-measure (73.02%), while for S-RDF the value was 68.85%. By analyzing the answers, we noticed that some people mistook the entity property and its respective value for a literal value, since they only counted the number of elements associated with the entity.

Table 13 shows the global results of this test. In this table, we can identify two groups: G1: RDF/XML, N-Triples, and JSON-LD, with an F-measure around 43.00%, and G2: Turtle, N3, and S-RDF, with a value around 68.00%. One of the reasons for the low F-measure obtained by G1 is that these formats were created to maintain interoperability among systems, using XML and JSON formats for example. The results demonstrate that our serialization format (S-RDF) can improve the storage without losing the human-readability criterion.

Table 13. Total number of correct, incorrect and ambiguous values per serialization format

7 Conclusion

In this paper, we propose a new serialization format, called S-RDF, which represents the RDF graph structure and values separately. This format focuses on human readability, storage, and data redundancy in order to represent medium and large datasets. We evaluated our serialization format in terms of compression rate and human readability with respect to the state of the art. Results show a high compression without losing human readability, which is an advantage over the serialization formats created to minimize storage. According to the survey evaluation, our S-RDF makes it easy to identify the resources with the most relations in the RDF graph (degree centrality measure) by identifying the entity with the largest number of entity properties.

We are currently working on normalization methods over the S-RDF in order to provide a unique and deterministic output for similar inputs.