
1 Introduction

Over the past several years, the Web has evolved from a Web of linked documents into a Web of linked data. As a result, a huge number of RDF statements, in the form of \(<Subject, Predicate, Object>\) triples, is currently available online through the Linked Open Data cloud thanks to different projectsFootnote 1 (e.g., DBpedia, LinkedGeoData, New York Times, etc.). These triples are usually stored in RDF datasets after being serialized into one of several machine-readable formats such as RDF/XML, N-Triples, Turtle, N3 or JSON-LD. Therefore, RDF descriptions can be represented in different ways and formats, as shown in Fig. 1.

Fig. 1.

RDF description (graph and serialization) examples

However, in different scenarios (e.g., collaborative RDF graph generation [6], automatic RDF serialization [12, 16], etc.), RDF descriptions might be verbose and contain several redundancies in terms of both the graph structure and the serialization format of the resulting RDF-based file. For instance, graphs 1 and 2 in Fig. 1, which have been created by two different users, describe the same RDF information even though they have different structures (e.g., duplication of nodes and edges). Additionally, even more redundancies and disparitiesFootnote 2 occur when serializing both graphs (e.g., the same RDF resource or blank node can be serialized in different ways, language and data-type declarations can be specified or omitted, namespaces can have different/duplicate prefixes, etc.). Consequently, such RDF redundancies and disparities naturally have a negative impact on the processing of RDF databases (including storage, querying, mapping, annotating, versioning, etc.).

In this paper, we address the problem of cleaning RDF descriptions by introducing a new method as a means of transforming different RDF descriptions of the same RDF statements into one single (normalized) representation. Our method targets RDF modeling on two levels: (1) the structure/graph (logical) level, by eliminating redundancies in RDF graphs, which is typically useful in graph-based RDF querying, mapping, and versioning applications, and (2) the serialization (physical) level, by eliminating redundancies and disparities in the syntactic structure, and adapting it to the application domain, in order to optimize storage space and loading time.

The rest of this article is organized as follows. Section 2 reviews background and related work in RDF normalization. Section 3 presents motivating examples, highlighting different normalization features left unaddressed by most existing approaches. Our method for normalizing RDF descriptions (graphs and serializations) is detailed in Sect. 4. In Sect. 5, we present our prototype and illustrate experimental results. We finally conclude in Sect. 6.

2 Background and Related Work

2.1 Basic Notions

Definition 1

(Statement [st] ). An RDF statement expresses a relationship between two resources, two blank nodes, or one resource and one blank node. It is defined as an atomic structure consisting of a triple with a Subject (s), a Predicate (p) and an Object (o), noted st:\(<s,p,o>\), w.r.t. a specific vocabulary V (see Definition 3), where: (a) s \(\in \) U \(\cup \) BN is the subject to be described (U is a set of Internationalized Resource Identifiers and BN is a set of Blank Nodes), (b) p \(\in \) U refers to the properties of the subject, and (c) o \(\in \) U \(\cup \) BN \(\cup \) L is the object (L is a set of Literals) \(\blacklozenge \)

The example presented in Fig. 1.1.a underlines 4 statements with different resources, properties, and blank nodes, such as \(st_{1}\): \(<\) http://www.univ-pau.fr, ex:nameProf, \(UX>\), \(st_{2}\): \(<UX\), ex:\(first\_name\), “Sebastien”^^xsd:string \(>\), \(st_{3}\): \(<UX\), \(ex:last\_name\), “Durand” \(>\), and \(st_{4}\): \(<\) http://www.univ-pau.fr, ex:lab, http://www.univ-pau.fr \(>\).

Definition 2

(RDF Graph [G] ). An RDF graph is a directed labeled graph made of a set of \(<\) s,p,o \(>\) statements, in which each statement is represented as a node-edge-node link \(\blacklozenge \)

In the remainder of the paper, “RDF graph” and “RDF logical representation” are used interchangeably.

Definition 3

(RDF Graph Vocabulary [V] ). An RDF Graph Vocabulary is the set of all values occurring in the RDF graph, i.e., V = U \(\cup \) L \(\cup \) BN \(\blacklozenge \)

Definition 4

(External Vocabulary [QN] ). An RDF External Vocabulary is a set of QNamesFootnote 3 (QN) to represent IRI references \(\{qn_1,qn_2,\)...\(,qn_n \}\). Each qn\(_{i}\) is a tuple \(<px_{i},ns_{i}>\) where px\(_{i}\) is a prefixFootnote 4 (e.g., foaf, ex, dc,...) and ns\(_{i}\) is a namespace \(\blacklozenge \)

For instance, QN = {(ex, http://example.org/stuff/1.0), (mypx, http://ucsp.edu.pe/)}, where “ex” is a standard prefix, “mypx” is a local prefix, and http://example.org/stuff/1.0 and http://ucsp.edu.pe/ are namespaces.

Definition 5

(RDF File [F] ). An RDF file is defined as an encoding of either a set of RDF statements or an RDF graph, using a predefined serialization format (e.g., RDF/XML, Turtle, N3, etc.) \(\blacklozenge \)

In the remainder of the paper, RDF file, RDF serialization and RDF physical representation are used interchangeably. Also, the following functions R, U, L, BN, ST, NS and Px will be used respectively to return all the Resources (IRIs and literals), IRIs, Literals, Blank Nodes, statements, namespaces and prefixes of a graph G or a file F.

2.2 Related Work

The need for RDF normalization has been identified and discussed in various domains, ranging over knowledge representation, data integration, and service and semantic data mediation. Yet, few existing studies have addressed the issues of logical (graph) and physical (syntax) RDF normalization.

Knowledge Representation, Integration and Semantic Mediation. Various approaches have been developed to normalize knowledge representation in RDF, namely in the bioinformatics domain [1, 6, 10, 11, 14]. In [14], the authors provide an approach to map LexGrid [11], a distributed network of lexical resources for storing, representing and querying biomedical ontologies and vocabularies, to various Semantic Web (SW) standards, namely RDF. They introduce the LexRDF project, which leverages LexGrid by mapping its concepts and properties to standard - normalized - RDF tagging, thus providing a unified RDF-based model (using a common terminology) for both the semantic and lexical information describing biomedical data. In a related study [10], the authors introduce the Bio2RDF project, aiming to create a network of coherent linked data across life sciences databases. The authors address URI normalization, a necessary prerequisite to building an integrated bioinformatics data warehouse on the SW, where resources are assigned URIs normalized around the bio2rdf.org namespace. Several approaches have developed semantic mediators (translators) to convert information from one data source to another, following the data format that each system understands [7, 8]. Most studies consider the original data to be well organized (normalized), assuming that the resulting RDF data would follow suit. Note that in most of these projects, issues of redundancy in RDF logical and syntactic representations are left unaddressed.

RDF Graph (Logical) Normalization. In [5], the authors discuss some of the redundancies which can occur in traditional RDF directed labeled graphs. Namely, an RDF graph edge label (i.e., a predicate) can occur redundantly as the subject or the object of another statement (e.g., \(<dbpedia:Researcher, dbpedia:Workplace, dbpedia:University>\) and \(<dbpedia:Workplace, rdf:type, dbpedia:Professional>\)). Hence, the authors in [5] introduce an RDF graph model as a special bipartite graph where RDF triples are represented as ordered 3-uniform hypergraphs, with edge nodes corresponding to the \(<subject, predicate, object>\) triplet constituents, ordered following the statement’s logical triplet ordering. The new model is proven effective in reducing the predicate-node duplication redundancies identified by the authors. In subsequent studies [3, 4], the authors address the problem of producing RDF normal forms and evaluating the equivalence among them. The studies in [3, 4] specifically target the RDFS vocabulary with a set of reserved words to describe the relationships between resources (e.g., rdfs:type, rdfs:range, rdfs:domain, etc.). They provide a full-fledged theoretical model including notions such as the RDF lean (minimal) graph, a graph preserving all URIs of its origin graph while having fewer blank nodes, and the RDF normal form, a minimal and unique representation of an RDF graph, among others. In [9, 13], the authors introduce an algorithm to transform an RDF graph into a standard form, arranged in a deterministic way, generating a cryptographically-strong hash identifier for the graph, or digitally signing it. The algorithm takes a JSON-LD input format and produces an output in N-Triples serialization, while relabeling certain nodes and erasing certain redundancies.
Our approach complements the latter studies by targeting logical (graph) redundancies that are out of the scope of [3–5, 9, 13], namely distinct edge (predicate) duplication, blank node (subject/object) duplication, and combined edge and node duplication.

RDF Syntax (Physical) Normalization. At the physical (syntactic) level, the authors in [16] introduce a method to normalize the serialization of an RDF graph using XML grammar (DTD) definitions. The process consists of two steps: (a) Defining an XML grammar (DTD) to which all generated RDF/XML serializations should comply, (b) Defining SPARQL query statements to query the RDF dataset in order to return results, consisting of serializations compliant with the grammar (DTD) at hand. This is comparable to the concept of semantic mediation using SPARQL queries [8]. Note that the SPARQL statements are automatically generated based on the grammar (DTD).

To sum up, our approach complements and builds on existing methods to normalize RDF information, namely [3–5, 9, 13, 16], by handling logical and physical redundancies and disparities which were (partially) unaddressed in the latter.

3 Motivating Example

We discuss here the motivations of our work, highlighting two different levels: (i) logical redundancies, and (ii) physical redundancies and disparities.

3.1 Logical (Graph) Redundancies

Consider the example given in Fig. 1.a.2. One can easily see several kinds of redundancies:

  • Problem 1 - Edge Duplication: where identical edges, designating identical RDF predicates, appear more than once.

  • Problem 2 - Node Duplication: where identical nodes, designating identical RDF subjects and/or objects, appear more than once; Fig. 1.a.2 highlights blank node duplication and literal duplication respectively.

3.2 Physical (Serialization) Disparities

Consider now Fig. 2.a which represents a possible serialization of the RDF graph in Fig. 1.a.2 encoded in the RDF/XML format. One can see that several types of redundancies and disparities are introduced: some are inherited from the logical level, while others appear at the physical (serialization) level:

Fig. 2.

RDF/XML serializations

  • Problem 3 - Namespace duplication: where two different prefixes are used to designate the same namespace (ex and ex1 in lines 4–5 in Fig. 2.a).

  • Problem 4 - Unused namespace: where a namespace is declared but never used in the body of the document (dc in line 3 in Fig. 2.a).

  • Problem 5 - Node order variation: i.e., node siblings in the RDF description might be ordered differently when serialized (e.g., nodes in lines 6–16 in Fig. 2.b follow the order of appearance of XML elements, which can be re-ordered differently in other serializations).

  • Problem 6 - Handling typed elements: object (literal) elements may or may not be typed (lines 9–12 in Fig. 2.b).

  • Problem 7 - Handling language tags: Distinguishing between identical literals having different language tags (lines 10–11 in Fig. 2.b).

One can clearly realize the compound effect of missing logical and physical normalization when contrasting the serialization in Fig. 1.b.1 with those in Fig. 2, all of which represent the same RDF information (cf. Sect. 1). In the following, we develop our proposal addressing the above problems. Due to space limitations, only RDF/XML format is used in what follows to illustrate RDF serialization results.

4 RDF Normalization Proposal

We first provide a set of definitions, rules, and properties before developing our process.

4.1 Definitions

Definition 6

(Extended Statement [st \(^{+}\) ] ). An extended statement is a more expressive representation of a statement (st), denoted st\(^{+}\):\(<s',p',o'>\) where:

  • s\('\):\(<s, ts>\) is composed of the subject value (s) and its type (ts \(\in \{u, bn\}\))

  • p\('\):\(<p, dt, lng>\) is composed of the predicate value (p), its datatype dt \(\in \) DTFootnote 5 \(\cup \) \(\{\bot \}\), and the language tag lng \(\in \) LangFootnote 6 \(\cup \) \(\{\bot \}\). \(\bot \) represents a “null” value.

  • o\('\):\(<o, to>\) is composed of the object value (o) and its type to \(\in \{u, bn, l\}\)

The following notation is adopted to represent an extended statement:

$$st^{+}: <s_{ts \in \{u,bn\}}, p_{dt}^{lng}, o_{to \in \{u,bn,l\}}\blacklozenge $$

Based on the example of Fig. 1.1.a, \(st_{1}\) becomes \(st_{1}^{+}\): \(<\) http://www.univ-pau.fr \(_{u}\), ex:nameProf\(_{\bot }^{\bot }\), UX\(_{bn}\) \(>\). The function ST \(^{+}\) will be used in the following to return all the (extended) statements of an RDF Description.
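To make the notation concrete, an extended statement can be sketched as a small immutable record. This is an illustrative Python encoding (field names are our own and not part of the formal model), with None playing the role of the null value \(\bot \):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)          # frozen: hashable, usable in sets
class ExtendedStatement:
    s: str                       # subject value
    ts: str                      # subject type: "u" (IRI) or "bn" (blank node)
    p: str                       # predicate value
    o: str                       # object value
    to: str                      # object type: "u", "bn" or "l" (literal)
    dt: Optional[str] = None     # predicate datatype; None models the null value
    lng: Optional[str] = None    # language tag; None models the null value

# st1+ from the running example
st1 = ExtendedStatement(s="http://www.univ-pau.fr", ts="u",
                        p="ex:nameProf", o="UX", to="bn")
```

Making the record hashable is what later allows equal statements to be detected via set membership.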

Definition 7

(Extended Statement Equality [ \(=_{st}\) ] ). Two extended statements \(st_{i}^{+}\) and \(st_{j}^{+}\) are equal, noted \(st_{i}^{+}=_{st} st_{j}^{+}\), if and only if: (1) \(st_{i}^{+}.s'\) = \(st_{j}^{+}.s'\), (2) \(st_{i}^{+}.p'\) =\(st_{j}^{+}.p'\), and (3) \(st_{i}^{+}.o'\) = \(st_{j}^{+}.o'\) \(\blacklozenge \)

In Fig. 3.a, \(st_{3}^{+} =_{st} st_{4}^{+}\) since they share the same subject (i.e., \(u_{1}\)), the same predicate (i.e., \(p_{4}\)), and the same object (i.e., \(u_{2})\).

Fig. 3.

RDF statement examples

Definition 8

(Outgoings [O] ). Given an extended RDF statement \(st_{i}^{+}\), the outgoings of \(st_{i}^{+}\), noted \(O(st_{i}^{+}\)), designate the set of extended statements having for subject the object element \(o_{i}'\) of \(st_{i}^{+}\):

$$ O(st_{i}^{+}:<s_{i}', p_{i}', o_{i}'>) = \{st_{j}^{+}:<o_{i}', p_{j}', o_{j}'>, ..., st_{n}^{+}:<o_{i}', p_{n}', o_{n}'> \}\blacklozenge $$

In Fig. 3.b, we identify the following outgoings of \(st_{1}^{+}\): \(O(st_{1}^{+}) = \{st_{5}^{+},st_{6}^{+}\}\) = \(\{<{bn\_1}_{bn}, {p_{2}}_{string}^{\bot }, {l_{1}}_{l}>, <{bn\_1}_{bn}, {p_{3}}_{\bot }^{fr}, {l_{2}}_{l}>\}\)

Definition 9

(Statement Containment [ \(\preceq \) ] ). An extended statement \(st_{i}^{+}\) is said to be contained in another extended statement \(st_{j}^{+}\), noted \(st_{i}^{+} \preceq st_{j}^{+}\), if: (1) O(\(st_{i}^{+}\)) \(\subseteq \) O(\(st_{j}^{+}\)), (2) both \(st_{i}^{+}\) and \(st_{j}^{+}\) have the same subject and predicate, and (3) the object type (to) in both statements is a blank node (bn) \(\blacklozenge \)

In Fig. 3.bFootnote 7, \(st_{1}^{+} \preceq st_{2}^{+}\) since they share the same subject (i.e., \(s_{1}\)) and the same predicate (i.e., \(p_{1}\)), and have O(\(st_{1}^{+}\)) \(\subseteq \) O(\(st_{2}^{+}\)) w.r.t. their outgoings.

Definition 10

(RDF Equality [ \(=_{RDF}\) ] ). Two RDF graphs \(G_{i}\) and \(G_{j}\) are equal, \(G_{i}\) \(=_{RDF}\) \(G_{j}\), if all the statements of \(G_{i}\) occur in \(G_{j}\) and vice versa. Similarly, two RDF files \(F_{i}\) and \(F_{j}\) are equal, \(F_{i}\) \(=_{RDF}\) \(F_{j}\), if: (1) their corresponding graphs are equal, and (2) \(F_{i}\) has the same encoding format (\(e_{i}\)) as \(F_{j}\) (\(e_{j}\)) \(\blacklozenge \)
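The graph part of Definition 10 amounts to a set comparison over statements; a minimal sketch, using plain (s, p, o) string triples as an illustrative encoding:

```python
# =RDF for graphs: equality holds when every statement of G_i occurs
# in G_j and vice versa, regardless of the order of statements.
def graphs_equal(g1, g2):
    return set(g1) == set(g2)

g_i = [("s1", "p1", "o1"), ("s1", "p2", "o2")]
g_j = [("s1", "p2", "o2"), ("s1", "p1", "o1")]   # same statements, reordered
```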

4.2 Normalization Rules

In this section, we provide a set of rules to resolve the problems listed in Sect. 3.

Solving Logical Redundancies. Given an input RDF graph G, logical redundancies related to node duplication, edge duplication and node/edge duplications (presented in Sect. 3.1) can be eliminated from G by applying the following transformation rules (proofs are provided in [15]):

Rule 1 - Statement Equality Elimination: It is designed to eliminate edge duplications and/or node duplications using equality between statements. More precisely, \(\forall st_{i}^{+}, st_{j}^{+} \in ST^{+}(G) \wedge i \ne j\), if \( st_{i}^{+} =_{st} st_{j}^{+} \Longrightarrow \) remove \(st_{j}^{+}\). Applying this rule on G produces an RDF Graph G’ where G’ \(=_{RDF}\) G.
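A minimal sketch of Rule 1, assuming extended statements are encoded as hashable ((s, ts), (p, dt, lng), (o, to)) tuples (an illustrative encoding, not a prescribed one):

```python
# Rule 1 sketch: drop statements equal to an already-seen one, keeping
# first occurrences and the original order.
def eliminate_equal_statements(statements):
    seen, result = set(), []
    for st in statements:
        if st not in seen:       # st_i =st st_j iff s', p' and o' all match
            seen.add(st)
            result.append(st)
    return result

g = [
    (("u1", "u"), ("p4", None, None), ("u2", "u")),
    (("u1", "u"), ("p4", None, None), ("u2", "u")),  # equal statement
]
g_norm = eliminate_equal_statements(g)   # a single statement remains
```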

Rule 2 - Statement Containment Elimination: It is designed to eliminate edge duplications between IRIs or blank nodes in the outgoing statements, and node duplications where the objects of extended statements are blank nodes linked to other outgoing statements. More precisely: \(\forall st_{i}^{+}, st_{j}^{+} \in ST^{+}(G) \wedge i \ne j\), if \(st_{j}^{+} \preceq st_{i}^{+} \Longrightarrow \) remove \(st_{j}^{+} \cup O(st_{j}^{+})\). Applying Rule 2 on G produces another RDF graph G’ where G’ \(=_{RDF}\) G.
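Rule 2 can be sketched as follows, assuming equal statements were already removed by Rule 1 and reusing the illustrative ((s, ts), (p, dt, lng), (o, to)) tuple encoding; the example graph mirrors Fig. 3.b with hypothetical values:

```python
def outgoings(st, statements):
    # O(st): statements whose subject value equals st's object value
    # (Definition 8); compared on their (predicate, object) parts so that
    # two blank nodes carrying identical descriptions match as duplicates.
    o_val = st[2][0]
    return {(x[1], x[2]) for x in statements if x[0][0] == o_val}

def eliminate_contained_statements(statements):
    keep = list(statements)
    for sti in statements:
        for stj in statements:
            if sti == stj or sti not in keep or stj not in keep:
                continue
            same_sp = sti[0] == stj[0] and sti[1] == stj[1]
            both_bn = sti[2][1] == "bn" and stj[2][1] == "bn"
            if same_sp and both_bn and \
               outgoings(stj, statements) <= outgoings(sti, statements):
                # st_j is contained in st_i: remove st_j and all of O(st_j)
                oj = stj[2][0]
                keep = [st for st in keep if st != stj and st[0][0] != oj]
    return keep

# bn_2 duplicates bn_1: same subject/predicate, identical outgoings
g = [
    (("s1", "u"), ("p1", None, None), ("bn_1", "bn")),
    (("s1", "u"), ("p1", None, None), ("bn_2", "bn")),
    (("bn_1", "bn"), ("p2", "string", None), ("l1", "l")),
    (("bn_1", "bn"), ("p3", None, "fr"), ("l2", "l")),
    (("bn_2", "bn"), ("p2", "string", None), ("l1", "l")),
    (("bn_2", "bn"), ("p3", None, "fr"), ("l2", "l")),
]
g2 = eliminate_contained_statements(g)
# g2 keeps the bn_1 branch only: 3 statements remain
```

When two statements mutually contain each other (identical outgoings), the first one encountered is kept, which matches the intent of removing duplicated blank-node branches.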

Solving Physical Disparities. Given an input RDF file F, physical disparities related to namespace duplication, unused namespaces, and node order variation (presented in Sect. 3.2) can be eliminated from F by applying the following transformation rules:

Rule 3 - Namespace Duplication Elimination: It is designed to eliminate namespace duplications along with the corresponding namespace prefixes, such that: \(\forall qn_{i}, qn_{j} \in QN(F) \wedge i \ne j\), if \( qn_{i}.ns_{i} = qn_{j}.ns_{j}\) \(\Longrightarrow \) remove \(qn_{j}\) \(\wedge \) replace \(qn_{j}.px_{j}\) by \(qn_{i}.px_{i}\) in F. Applying Rule 3 on F produces another equal RDF file, since the duplicated \(qn_{j}.ns_{j}\) and its corresponding prefix \(qn_{j}.px_{j}\) have been removed, while \(qn_{j}.px_{j}\) has been replaced by \(qn_{i}.px_{i}\) throughout the file.
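A minimal sketch of Rule 3, where qnames maps prefixes to namespaces; the returned renames table records the px_j → px_i substitutions still to be applied to the file body (all names are illustrative):

```python
# Rule 3 sketch: keep one prefix per namespace.
def deduplicate_namespaces(qnames):
    ns_to_px, kept, renames = {}, {}, {}
    for px, ns in qnames.items():
        if ns in ns_to_px:
            renames[px] = ns_to_px[ns]   # duplicate namespace: reuse first prefix
        else:
            ns_to_px[ns] = px
            kept[px] = ns
    return kept, renames

qn = {"ex": "http://example.org/stuff/1.0",
      "ex1": "http://example.org/stuff/1.0",       # duplicate of ex
      "dc": "http://purl.org/dc/elements/1.1/"}
kept, renames = deduplicate_namespaces(qn)
# kept drops ex1; renames maps ex1 back to ex
```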

Rule 4 - Unused Namespace Elimination: It is designed to eliminate unused namespacesFootnote 8 with their respective prefixes, such that: \(\forall qn_{i}.ns_{i} \in NS(F)\), if \(qn_{i}.ns_{i} \notin NS(G) \Longrightarrow \) remove \(qn_{i}\). Applying Rule 4 on F produces another equal file where the unused \(qn_{i}.ns_{i}\) and its respective prefix \(px_{i}\) have been removed.

Rule 5 - Reordering: It is designed to solve the varying node order problem by imposing a predefined (user-chosen) order on all statements of an RDF file F. More formally: \(\forall st_{i}^{+}, st_{j}^{+} \in ST^{+}(F) \Longrightarrow st_{i}^{+} <_{\tilde{p}} st_{j}^{+}\), where \(\tilde{p}:<iorder, sortc>\) is a tuple composed of an indexing order “iorder” and a sorting criterion “sortc”. For iorder, we follow the six indexing schemes presented in [17] (SPO, SOP, PSO, POS, OSP, OPS), i.e., the six permutations of the three statement elements (subject, predicate, object). For sortc, we adopt asc and des to represent ascending and descending order, respectively. The default value of the parameter \(\tilde{p}\) is \(<sop, asc>\), representing an ascending order of statements w.r.t. their subjects / objects / predicates (sop). Applying Rule 5 on F produces an equal RDF file where all the statements have been ordered following the (user-chosen) order parameter \(\tilde{p}\).
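Rule 5 can be sketched as a keyed sort over plain (s, p, o) string triples; iorder and sortc follow the parameter \(\tilde{p}\) described above (the encoding is illustrative):

```python
# Rule 5 sketch: iorder is one of the six permutations
# (spo, sop, pso, pos, osp, ops); sortc is "asc" or "des".
# Default is <sop, asc>, as in the text.
def sort_statements(statements, iorder="sop", sortc="asc"):
    pos = {"s": 0, "p": 1, "o": 2}
    return sorted(statements,
                  key=lambda st: tuple(st[pos[c]] for c in iorder),
                  reverse=(sortc == "des"))

sts = [("s2", "p1", "o1"), ("s1", "p2", "o1"), ("s1", "p1", "o2")]
ordered = sort_statements(sts)   # by subject, then object, then predicate
```

Because the sort key is deterministic, two files containing the same statements always serialize their statements in the same order.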

4.3 Normalization Properties

Our approach verifies the following properties characterizing the quality of the normalization process. This corresponds to the notion of Information Reusability discussed in existing studiesFootnote 9. Proofs are provided in [15].

Property 1

(Completeness). An RDF Description \(D_{i}\) (graph or file) is said to be complete regarding \(D_{j}\) if \(D_{i}\) preserves all the information in \(D_{j}\), i.e., each resource, statement and namespace of \(D_{j}\) has a corresponding resource, statement and namespace in \(D_{i}\) \(\bullet \)

Property 2

(Minimality). An RDF Description \(D_{i}\) is said to be minimal if all the resources, statements and namespaces of \(D_{i}\) are unique (i.e., they do not have duplicates in \(D_{i}\)) and all the namespaces are used \(\bullet \)

Property 3

(Compliance). An RDF File \(F_{i}\) is said to be compliant if it verifies all the rules of the W3C standard in producing a valid file based on an RDF Graph \(G_{i}\), i.e., its structure remains compliant with RDF serialization standards (e.g., RDF/XML) \(\bullet \)

Property 4

(Consistency). An RDF Description \(D_{i}\) is said to be consistent if \(D_{i}\) verifies the completeness, minimality and compliance properties to ensure the data quality of the description \(\bullet \)

4.4 Normalization Process

The inputs of our RDF normalization process are: (a) an RDF graph (logical representation) or an RDF file (physical representation) to be normalized, and (b) user parameters related to the RDF output form and prefix renaming, enabling the user to tune the results according to her requirements (see Fig. 4). In the following, we detail each step of the normalization process.

Fig. 4.

Architecture of our RDF normalization process

Logical Normalization. Algorithm 1 provides the pseudo-code to remove redundancies from an RDF graph. It starts by erasing statement redundancies involving IRIs or literals. It then removes equal statements as well as contained ones with duplicated blank nodes (bn), together with all the outgoings O derived from the bn, following Definition 9.


Physical Normalization. It is divided into three components based on the types of physical disparities being processed:

  1.

    Namespaces Controller (NC): it controls namespace duplication by eliminating redundant (Rule 3) and unused (Rule 4) namespaces in the RDF file. The input of this component is the prefix renaming parameter, which allows the user to customize prefix renaming while providing a unique way to normalize prefixes.

  2.

    Sorting Process (SP): Rule 5 resolves the node order variation, so as to obtain a unique specification of the statements in the output serialization. It considers the parameter \(\tilde{p}:<iorder, sortc>\) given by the user according to the targeted applications.

  3.

    Formatting Process (FP): it allows the user to: (a) choose a specific form for the output RDF file, (b) manage the variety of blank node serializations, and (c) manage datatypes and language tags. Our current solution allows three different output forms (other forms could be devised based on user/application requirements):

    Flat: it serializes each RDF description separately, i.e., each subject has a single declaration in the file.

    Compact: it nests the RDF description, i.e., each description may have another description nested in its declaration.

    Full compact: dedicated to RDF/XML format, it nests the RDF description, uses the ENTITY XML construct to reduce space by providing an abbreviation for IRIsFootnote 10, reuses the variables in the RDF file, and uses attributes instead of properties for the blank node serialization.

    Providing different output types is necessary to satisfy the requirements of different kinds of RDF-based applications (e.g., compact representations are usually of interest to human users when running RDF queries [4], yet less compact/more structured representations could be useful in automated processing, such as in RDF annotation recommendation [12]).
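The difference between the flat and compact forms can be illustrated with a toy RDF/XML generator for a single subject with one blank-node object. This is a hypothetical sketch, not the tool's actual serializer; rdf:nodeID and rdf:parseType="Resource" are standard RDF/XML constructs for referencing and nesting blank nodes:

```python
def serialize_flat(subject, prop, bn_props):
    # Flat form: two separate rdf:Description elements, linked via rdf:nodeID
    lines = [f'<rdf:Description rdf:about="{subject}">',
             f'  <{prop} rdf:nodeID="b0"/>',
             '</rdf:Description>',
             '<rdf:Description rdf:nodeID="b0">']
    lines += [f'  <{p}>{v}</{p}>' for p, v in bn_props]
    lines.append('</rdf:Description>')
    return "\n".join(lines)

def serialize_compact(subject, prop, bn_props):
    # Compact form: the blank node's properties are nested inside the parent
    inner = "\n".join(f'    <{p}>{v}</{p}>' for p, v in bn_props)
    return (f'<rdf:Description rdf:about="{subject}">\n'
            f'  <{prop} rdf:parseType="Resource">\n'
            f'{inner}\n'
            f'  </{prop}>\n'
            '</rdf:Description>')

flat = serialize_flat("http://www.univ-pau.fr", "ex:nameProf",
                      [("ex:first_name", "Sebastien")])
compact = serialize_compact("http://www.univ-pau.fr", "ex:nameProf",
                            [("ex:first_name", "Sebastien")])
```

Both strings encode the same statements; only the physical form differs, which is exactly the degree of freedom the FP component controls.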

5 Experimental Evaluation

5.1 Experimental Environment

An online prototype system, RDF2NormRDF Footnote 11, implemented using PHP and Java, was developed to test, evaluate and validate our RDF normalization approach. It was used to perform a large battery of experiments evaluating: (i) the effectiveness (in performing normalization), and (ii) the performance (execution time) and storage space of our approach in comparison with its most recent alternatives. To do so, we considered three large datasets: (i) Real_DS, made of real RDF files from LinkedGeoData, (ii) Syn_DS1, made of synthetic RDF files generated from the real dataset by including additional random logical redundancies and physical disparities, and (iii) Syn_DS2, a variation of Syn_DS1 including more heterogeneity (more duplications and disparities) than Syn_DS1 (cf. Table 1).

Table 1. Features of files in each dataset

Experiments were undertaken on an Intel® Core™ i7-2600 @ 3.4 GHz machine with 8.00 GB RAM, running MS Windows 7 Professional and using a Sun JDK 1.7 programming environment.

5.2 Experimental Results

We evaluated the effectiveness and performance of our method, called in this section R2NR, in comparison with alternative methods, namely the JSON-LD normalization approach [9, 13] and the HDT technique [2].

Table 2. Goals/properties achieved in datasets after normalization processes
  1.

    Effectiveness: Results in Table 2 show that our method produces normalized RDF files fulfilling all our normalization properties and goals, whereas JSON-LD and HDT miss certain logical and physical redundancies/disparities. On one hand, the JSON-LD method preserves some of the redundancies: it removes only 5 % of the average 32 % of logical redundancies in Syn_DS1, and 12 % of the average 64 % in Syn_DS2. However, it removes all logical redundancies from Real_DS (all 57 % of logical redundancies and 48 % of physical disparities were removed). A careful inspection of JSON-LD shows that it preserves blank node duplication and certain literal duplications, which explains the results obtained with Real_DS (since it does not contain any blank nodes). As a result, the JSON-LD approach does not satisfy the minimality and consistency properties. On the other hand, the HDT results show, at first glance, that it successfully eliminates all logical redundancies and physical disparities. Nonetheless, a closer look at the results revealed that the HDT technique actually preserves blank node redundancies by assigning them different identifiers and/or representing them as IRIs. Hence, the HDT method actually keeps the logical redundancies (and corresponding physical disparities), and consequently does not satisfy the completeness, minimality and consistency properties.

  2.

    Performance: In addition, we verified our approach's performance in terms of loading time and storage space, in comparison with JSON-LD and HDT.

    (a) Jena loading time: The complexity of our method comes down to worst-case \(O(N^{2})\) time, where N represents the number of RDF statements in the document (cf. [15]). Jena loading time results depicted in Figs. 5 and 6 confirm our approach's polynomial (almost linear) time w.r.t. document size. Also, results in Figs. 5.a and 5.b demonstrate that our method executes faster than JSON-LD. Note that redundancy reduction using JSON-LD amounts to 5 % of average file size in Syn_DS1, while our method reaches an average 27 % size reduction ratio, which explains the reduction in loading time. In addition, results in Figs. 6.a and 6.b show that our method also remains faster than the HDT method. In fact, as shown in Table 2, the datasets generated by HDT do not have redundancies and disparities, yet contain a larger number of IRIs with no (zero) BNs. This confirms that HDT transforms BNs into IRIs, which shows that RDF compression does not always guarantee normalization. Note that we are currently investigating this issue in more detail in a dedicated experimental study.

    (b) Storage: Neither the JSON-LD nor the HDT method provides parameters to customize output format requirements as we do. They work with their predefined outputs, i.e., N-Triples for the JSON-LD method and Bitmap Triples (BT) for the HDT method, whereas our method handles the standard formats and thus allows producing different outputs w.r.t. the target application. Figure 7 shows that our normalization method reduces the size of the RDF files in all formats of the datasets processed by the JSON-LD and HDT methods.

Fig. 5.

Average Jena loading time comparison with JSON-LD method

Fig. 6.

Average Jena loading time comparison with HDT method

Fig. 7.

Average reduction in datasets Syn_DS1 and Syn_DS2 w.r.t. the output format

6 Conclusion

We proposed here an RDF normalization method, RDF2NormRDF, able to: (i) preserve all the information in RDF descriptions, (ii) eliminate all the logical redundancies and physical disparities in the output RDF description, (iii) establish a unique specification of the statements in the RDF output description, and (iv) consider user parameters to handle application requirements. To our knowledge, this is the first attempt to study and integrate RDF normalization in two aspects: logical redundancies and physical disparities. Since the presence of logical redundancies in RDF descriptions heavily impacts the storage and processing of information, our theoretical proposal and experimental evaluation showed that our approach yields improved normalization results with respect to existing alternatives. Ongoing work includes exploiting semantic normalization to improve not only the structure of RDF descriptions, but also their information content (i.e., IRIs, blank nodes, and literals). Future directions also include the semantic disambiguation of RDF literals and IRIs.