1 Introduction

The Resource Description Framework (RDF) [30] provides a simple scheme for structuring and linking data that describe facts of the world [10]. It models knowledge in the form of triples (subject, predicate, object), in which the subject identifies the resource being described, the predicate names one of its properties, and the object holds the value of that property for the given subject. RDF was originally conceived, under a document-centric perspective of the Web, as a foundation for processing metadata and describing resources. However, this conception falls short of its actual usage. The current Recommendation regards RDF as the key to doing for machine-processable information (application data) what the WWW has done for hypertext: allowing data to be processed outside the particular environment in which they were created, in a fashion that works at Internet scale. This statement describes the status of RDF in the so-called Web of Data.

The Web of Data materializes the basic principles of the Semantic Web [8] and interconnects datasets from diverse fields of knowledge within a cloud of data-to-data hyperlinks that enables ubiquitous and seamless data integration at the lowest level of granularity. Thus, information follows a data-centric organization within the Web of Data. Advances in extraction mechanisms [22] and in the annotation of massive amounts of resources [42], among others, have fueled the growth of the Web of Data, in which the number (and scale) of semantic applications in use increases, more RDF data are linked together, and increasingly larger datasets are obtained. This popularity drives the development of RDF management systems (referred to as RDF stores), which play a central role in the Web of Data. They provide RDF storage as well as the lookup infrastructure to access it via SPARQL [36] query interfaces. Although the increased amount of RDF available is good for semantic initiatives, it is also causing performance bottlenecks in the RDF stores currently used [26]. Thus, scalability arises as a major issue, restricting popular RDF applications, like inference-based ones, because traditional solutions are not suitable for large-scale deployments [28]. Managing scalability is closely tied to the physical organization and storage of RDF and to the mechanisms designed for its retrieval.

Two families of RDF stores are mainly used within the current Web of Data. On the one hand, relational stores are built on the generality, scalability, and maturity of relational databases. Some logical models have been proposed for organizing and storing RDF in relational stores [38]. However, the relational model is too rigid to fit the semi-structured features of RDF, and these systems have achieved only limited success. On the other hand, native solutions are custom designed from scratch and focus directly on specific RDF features. Although various techniques have been proposed, multi-indexing ones are the most frequently used in the current state of the art [6, 34, 48]. Even so, these approaches also suffer from lack of scalability because they incur huge space requirements.

We address this scenario with one main guideline: to reduce the space required for organizing and storing RDF. Space savings also have a significant impact on processing times, because larger datasets can be managed and queried in main memory, thus reducing I/O costs. Our approach, called \(\hbox {k}^2\)-triples, leverages this fact to manage larger RDF datasets in main memory.

\(\hbox {k}^2\)-triples revisits vertical partitioning [3] by replacing relational storage with compact \(\hbox {k}^2\)-tree structures [13]. The \(\hbox {k}^2\)-tree provides indexed access to binary matrices and excels in compression when these matrices are very sparse. This case arises when an RDF dataset is vertically partitioned, because in real-world datasets the number of subjects related to each object–predicate pair, and the number of objects related to each subject–predicate pair, are very small [18]. This fact yields space efficiency, outperforming the best state-of-the-art alternatives by a factor of 1.5–12 in space. The \(\hbox {k}^2\)-tree representation also enables efficient RDF retrieval for triple patterns, which are the basic SPARQL queries. Our representation is up to 5–7 orders of magnitude faster than the state of the art when solving most triple patterns.

Our basic \(\hbox {k}^2\)-triples approach is further enhanced with additional compact data structures to speed up the processing of some SPARQL queries, in particular those with no fixed predicate. This is the main weakness of vertical partitioning and is directly related to the number of different predicates used for modeling a dataset. We define two compact indexes that list the predicates related to each subject and to each object in the dataset. These structures involve an acceptable space overhead (less than 30 % of the original space requirements) and improve performance by more than an order of magnitude when these classes of queries are solved on a dataset comprising a large predicate set.

We also focus on join operations, as they are the basis for building the Basic Graph Patterns (BGPs) commonly used in SPARQL. We optimize the traditional merge and index join algorithms to take advantage of the basic retrieval functionality provided in \(\hbox {k}^2\)-triples. Besides, we describe an innovative join algorithm that traverses several \(\hbox {k}^2\)-trees in parallel and reports excellent results in many practical scenarios. Our technique sharply outperforms traditional vertical partitioning when solving joins, by up to \(5\) orders of magnitude in joins involving any variable predicate. The comparison with more advanced techniques shows that \(\hbox {k}^2\)-triples performs up to 3 orders of magnitude faster in joins providing the non-joined nodes, while remaining competitive in the others.

Our experiments compare \(\hbox {k}^2\)-triples with several state-of-the-art alternatives on various real-life RDF datasets, considering space and query time. In summary, \(\hbox {k}^2\)-triples achieves, to the best of our knowledge, the most compressed RDF representations, packing \(\approx 200,000\) triples/MB in our largest dataset (dbpedia), where the next best techniques in space usage (MonetDB and RDF-3X, in this order) represent \(125,000\) and \(25,000\) triples per MB. When solving basic triple patterns, our enhanced structure requires 0.01–1 millisecond (msec) per query on dbpedia, whereas the next fastest alternative (RDF-3X) takes 1 to 10 msec and MonetDB takes 100 msec to 1 second. Finally, our best numbers in join solution range from 0.01 to 10 msec per query (also on dbpedia), whereas RDF-3X always requires over 10 msec and MonetDB spends more than 1,000 seconds in the same cases. Generally, our times are below 0.1 seconds per query, which is comparable to the best performance reported in the state of the art using RDF-3X.

The paper is organized as follows. The next section gives basic notions about RDF and SPARQL and reviews the state of the art on RDF stores. Section 3 introduces compact data structures and details the \(\hbox {k}^2\)-tree index used as the basis for our approach. The next three sections give a full description of our approach: Sect. 4 explains how \(\hbox {k}^2\)-trees are used for physical RDF organization and storage, Sect. 5 describes the mechanisms designed for basic RDF retrieval over this data partitioning, and Sect. 6 details the join algorithms designed as the basis for BGP solution in SPARQL. Section 7 experimentally compares \(\hbox {k}^2\)-triples with state-of-the-art RDF stores on various real-world datasets, focusing on both space usage and query time performance. Finally, Sect. 8 gives our conclusions about the current scope of \(\hbox {k}^2\)-triples and outlines its future evolution within the Web of Data.

2 State of the art

The marriage of RDF and SPARQL is a cornerstone of the Web of Data because they are the standards recommended by the W3C for describing and querying semantic data. Both are briefly introduced to give basic notions about their use and properties.

As previously described, RDF [30] provides a description framework for structuring and linking data in the form of triples (subject, predicate, object). A triple can also be seen as an edge in a labeled graph, \(\mathtt{S}\mathop {\rightarrow }\limits ^{\mathtt{P}} \mathtt{O}\), where the subject S and the object O are represented as vertices and the predicate P labels the edge that joins them. The graph modeling the whole triple set is called the RDF graph, a term formally introduced in the RDF Recommendation [30].

Figure 1 models in RDF some information related to the Spanish National Soccer Team (hereinafter referred to as Spanish Team) and some of its players. Two equivalent notations are considered: (a) enumerates the set of triples representing this information, whereas (b) shows its equivalent graph-based representation. Following the triples representation in (a), we first state that the Spanish Team represents Spain and Madrid is the capital of Spain. Then, we describe the player Iker Casillas: He was born in Madrid, plays for the Spanish Team in the position of goalkeeper, and he is also the team captain. Finally, both Iniesta and Xavi play for the Spanish Team in the position of midfielder. These same relations can be found by traversing the labeled edges in the graph (b).

Fig. 1 Example of RDF-based modeling

SPARQL [36] is the W3C Recommendation for querying RDF. It is a graph-matching language built on top of triple patterns, that is, RDF triples in which each subject, predicate, or object may be a variable. This means that eight different triple patterns are possible in SPARQL (variables are preceded, in the pattern, by the symbol ?): (S,P,O), (S,?P,O), (S,P,?O), (S,?P,?O), (?S,P,O), (?S,?P,O), (?S,P,?O), and (?S,?P,?O).

SPARQL builds more complex queries (generically referred to as Basic Graph Patterns, BGPs) by joining sets of triple patterns. Thus, competitive SPARQL engines require, at least, fast triple pattern solution and efficient join methods. Additionally, query optimizers are required to build efficient execution plans that minimize the amount of intermediate results to be joined in the BGP. Query optimization is orthogonal to our work; such techniques can be implemented on top of \(\hbox {k}^2\)-triples.

Fig. 2 Examples of (a) a SPARQL triple pattern and (b) a SPARQL basic graph pattern

Figure 2 shows two simple SPARQL queries over the RDF excerpt described in the example above:

  • The first query (left), expressed in SPARQL syntax, represents the triple pattern (?S,P,O). It asks for all subjects related to the Spanish Team through the predicate playFor. From a structural perspective, this query is a subgraph comprising two nodes connected through the edge labeled playFor: The destination node represents the element Spanish Team, whereas the source node is a variable. This way, the query solution involves graph pattern matching for locating all nodes that can play the source role in this query subgraph. In this case, the valid nodes represent the players Iker Casillas, Iniesta, and Xavi.

  • The second query restricts the previous one to retrieve only the midfielder players of the Spanish Team. This refinement is implemented through a second triple pattern (?S,P,O) setting the predicate position and the object midfielder. As can be seen on the right of the figure, the two triple patterns of the query are joined by their subject. Its solution matches the query subgraph against the RDF graph and retrieves the elements conforming to the variable node; in this case, the result contains the players Iniesta and Xavi.

RDF is defined as a logical data model, so no restrictions are imposed on its physical representation and/or storage. However, its implementation has a clear effect on the retrieval efficiency and, therefore, on the success of a SPARQL engine within an RDF store. We review below the existing techniques for modeling, partitioning, and indexing RDF, and discuss their use in some real RDF stores. We aim to show the achievements and shortcomings in the state of the art to highlight the potential for improvement on which our work focuses. We first describe the approaches based on a relational infrastructure, and then the solutions natively designed for handling RDF.

2.1 Relational solutions

Some logical schemes have been proposed for representing RDF over the infrastructure provided by relational databases, but their success has been limited due to the strictness of the relational model for handling the semi-structured RDF features. Nevertheless, there is still room for optimization in the field of relational solutions for RDF management [38]; we describe below the most frequently used schemes.

Single three-column table This is the most straightforward scheme for modeling RDF over relational infrastructure. It represents RDF triples as generic tuples in a huge three-column table, so BGP solution involves many expensive self-joins on this table. Systems such as 3store [23] or the popular Virtuoso [47] implement this scheme.

Property tables This model arises as a more practical scheme for RDF organization in relational databases because it creates relational-like property tables out of RDF data. These tables gather together information about multiple predicates (properties) over a list of subjects. Thus, a given property table has many columns, since different predicates (one per column) are used for describing the subjects it stores (in rows). Although this model significantly reduces the number of self-joins, the cost of query solution remains high. Besides, the use of property tables induces two additional problems. On the one hand, storage requirements increase because NULL values must be explicitly stored, in each tuple, if the represented subject is not described for a given property in the table. On the other hand, multi-valued attributes are abundant in semantic datasets, and they are somewhat awkward to express in property tables [3]. Thus, property tables are a competitive choice for representing well-structured datasets, but they lose potential in the general case. Systems like Jena [49] or Sesame [15] use property tables for modeling RDF.

Vertical partitioning The vertical partitioning (VP) scheme [3] can be seen as a specialized case of property tables in which each table gathers information about a single predicate. This way, VP uses as many tables as there are different predicates in the dataset, each one storing tuples (S,O) that represent all (subject, object) pairs related through a given predicate. Each table is generally sorted by the subject column, so particular subjects can be located quickly, and fast merge joins can be used to reconstruct information about multiple properties for subsets of subjects [3]. However, this decision penalizes queries by object, which require additional object indexing to achieve competitive performance.

VP-based solutions avoid the weaknesses previously reported for property tables because only non-NULL values are stored, and multi-valued attributes are listed as successive tuples in the corresponding table. However, VP-based solutions suffer from an important lack of efficiency for solving queries with unbounded predicate; in this case, all tables must be queried and their results must then be merged to obtain the final result. This cost increases linearly with the number of different predicates used in the dataset, so VP is not the best choice for representing datasets with many predicates.

Abadi et al. [1, 3] report that querying performance in column-oriented databases is up to one order of magnitude better than that obtained in row-oriented ones. This fact motivates the implementation of their system SW-Store as an extension of the column-oriented database C-Store [45]. SW-Store leverages all the advantages reported above, but also suffers from lack of scalability for queries with unbounded predicate. SW-Store, like some other approaches (such as those reviewed below: Hexastore, RDF-3X, and BitMat), first performs a dictionary encoding that maps long URIs and literal values to integer IDs. This decision allows triples to be rewritten as three-ID groups, and this is the representation finally stored in the database. Additional experiments on VP are reported in [44], where C-Store is replaced by MonetDB [32] in the database layer; these systems differ in two main respects [43]: \((i)\) data processing in C-Store is disk-based, while it is memory-based in MonetDB; and \((ii)\) C-Store implements carefully optimized merge joins and makes heavy use of them, whereas MonetDB uses merge joins less frequently. Even so, MonetDB arises as a competitive choice in this scenario [44].

2.2 Native solutions

Native solutions are custom designed from scratch to better address RDF peculiarities. Although some works [4, 11, 25] propose different graph-based models, the main line of research focuses on multi-indexing solutions. Harth and Decker [24] propose a six-index structure for managing quads. This scheme allows all quads conforming to a given query pattern (in which the context can also be a variable) to be quickly retrieved. This experience has been integrated into some systems within the current state of the art for RDF management.

Hexastore [48] adopts the rationale of VP and multi-indexing, but takes it further. In contrast to VP, Hexastore treats subjects, predicates, and objects equally. That is, while VP prioritizes predicates and indexes pairs (subject, object) around them, Hexastore builds specific indexes around each dimension and defines a prioritization between the other two:

  • For each subject S, two representations (P,O) and (O,P) are built.

  • For each predicate P, two representations (S,O) and (O,S) are built.

  • For each object O, two representations (S,P) and (P,S) are built.

Thus, Hexastore manages six indexes: (S,P,O), (S,O,P), (P,S,O), (P,O,S), (O,S,P), and (O,P,S). In a naive comparison, the VP scheme (sorted by subject) can be seen as an equivalent representation to the index (P,S,O) in Hexastore. Thus, Hexastore stores triples in a combination of sorted sequences that requires, in the worst case, 5 times the space used to index the full dataset in a single triples table. It can be less because some sequences can be shared between different indexes (for instance, the object sequence is interchangeably used in the indexes SPO and PSO). The Hexastore organization provides direct solution for all triple patterns and also ensures that the first step in pairwise joins can always be implemented as a fast merge join. However, its large storage requirements limit Hexastore when representing large datasets, because it is implemented to be queried in main memory.

RDF-3X [34] introduces index compression to reduce the space usage of this kind of exhaustive indexing. RDF-3X creates its indexes over a single “giant triples table” (with columns \(\mathtt{v}_\mathtt{1},\mathtt{v}_\mathtt{2},\mathtt{v}_\mathtt{3}\)) and stores each of them in a compressed clustered \(\hbox {B}^+\)-tree. Triples, within each index, are lexicographically sorted, allowing SPARQL patterns to be converted into range scans.

The collation order implemented in the RDF-3X table causes neighboring triples to be very similar. In most cases, neighboring triples share the values in \(\mathtt{v}_\mathtt{1}\) and \({\mathtt{v}}_\mathtt{2}\), and the increases in \(\mathtt{v}_\mathtt{3}\) are very small. This enables the use of differential compression to represent a given triple with respect to the previous one. This scheme is leaf-oriented within the B\(^+\)-tree, so the compression is applied individually to each leaf. Although the authors test some well-known bitwise codes (\(\gamma \)-codes, \(\delta \)-codes, and Golomb codes [40]), they finally apply a bytewise code specifically designed for differential triple compression, which decompresses much faster than bitwise techniques in exchange for a small loss in compression. RDF-3X also manages aggregated indexes (SP), (PS), (SO), (OS), (PO), and (OP), which store the number of occurrences of each pair in the dataset. RDF-3X also contributes a RISC-style query processor that mainly relies on merge joins over the sorted indexes. Besides, it implements a query optimizer mostly focused on join ordering in its generation of execution plans.

RDF-3X reports very high performance, outperforming SW-Store by a large margin. This makes it a leading reference in the area. However, despite its compression improvements, the space requirements of RDF-3X remain very high. This impacts query performance because large amounts of data need to be transferred from disk to memory, which can be very expensive with respect to the query solution itself [43, 44].

BitMat [6] goes one step further in compression, designing query algorithms that perform directly on the compressed representation. BitMat introduces an innovative compressed bit-matrix to represent the RDF structure. It is conceptually designed as a bit-cube \(\mathtt{S}\times \mathtt{P}\times \mathtt{O}\), but its final implementation slices this cube into two-dimensional matrices: SO and OS for each predicate P, PO for each subject S, and PS for each object O. These matrices are run-length compressed [40] by taking advantage of their sparseness. Two additional bit arrays are used to mark non-empty rows and columns in the bit matrices SO and OS. The results reported for BitMat show that it only outperforms the state of the art for low-selectivity queries. However, it is an interesting achievement because it demonstrates that avoiding the materialization of intermediate results is a very significant optimization for these queries.

Finally, hybrid [39] and memory-based stores [9, 27] represent an emerging alternative in this scenario, but their current results are limited to managing small datasets, as previously shown for Hexastore. Their scalability is clearly compromised by the use of structures, like indexes and hash tables, that demand large amounts of memory. Various semantic applications, such as inference-based ones, require scalable memory-based stores because they perform orders of magnitude faster if the entire dataset is in memory [26], and they also support a higher degree of reasoning. New opportunities arise for memory-based stores thanks to the advances in distributed computing. This class of solutions, recently studied [26, 46] on the MapReduce framework, allows arbitrarily large RDF data to be handled in main memory because more nodes can be added to a cluster when more resources are necessary. Yet, these systems still require further research to ensure efficient RDF exchange [17, 18] and efficient performance at each node.

3 Succinct data structures

Succinct data structures [33] aim at representing data structures (e.g., sets, trees, hash tables, graphs, texts) using as little space as possible, and at manipulating them in that compressed form. They are able to approach the information-theoretic minimum space required to store the original data while retaining direct access to the data and efficiently supporting additional operations on it. These features yield competitive overall performance, because the space reductions obtained allow specific functionality to be implemented in faster levels of the memory hierarchy. This section covers the basic concepts about the succinct data structures involved in our approach.

3.1 Binary sequences

Binary sequences (bitstrings) are the basis of many succinct data structures. A bitstring \({\mathcal {B}}[1,n]\) stores a sequence of \(n\) bits and provides efficient solution for three basic operations:

  • rank \(_\mathtt{a}({\mathcal {B}},\mathtt{i})\) counts the occurrences of the bit \(a\) in \({\mathcal {B}}[1,i]\).

  • select \(_\mathtt{a}({\mathcal {B}},\mathtt{i})\) locates the position for the \(i\)-th occurrence of \(a\) in \({\mathcal {B}}\).

  • access \(({\mathcal {B}},\mathtt{i})\) returns the bit stored in \({\mathcal {B}}[i]\).

All these operations can be solved in constant time using \(n+o(n)\) bits of total space: \(n\) bits for \({\mathcal {B}}\) itself, and \(o(n)\) additional bits for the structures used to answer the queries. In this paper, we consider an implementation [19] which uses, in practice, 5 % extra space on top of the original bitstring size and provides fast query solution.
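
To make these operations concrete, the following minimal Python sketch implements rank, select, and access by scanning the bitstring; it is illustrative only, since the practical implementation [19] answers these queries in constant time using precomputed counters rather than linear scans, and the class name and interface are our own.

    # Naive bitstring with rank/select/access (1-based positions).
    # Illustrative only: the structure in [19] adds ~5% of precomputed
    # directories so that these queries run in constant time.
    class Bitstring:
        def __init__(self, bits):
            self.B = list(bits)

        def access(self, i):                   # bit stored in B[i]
            return self.B[i - 1]

        def rank(self, a, i):                  # occurrences of a in B[1,i]
            return sum(1 for b in self.B[:i] if b == a)

        def select(self, a, i):                # position of i-th a in B
            count = 0
            for pos, b in enumerate(self.B, start=1):
                if b == a:
                    count += 1
                    if count == i:
                        return pos
            return -1                          # fewer than i occurrences

    B = Bitstring([1, 0, 1, 1, 0, 1])
    assert B.rank(1, 4) == 3 and B.select(1, 3) == 4 and B.access(5) == 0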

3.2 Directly addressable codes (DACs)

The use of variable-length codes is the basic principle of data compression: The most frequent symbols are encoded with shorter codewords, whereas longer codewords are used for representing less frequent symbols. However, variable-length codes complicate random access to elements in the compressed sequence, which is required in many practical scenarios (as those studied in this paper) for efficient retrieval. Directly Addressable Codes (DACs) [12] are a practical solution to this problem.

DACs start from a variable-length encoding of the symbols in a sequence. Each codeword (a variable-length bit sequence) is accommodated in a number of chunks of length \(b\), using as many chunks as necessary, and thus the encoded sequence can be regarded as a sequence of chunks. Those chunks are reordered to enable direct access to any element in the encoded sequence.

Accessing a codeword in a DAC compressed sequence takes \(O(\log (M)/b)\) time in the worst case, where \(M\) is the longest codeword length. However, this access time is lower for elements with shorter codewords, and these are the most frequent ones.
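
The following Python sketch illustrates the chunk reordering and the level-by-level access just described, under simplifying assumptions of ours: the codewords are the plain binary representations of non-negative integers, and rank on the continuation bitmaps is computed naively instead of in constant time.

    # Sketch of DACs for non-negative integers with chunk width b.
    # Level j stores the j-th b-bit chunk of every value needing at
    # least j chunks, plus a bitmap marking which values continue
    # into level j+1; access walks the levels using rank.
    def dac_encode(values, b):
        levels, current = [], list(values)
        while current:
            chunks = [v & ((1 << b) - 1) for v in current]
            cont = [1 if v >> b else 0 for v in current]
            levels.append((chunks, cont))
            current = [v >> b for v in current if v >> b]
        return levels

    def dac_access(levels, i, b):              # i is 0-based
        value, shift = 0, 0
        for chunks, cont in levels:
            value |= chunks[i] << shift
            if not cont[i]:
                break
            i = sum(cont[:i + 1]) - 1          # rank1(cont, i) - 1
            shift += b
        return value

    levels = dac_encode([5, 300, 2, 70], b=4)
    assert [dac_access(levels, i, 4) for i in range(4)] == [5, 300, 2, 70]

Note how the number of levels visited when accessing a value equals the number of chunks of its codeword, which matches the \(O(\log (M)/b)\) worst-case bound.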

3.3 \(\hbox {K}^2\)-trees

The \(\hbox {k}^2\)-tree [13] is a succinct data structure for graph representation. It models a graph of \(n\) vertices through its (binary) adjacency matrix, \({\mathcal {M}}\), of size \(n \times n\). Thus, \({\mathcal {M}}[i,j]=1\) iff the vertices represented in the \(i\)-th row and the \(j\)-th column are related.

The \(\hbox {k}^2\)-tree leverages sparseness and clustering features, which arise in some classes of real-world graphs (such as Web graphs [13] and social networks [16]), to achieve compression. These features imply the existence of large “empty” areas (all cells have value 0), and the \(\hbox {k}^2\)-tree excels at compressing them. Conceptually, the \(\hbox {k}^2\)-tree subdivides the matrix into \(k^2\) submatrices of the same size, which results in \(k\) rows and \(k\) columns of submatrices of size \(n/k \times n/k\). Each of those \(k^2\) submatrices is represented in the tree using a single bit that is appended as a child of the root: A bit 1 is used for representing those submatrices containing at least one cell with value \(1\), whereas a 0-bit means that all cells in the submatrix are \(0\). Once this first level is built, the method proceeds recursively for each child with value \(1\) until submatrices full of 0s or the last level of the tree are reached. This process results in a non-balanced \(\hbox {k}^2\)-ary tree in which the bits in its last level correspond to the cell values in the original matrix.

The \(\hbox {k}^2\)-tree is implemented in a very compact way using two bitstrings: T (tree) and L (leaves). T stores all the bits in the \(\hbox {k}^2\)-tree except those stored in the last level. The bits are placed following a levelwise traversal: first the \(k^2\) binary values of the children of the root node, then the values of the second level, and so on. This configuration enables the \(\hbox {k}^2\)-tree to be traversed by performing efficient rank and select operations on the bitstring. On the other hand, L stores the last level of the tree, comprising the cell values in the original matrix. Although L is conceptually a bitstring, in fact, the decomposition stops when the matrices reach size \(k_L \times k_L\) and DACs are used to compress the submatrices according to frequency, while retaining fast direct access to any leaf submatrix.

Besides its compression ability, the \(\hbox {k}^2\)-tree provides various navigational operations on the graph. In particular, for a given vertex \(v\), the \(\hbox {k}^2\)-tree supports the operations of retrieving \((i)\) all the vertices pointed by \(v\) (direct neighbors), and \((ii)\) all the vertices that point to \(v\) (inverse neighbors). Additionally, range queries (retrieving all the connections within a submatrix), and the fast check of a given cell value, are also supported by the \(\hbox {k}^2\)-tree.

Conceptually, direct neighbors retrieval, for a vertex \(v_i\), requires finding all the cells with value \(1\) in the \(i\)-th row. Symmetrically, the inverse neighbors of \(v_i\) are retrieved by locating all the 1s in the \(i\)-th column. Both operations are efficiently implemented on a top-down traversal of the tree, requiring O(1) time per node visited (and \(O(n)\) overall time in the worst case, but usually much less in practice). This traversal starts at the root of the \(\hbox {k}^2\)-tree, \(pos=0\), and visits in each step the children of all nodes with value 1 in the previous level and whose matrix area is not disjoint from the area one wants to retrieve. Given a node represented at position \(pos_i\) of T, its \(k^2\) children are represented consecutively starting at position \(rank_1(T,pos_i) \cdot k^2\) of T:L (the concatenation of T and L). Analogous algorithms implement the range queries and the check for a given cell value.
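
The construction and the direct-neighbors traversal can be sketched as follows in Python; this is a didactic version of ours with a fixed \(k\), plain lists instead of compressed bitmaps, naive rank, and an uncompressed last level (recall that the actual structure compresses the leaf submatrices with DACs).

    # Didactic k^2-tree over an n x n binary matrix M (n a power of k).
    def build(M, n, k):
        T, L, queue = [], [], [(0, 0, n)]      # (row, col, size) to expand
        while queue:
            nxt = []
            for r, c, size in queue:
                sub = size // k
                for i in range(k):
                    for j in range(k):
                        r2, c2 = r + i * sub, c + j * sub
                        bit = int(any(M[x][y]
                                      for x in range(r2, r2 + sub)
                                      for y in range(c2, c2 + sub)))
                        (L if sub == 1 else T).append(bit)
                        if bit and sub > 1:
                            nxt.append((r2, c2, sub))
            queue = nxt
        return T, L

    def direct_neighbors(T, L, n, k, row):     # 0-based row; 0-based cols
        TL = T + L
        def rank1(pos):                        # naive; constant time in [19]
            return sum(T[:pos + 1])
        out = []
        def traverse(pos, size, r, c):
            if pos >= 0 and TL[pos] == 0:      # empty submatrix: prune
                return
            if size == 1:                      # leaf cell with value 1
                out.append(c)
                return
            children = rank1(pos) * k * k if pos >= 0 else 0
            sub = size // k
            i = (row - r) // sub               # only the stripe of children
            for j in range(k):                 # overlapping the queried row
                traverse(children + i * k + j, sub, r + i * sub, c + j * sub)
        traverse(-1, n, 0, 0)                  # -1 stands for the virtual root
        return out

For instance, on the matrix of Fig. 3, direct_neighbors(T, L, 16, 2, 10) would list the (0-based) columns connected to the \(11\)-th vertex of the example below.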

Fig. 3 Example of \(\hbox {k}^2\)-tree for an adjacency matrix of size \(16 \times 16\)

Example

Figure 3 shows a \(16\times 16\) adjacency matrix (left) and the \(\hbox {k}^2\)-tree (right) representing it, using \(k=2\). The configurations of the two bitstrings, T and L, implementing the \(\hbox {k}^2\)-tree are also shown at the bottom. The matrix is conceptually divided into \(2^2=4\) submatrices. In the first step, the submatrices are of size \(8 \times 8\). To retrieve the direct neighbors of the \(11\)-th vertex, we must find all the cells with value \(1\) in the \(11\)-th row of the adjacency matrix. The first step starts at the root of the \(\hbox {k}^2\)-tree, \(pos=0\), and computes the children overlapping the \(11\)-th row. These are the third and the fourth children (representing the submatrices at the bottom of the original adjacency matrix), which are, respectively, represented in \(\mathbf{T}[2]\) and \(\mathbf{T}[3]\) (assuming that positions in T are numbered from 0). Both \(\mathbf{T}[2]\) and \(\mathbf{T}[3]\) have value 1, so both children must be traversed. For simplicity, we only detail the procedure for the third child, so now \(pos=2\). The second step first computes the position of the first child of the current node: \(pos=rank_1(\mathbf{T},2) \cdot 2^2=8\), and checks the value of the \(k^2=4\) bits stored from \(\mathbf{T}[8]\): \([0100]\). In this case, only the second child (represented at \(pos=9\)) has value 1, so this is the node to be visited in the third step. The children of this node are located from \(pos=rank_1(\mathbf{T},9) \cdot 2^2=28\) and contain the values \([0101]\). Although the second child is 1, it is not a valid match for our query because it has no intersection with the \(11\)-th row. This means that only the fourth child (represented at \(pos=31\)) is visited in the fourth step. The new position \(pos=rank_1(\mathbf{T},31) \cdot 2^2=56\) is larger than \(\vert \mathbf{T} \vert = 36\), so it represents a leaf. Thus, the \(k^2\) resulting leaves must be checked from position \(56-36=20\) of L. The bits \([1000]\) represent this submatrix, so one connection is found in the \(11\)-th row: the cell at the \(7\)-th column.

4 Compressed vertical partitioning on \(\hbox {k}^2\)-triples

This section describes how the \(\hbox {k}^2\)-tree structure can be applied to the problem of RDF storage. Our approach is called \(\hbox {k}^2\)-triples. We first perform a specific dictionary encoding that allows triples to be managed as three-ID groups: \((id_1, id_2, id_3)\), in which \(id_1\) is the integer value that identifies the subject in the dictionary, \(id_2\) identifies the predicate, and finally \(id_3\) identifies the object. This decision simplifies data partitioning on \(\hbox {k}^2\)-trees because a direct correspondence can be established between rows and columns in the adjacency matrix and subject and object IDs in the dictionary.

4.1 Dictionary encoding

Dictionary encoding is a common preliminary step performed before data partitioning. All the different terms used in the dataset are first extracted and mapped to integer values through a dictionary function. As explained above, this allows long terms occurring in the RDF triples to be replaced by short IDs referencing them within the dictionary. This simple decision greatly compacts the dataset representation and mitigates scalability issues.

We propose a dictionary organization comprising four independent categories, in which terms are usually organized in lexicographic order [6, 18, 31]:

  • Common subjects and objects (SO) organizes all the terms that play both subject and object roles in the dataset. They are mapped to the range \([1,\vert \mathtt{SO}\vert ]\).

  • Subjects (S) organizes all the subjects that do not play an object role. They are mapped to \([\vert \mathtt{SO}\vert +1,\vert \mathtt{SO}\vert +\vert \mathtt{S}\vert ]\).

  • Objects (O) organizes all the objects that do not play a subject role. They are mapped to \([\vert \mathtt{SO}\vert +1,\vert \mathtt{SO}\vert +\vert \mathtt{O}\vert ]\). Note this interval overlaps with that for S; no confusion arises because subjects and objects are distinguished by their position in the triple.

  • Predicates (P) maps all the predicates to \([1,\vert \mathtt{P}\vert ]\).

In this way, terms playing subject and object roles are represented only once. This yields a significant reduction if we consider that up to 60 % of the terms in the dictionary are in the SO area for real-world datasets [31]. Besides, this four-category organization improves performance for subject–object joins because all their possible matches are elements playing both subject and object roles, and all of them are in the range \(\mathtt{[1,\vert \hbox {SO}\vert ]}\). In addition, the intervals for subjects and objects are still contiguous.

How the dictionary is finally implemented is orthogonal to the problem addressed in this paper, and any existing technique in the state of the art could be adapted to our organization. Nevertheless, we emphasize that RDF dictionaries take up to 3 times more space than that required for representing the triples structure [31], so compressed dictionary indexes are highly desirable for efficient and scalable management of huge RDF datasets.
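
A minimal Python sketch of the four-category ID assignment follows; the function names are illustrative, and the lexicographic sorting within each category mimics the usual dictionary organization [6, 18, 31].

    # Four-category dictionary encoding: SO terms get [1,|SO|], S-only
    # terms [|SO|+1,|SO|+|S|], O-only terms the overlapping range
    # [|SO|+1,|SO|+|O|], and predicates [1,|P|].
    def build_dictionary(triples):
        subs = {s for s, p, o in triples}
        objs = {o for s, p, o in triples}
        so, s_only, o_only = subs & objs, subs - objs, objs - subs
        ids = {}
        for i, t in enumerate(sorted(so), 1):
            ids[('so', t)] = i
        for i, t in enumerate(sorted(s_only), len(so) + 1):
            ids[('s', t)] = i
        for i, t in enumerate(sorted(o_only), len(so) + 1):
            ids[('o', t)] = i
        pids = {t: i
                for i, t in enumerate(sorted({p for s, p, o in triples}), 1)}

        def encode(s, p, o):                   # triple -> three-ID group
            sid = ids.get(('so', s), ids.get(('s', s)))
            oid = ids.get(('so', o), ids.get(('o', o)))
            return (sid, pids[p], oid)
        return encode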

Example

Figure 4 illustrates our dictionary organization over the RDF excerpt used in Fig. 1. As can be seen, the terms Madrid and Spanish Team (playing both as subject and object) are, respectively, identified with the values 1 and 2, the three remaining subjects are represented in the range [3,5], and likewise the three remaining objects are identified with the same values: {3,4,5}. Finally, the six predicates used in the example are identified with the range [1,6]. On the right of the figure, the ID-based representation of the original triples is shown.

Fig. 4 Example of dictionary encoding on \(\hbox {k}^2\)-triples

Fig. 5 Vertical partitioning on \(\hbox {k}^2\)-triples (the parameter \(k\) is set to 2)

4.2 Data partitioning

\(\hbox {k}^2\)-triples models RDF data following the well-known vertical partitioning approach. This scheme reorganizes a dataset into \(\vert \mathtt{P}\vert \) disjoint subsets that contain all the triples related to a given predicate. Thus, all triples in a subset can be rewritten as pairs of subject and object (S,O), because the corresponding predicate is implicitly associated with the given subset.

Each subset is independently indexed in a single \(\hbox {k}^2\)-tree that represents subjects and objects as rows and columns of the underlying matrix. That is, each \(\hbox {k}^2\)-tree models an adjacency matrix of \(\vert \mathtt{SO}\vert + \vert \mathtt{S}\vert \) rows and \(\vert \mathtt{SO}\vert + \vert \mathtt{O}\vert \) columns.

All the \(\hbox {k}^2\)-trees used in our approach are physically built with a hybrid policy that uses value \(k=4\) up to level \(5\) of the tree, and then, \(k=2\) for the rest [13]. The leaves, regarded as submatrices of size \(8 \times 8\), are encoded using DACs [12].
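
Putting both steps together, the following sketch (reusing the didactic build function of Sect. 3.3 rather than the hybrid-\(k\)/DAC configuration just described, and assuming the matrix side n is already padded to a power of \(k\)) shows how the ID-encoded triples are partitioned and indexed:

    # Vertical partitioning on k^2-triples: one (S,O) matrix, and hence
    # one k^2-tree, per predicate; the predicate is kept implicit.
    from collections import defaultdict

    def k2_triples_index(id_triples, n, k=2):
        parts = defaultdict(list)
        for s, p, o in id_triples:
            parts[p].append((s, o))
        trees = {}
        for p, pairs in parts.items():
            M = [[0] * n for _ in range(n)]
            for s, o in pairs:
                M[s - 1][o - 1] = 1            # IDs are 1-based
            trees[p] = build(M, n, k)          # build() from the Sect. 3.3 sketch
        return trees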

Example

Figure 5 (left) shows the vertical partitioning of our excerpt of RDF. As can be seen, six different subsets are obtained (one for each different predicate), and the triples are rewritten as pairs (S,O) within them. For example, the triples for the predicate 5: (3,5,3), (4,5,4), (5,5,4), are rewritten as pairs: (3,3), (4,4), (5,4), and they are then managed within the subset P5 associated with the predicate 5.

The right side of the figure shows the adjacency matrix underlying the \(\hbox {k}^2\)-tree used for representing subset P5. The shaded cells form the SO area, in which elements playing both as subjects and objects are represented. Note that the matrix is adjusted to the next power of \(k\) for simplicity. In this case, 1-bits are found in the cells (3,3), (4,4), (5,4), in which the triples (3,5,3), (4,5,4), (5,5,4) are represented. As can be seen, the resulting matrix contains a very sparse distribution of 1-bits, and this is the best scenario for \(\hbox {k}^2\)-trees because of their ability to compress large empty areas.

4.3 Indexing predicates for subjects (SP) and objects (OP)

The solution of queries involving variable predicates is the main weakness of systems that implement vertical partitioning. In our case, all \(\hbox {k}^2\)-trees must be traversed for solving triple patterns with unbounded predicate (see next section). This is hardly scalable when a large number of predicates is used in the dataset. In this section, we enhance \(\hbox {k}^2\)-triples to minimize the number of \(\hbox {k}^2\)-trees that are traversed for solving triple patterns involving variable predicates.

The triple pattern classification, given in Sect. 2, shows that (?S,?P,?O) is the only pattern with unbounded predicate that provides neither value for the subject nor the object. However, this pattern always scans the full dataset to retrieve all the triples within it, so no specific optimizations are possible for it. Thus, we can leverage that subject and/or object values are provided in interesting queries, and use them to optimize the number of \(\hbox {k}^2\)-trees that must be traversed for solving the remaining patterns with unbounded predicate. This is achieved through two specific indexes:

  • The subject–predicate (sp) index stores all the different predicates related to each subject in the dataset.

  • The object–predicate (op) index stores all the different predicates related to each object in the dataset.

Empirical results [18] show that the average size of these lists of predicates for subjects and objects is, at most, one order of magnitude less than the number of total predicates used in real-world datasets. Thus, the great improvement for queries with unbounded predicate comes at the cost of very limited additional space for the sp and op indexes.

Both sp and op indexes rely on a compact representation of their predicate lists. The predicate list maintains, for a given subject, all the different predicates related to it. Some subjects share the same predicate list, and this can be leveraged to reduce space. For this purpose, we obtain the set of all the different predicate lists (which we call the predicate list vocabulary) and sort them according to their frequency. In this way, the predicate lists shared by more subjects are encoded with shorter codewords. A similar procedure is applied on the objects. We refer to \(\mathbf{V}_{sp}\) and \(\mathbf{V}_{op}\), respectively, as the predicate list vocabularies for subjects and objects.

Example

Arrow 1 in Fig. 6 shows the predicate lists obtained for the subjects and objects in the RDF excerpt used in the previous examples. As can be seen, the subject 1 is related to a 1-element list containing the predicate 2; the list for the subject 2 contains the element 6; for the subject 3, its predicate list contains four elements: 1,3,4, and 5; the subjects 4 and 5 are related to 2-element predicate lists containing the elements 4,5. Arrow 2 shows how predicate list vocabularies are obtained. Consider the case of subjects: The list 4,5 is represented in the first position because it is related to two different subjects (4 and 5), whereas the other lists are only related to a single subject. The case of objects is similar: The list 5 is related to the objects 3 and 4, and the remaining lists appear only once. The lists placed first will thus receive shorter codewords.

We propose a succinct vocabulary representation based on two structures:

  • An integer sequence \(\mathcal {S}\) that concatenates all the predicate lists according to their frequency. Thus, the most frequent lists appear at the beginning of the sequence, whereas the least frequent ones are at the end. Each element in \(\mathcal {S}\) takes \(\log (\vert \mathtt {P} \vert )\) bits of space.

  • A bitstring \({\mathcal {B}}\) that delimits and identifies predicate lists within the vocabulary. That is, if the \(i\)-th 1-bit of \({\mathcal {B}}\) is at position \(p\), then the predicate list identified as \(i\) finishes at the \(p\)-th position of \(\mathcal {S}\).

\({\mathcal {B}}\) enables efficient list extraction from the vocabulary, because the \(p\)-th predicate list is stored in \(\mathcal {S}[i,j]\), where \(i=\mathtt {select}_1({\mathcal {B}},p-1)+1\) and \(j=\mathtt {select}_1({\mathcal {B}},p)\) (assuming \(\mathtt {select}_1({\mathcal {B}}, 0)=0\)).

Fig. 6 Building sp and op indexes

This representation allows the sp and op indexes to be easily modeled as integer sequences. We detail sp; the same representation applies to op. The index sp is modeled as a sequence of integer IDs, one per subject in the dataset. In this way, the \(p\)-th value in sp (referred to as \(\textsc {sp}[p]\)) contains the ID of the predicate list related to the subject \(p\), and it can be extracted from V\(_{sp}\) by using the simple extraction process explained above. The elements of index sp are finally represented using DACs. This retains direct access to any position in the index and also leverages the frequency distribution of predicate lists to achieve compression. Note that DACs assign shorter codewords to smaller integers, and these are used for representing the most frequent lists within the vocabulary.
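
The following Python sketch reproduces this extraction mechanism over the subject-side data of Fig. 6; select is computed naively, and the plain sp array stands for its DAC-compressed counterpart.

    # Predicate-list extraction from the vocabulary (S, B) plus an sp
    # lookup; positions are 1-based as in the text.
    def select1(B, i):
        if i == 0:
            return 0                           # convention select1(B,0)=0
        count = 0
        for pos, b in enumerate(B, start=1):
            count += b
            if count == i:
                return pos
        raise IndexError("fewer than i 1-bits")

    def predicate_list(S, B, p):               # p-th list in the vocabulary
        i = select1(B, p - 1) + 1
        j = select1(B, p)
        return S[i - 1:j]                      # S is 0-based in Python

    def predicates_of_subject(sp, S, B, s):
        return predicate_list(S, B, sp[s - 1])

    # Subject side of Fig. 6: lists {4,5}, {2}, {6}, {1,3,4,5}.
    S = [4, 5, 2, 6, 1, 3, 4, 5]
    B = [0, 1, 1, 1, 0, 0, 0, 1]
    sp = [2, 3, 4, 1, 1]
    assert predicates_of_subject(sp, S, B, 3) == [1, 3, 4, 5]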

Example

Arrow 3, in Fig. 6, illustrates the final sp and op index configurations. As can be seen, sp lists the IDs [2,3,4,1,1]. This means that the first subject is related to the second predicate list, the second subject to the third list, and so on. For instance, if we want to extract the list of predicates related to the subject 3, we first retrieve its ID, \(\textsc {sp}[3]=4\). Thus, the fourth list must be extracted from V\(_{sp}\). This is represented in \(\mathcal {S}\) from the position \(i=\mathtt {select}_1({\mathcal {B}},3)+1=4+1=5\) to the position \(j=\mathtt {select}_1({\mathcal {B}},4)=8\) and contains the predicates 1,3,4, and 5.

5 Triple pattern solution over \(\hbox {k}^2\)-triples

Triple patterns are the basic lookup unit on RDF triples; more complex SPARQL queries can be translated into plans involving triple patterns. Thus, RDF retrieval strongly relies on the performance achieved for solving triple patterns. This is one of the main strengths of our approach, because \(\hbox {k}^2\)-triples answers all patterns using the highly optimized operations provided by the \(\hbox {k}^2\)-tree structure (a dispatch sketch is given after the list):

  • (S,P,O) is directly implemented on the operation that checks the value of a given cell in the \(\hbox {k}^2\)-tree. That is, the triple (S,P,O) is in the dataset iff the cell (S,O) (in the matrix representing the subset of triples associated with the predicate P) contains the bit \(1\). This operation returns a boolean value, and it is usually required within ASK queries.

  • (S,?P,O) generalizes the previous pattern by checking the value of the cell (S,O) in all the \(\hbox {k}^2\)-trees. The result is an ID-sorted list of all the predicates whose \(\hbox {k}^2\)-tree contains a 1 in this cell. The process can be sped up by first intersecting the predicate lists of sp and op, respectively, associated with S and O, obtaining a list of predicates \(\mathtt {P}_i\) that contain objects related to S as well as subjects related to O. Then, only the \(\hbox {k}^2\)-trees of those \(\mathtt {P}_i\) need be considered for pairs (S,O).

  • (S,P,?O) can be seen as a forward navigation from S to all the objects related to it through predicate P. Thus, it is equivalent to a direct neighbors retrieval that locates all the columns with value 1 in the row associated with the subject S within the \(\hbox {k}^2\)-tree for P. The objects matching the pattern are returned in sorted order.

  • (S,?P,?O) generalizes the previous pattern by performing direct neighbor retrieval in all the \(\hbox {k}^2\)-trees. In this case, the result comprises many ID-sorted lists of objects for the predicates related to S. This is sped up by using the information stored in the sp index. A subject-based query on this sp index returns the predicate list containing all predicates P \(_i\) related to the subject S. Thus, direct neighbors retrieval is only performed on the \(\vert \mathtt{P}_i \vert \) \(\hbox {k}^2\)-trees modeling the predicates within the list.

  • (?S,P,O) corresponds to a backwards navigation from O to all the subjects related to it through P. This is equivalent to a reverse neighbors retrieval that locates all the rows with value 1 in the columns associated with the object O within the \(\hbox {k}^2\)-tree for P. The subjects matching the pattern are returned in sorted order.

  • (?S,?P,O) generalizes the previous pattern by performing reverse neighbors retrieval in all the \(\hbox {k}^2\)-trees. In this case, the result comprises many ID-sorted lists of subjects for the predicates related to O. An object-based query on the op index speeds up the query by restricting the predicate list to those P \(_\mathtt{j}\) through which some subject relates to the object O.

  • (?S,P,?O) is equivalent to retrieving all the values 1 in the \(\hbox {k}^2\)-tree associated with the predicate P. This is easily implemented by a range query performing a full top-down traversal that retrieves all the pairs (S,O) in the structure.

  • (?S,?P,?O) generalizes the previous pattern by obtaining all the 1s in all the \(\hbox {k}^2\)-trees used for representing the predicates in the dataset.
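
The dispatch announced before the list can be sketched as follows; the per-predicate tree interface (cell, direct, reverse, range) and the sp_preds/op_preds lookups are hypothetical names of ours standing for the \(\hbox {k}^2\)-tree operations of Sect. 3.3 and the indexes of Sect. 4.3.

    # Triple pattern dispatch over k^2-triples (None encodes a variable).
    def solve(pattern, trees, sp_preds, op_preds):
        s, p, o = pattern
        if p is not None:
            if s is not None and o is not None:
                return trees[p].cell(s, o)         # (S,P,O): boolean check
            if s is not None:
                return trees[p].direct(s)          # (S,P,?O)
            if o is not None:
                return trees[p].reverse(o)         # (?S,P,O)
            return trees[p].range()                # (?S,P,?O)
        # variable predicate: sp/op restrict the trees to traverse
        if s is not None and o is not None:        # (S,?P,O)
            cands = set(sp_preds(s)) & set(op_preds(o))
            return [q for q in sorted(cands) if trees[q].cell(s, o)]
        if s is not None:                          # (S,?P,?O)
            return {q: trees[q].direct(s) for q in sp_preds(s)}
        if o is not None:                          # (?S,?P,O)
            return {q: trees[q].reverse(o) for q in op_preds(o)}
        return {q: t.range() for q, t in trees.items()}  # (?S,?P,?O)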

6 Join solution over \(\hbox {k}^2\)-triples

The core of SPARQL relies on the concept of Basic Graph Pattern (BGP) and on its semantics for building conjunctive expressions by joining triple patterns through shared variables. BGPs are reordered and partitioned into pairs of triple patterns sharing exactly one variable. Thus, the performance of BGPs is clearly determined by the algorithms available for joining these patterns, and also by the optimization strategies used to order the joins. This second topic is orthogonal to this paper and is not addressed here.

\(\hbox {k}^2\)-triples provides solutions for subject–subject and object–object joins. This improves on traditional vertical partitioning, which only gives direct support for subject–subject joins and requires an additional object index for efficient solution of object–object joins. Besides, \(\hbox {k}^2\)-triples also supports subject–object joins. These are efficiently implemented by considering only the common area SO, in which the nodes playing both as subjects and objects are exclusively represented. Our native support for these cross-joins is a significant improvement with respect to traditional vertical partitioning, a framework in which cross-joins are described as rather expensive and inefficient operations [3]. This fact is a clear weakness of those traditional solutions because cross-joins are the basis for implementing common path expression queries. Abadi et al. [3] tackle the path expression query problem in a non-general way: The results of only some selected paths are precalculated and stored for their efficient querying. Finally, it is worth noting that operations involving predicates as join variables are seldom used in practice [5].

This section describes the algorithms and mechanisms implemented on \(\hbox {k}^2\)-triples for solving joins. We first classify join operations and then detail the join algorithms implemented by our approach. The section ends with the description of the specific mechanisms provided by \(\hbox {k}^2\)-triples for solving each kind of join operation according to our different join algorithms.

6.1 Classifying join operations

Figure 7 categorizes the join operations according to the classes studied in this section. Although all of them refer to subject–object joins, subject–subject and object–object joins are classified similarly and solved along the same lines. We refer to ?X as the join variable in each class.

Fig. 7 Classification of subject–object joins supported in \(\hbox {k}^2\)-triples

Join operations are organized, by rows, according to the state of the predicates in the two patterns involved in the join:

  • Row no variable predicates lists the joins in which both triple patterns provide their predicates (classes A, B, and C).

  • Row one variable predicate lists the joins in which one triple pattern provides its predicate, whereas the other one leaves it variable (classes D, E, and F).

  • Row two variable predicates lists the joins in which both triple patterns leave as variables their corresponding predicates (classes G and H).

The column-based classification lists join operations according to the state of the nodes in the triple patterns. If we consider that the join variable is represented in two of these nodes, the remaining two determine the classes:

  • Column no variable subject/object lists the joins in which the values of the two non-joined nodes are provided (classes A, D, and G).

  • Column one variable subject/object lists the joins in which one triple pattern provides its non-joined node, whereas the other one leaves it variable (classes B, E, and H). From this perspective, the class E is split into two subclasses: in E.1, one pattern provides its predicate but leaves its node variable, whereas the other pattern provides the node but leaves its predicate variable; in E.2, one pattern is full of variables (it provides neither the node nor the predicate), whereas the other one provides both the node and the predicate.

  • Column two variable subject/object lists the joins in which both triple patterns leave their non-joined nodes as variables (classes C and F).

We note that a possible ninth class, I, is not studied because joins full of variables, (?S,?P \(_\mathtt{1}\),?X) (?X,?P \(_2\),?O), are not used in practice.

6.2 Join algorithms

Join algorithms have been widely studied for relational databases [37] and have recently been reviewed from the perspective of semantic Web databases [21]. We gather this experience to propose three join algorithms optimized to perform over \(\hbox {k}^2\)-triples. We will use a simple notation where \(T_l\) and \(T_r\) refer to the left and right triple patterns, respectively, involved in the join.

Chain evaluation This algorithm relies on the foundations of the traditional index join: It first retrieves all the solutions for \(T_l\), and then, each one is used for obtaining all the solutions for \(T_r\). Our implementation first solves the less expensive pattern (assume it is \(T_l\)) and gathers all the values X \(_\mathtt{i}\) obtained for the join variable ?X. All these values are then used for replacement in \(T_r\). Note that some of these values can be duplicated, and duplicates must be removed before the replacement; they may be due to range queries or to multiple direct/reverse neighbors. We implement an adaptive sort [29] algorithm that merges the results obtained for each predicate, leveraging that these are returned in sorted order.
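
A minimal sketch of this strategy follows; left_x() and right_given_x(x) are hypothetical wrappers of ours over the retrieval operations of Sect. 5, standing, respectively, for the merged, sorted ?X bindings of the cheaper pattern and for the solution of the other pattern with ?X replaced by x.

    # Chain (index-join) evaluation: duplicates of ?X arrive adjacent in
    # the sorted stream and are skipped before probing the other side.
    def chain_join(left_x, right_given_x):
        results, last = [], None
        for x in left_x():
            if x == last:                      # duplicate binding of ?X
                continue
            last = x
            results.extend((x, m) for m in right_given_x(x))
        return results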

Independent evaluation This algorithm implements the well-known merge join: It first solves both triple patterns and then intersects their respective solutions, which come sorted by the join attribute.
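
A sketch of this merge join on two ?X-sorted solution lists (each solution assumed here to be a pair (x, rest)) could look as follows:

    # Independent (merge-join) evaluation: intersect two solution lists
    # sorted by the join variable, pairing up all solutions per value.
    def merge_join(left, right):
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i][0] < right[j][0]:
                i += 1
            elif left[i][0] > right[j][0]:
                j += 1
            else:
                x = left[i][0]
                while i < len(left) and left[i][0] == x:
                    j2 = j
                    while j2 < len(right) and right[j2][0] == x:
                        out.append((x, left[i][1], right[j2][1]))
                        j2 += 1
                    i += 1
                while j < len(right) and right[j][0] == x:
                    j += 1
        return out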

Interactive evaluation This algorithm is strongly inspired by the Sideways Information Passing (SIP) mechanism [35]. SIP passes information on-the-fly between the operands involved in the query, so that the processing performed in one of them can feed the other and vice versa. Thus, both triple patterns within the join are interactively evaluated and solved without materialization of intermediate results. This interactive evaluation is easily implemented in \(\hbox {k}^2\)-triples by means of a coordinated step-by-step traversal performed on those \(\hbox {k}^2\)-trees involved in the solution of each pattern within the join. In the next example, only two \(\hbox {k}^2\)-trees are involved, the join attribute is the subject in both trees, and the predicates and objects are fixed, but all the other combinations can be handled similarly.

Fig. 8 Example of interactive evaluation

Example

Figure 8 illustrates how \(\hbox {k}^2\)-triples implements the interactive evaluation of the join query shown in Fig. 2b. The original SPARQL query, (?X, playFor, Spanish Team) (?X, position, midfielder), is rewritten as (?X,4,2) (?X,5,4) by performing the ID-based replacement of each term. Thus, the join must be carried out on the \(\hbox {k}^2\)-trees that, respectively, model the predicates 4 and 5. Both \(\hbox {k}^2\)-trees are represented in Fig. 8a. Columns 2 and 4 are highlighted for the predicates 4 and 5, respectively, since those are the columns we have to join. We consider \(k=2\); the join is implemented as follows:

  1. (a)

    The two matrices \({\mathcal {M}}_4\) and \({\mathcal {M}}_5\) are queried. They are divided into \(k^2=4\) submatrices (Fig. 8a). The right submatrices of both \({\mathcal {M}}_4\) and \({\mathcal {M}}_5\) are discarded because they do not overlap with the columns involved in the current query. The two pairs of left submatrices have value 1, so both may contain results. Thus, we recursively consider the top-left and the bottom-left submatrices of \({\mathcal {M}}_4\) and \({\mathcal {M}}_5\). Note that we could have had to make more than one recursive call per submatrix, had we obtained more than one relevant top or bottom cell in \({\mathcal {M}}_4\) and \({\mathcal {M}}_5\) (not in this case, where the columns are specified).

  2. (b)

    In the top-left submatrices (Fig. 8b), we discard in turn the right subsubmatrices in \({\mathcal {M}}_4\) and the left subsubmatrices in \({\mathcal {M}}_5\), because they do not intersect the query column. Further, both top subsubmatrices are 0, so we need consider only, recursively, the bottom-left subsubmatrix of \({\mathcal {M}}_4\) paired with the bottom-right subsubmatrix of \({\mathcal {M}}_5\). Similarly, the top-left and top-right subsubmatrices of \({\mathcal {M}}_4\) and \({\mathcal {M}}_5\) are recursively considered on the bottom-left submatrices.

  3. (c)

    The last recursion level (Fig. 8c) compares leaves in \({\mathcal {M}}_4\) and \({\mathcal {M}}_5\). As in the previous step, we discard the cells that do not overlap with the query columns and intersect the remaining ones. Three cells are possible results in \({\mathcal {M}}_4\): (3,2), (4,2), and (5,2), so only their corresponding counterparts must be evaluated in \({\mathcal {M}}_5\). Whereas the cell (3,4) has value 0, the other two, (4,4) and (5,4), contain 1-bits. Thus, the subjects 4 and 5 form the final query result: Iniesta (4) and Xavi (5).
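
The synchronized descent of the example can be sketched as follows. For readability, the sketch works over the conceptual matrices M1 and M2 rather than the compressed representation; the actual algorithm performs the same coordinated pruning directly on the T/L bitmaps (checking node bits instead of scanning cells), so discarded areas are never materialized.

    # Interactive evaluation of (?X,P1,O1)(?X,P2,O2): a row stripe is
    # explored only while BOTH sides may still produce a 1 in it.
    def interactive_join(M1, o1, M2, o2, n, k):
        results = []
        def nonempty(M, r, size, col):         # plays the role of checking
            return any(M[x][col]               # a node bit in the k^2-tree
                       for x in range(r, r + size))
        def step(r, size):
            if size == 1:
                if M1[r][o1] and M2[r][o2]:    # both cells set: r joins
                    results.append(r)
                return
            sub = size // k
            for i in range(k):
                r2 = r + i * sub
                if nonempty(M1, r2, sub, o1) and nonempty(M2, r2, sub, o2):
                    step(r2, sub)
        step(0, n)
        return results

On the matrices of Fig. 8 (0-based indices, so columns 2 and 4 become o1=1 and o2=3), the returned rows 3 and 4 correspond to the subjects 4 (Iniesta) and 5 (Xavi).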

6.3 Implementing joins over \(\hbox {k}^2\)-triples

This section details how \(\hbox {k}^2\)-triples uses the proposed algorithms for solving all the join operations classified in Fig. 7. We will refer to \(\hbox {T}_l\) and \(\hbox {T}_r\) as the first and second patterns involved in each class of join. In general, interactive evaluation can be used uniformly in all the cases, whereas chain and independent evaluation can also be used with different strategies depending on the type of join. As a general rule of thumb, chain evaluation is preferable over independent evaluation when the outcome of one side of the join is expected to be much smaller than the other. Interactive evaluation, instead, adapts automatically to perform in the best way in each case. Finally, we note that we will use the sp and op indexes whenever possible to restrict the set of predicates to consider when the predicate is variable (we will point this out when their usage is less obvious).

Joins with no variable predicates As explained, in these classes of joins, both triple patterns provide their predicates. This ensures high performance for interactive evaluation because only two parallel \(\hbox {k}^2\)-tree traversals must be performed. Chain and independent evaluation are also possible, depending on the number of variable nodes involved in each class:

  • Joins A. This is the simplest class because only the join variable is not provided. Chain evaluation is advantageous when one operand has much fewer results than the other. Otherwise, independent evaluation is better, as it leverages the fact that both patterns return their results in sorted order.

  • Joins B. This class leaves as variable a non-joined node. The subject node of \(\hbox {T}_l\) is variable in the example: (?S,P \(_\mathtt{1}\),?X) (?X,P \(_\mathtt{2}\),O). Chain evaluation is well-suited for this class because it first solves \(\hbox {T}_r\), obtains all the values X \(_\mathtt{i}\) for ?X, and finally replaces them in \(\hbox {T}_l\) (a sketch of this strategy is given after this list). In this way, \(\hbox {T}_l\) is transformed into a group of patterns in which ?X is replaced by each X \(_\mathtt{i}\) retrieved from \(\hbox {T}_r\). The final result comprises the union of all the results retrieved for the group of patterns obtained from \(\hbox {T}_l\). Nevertheless, independent evaluation also applies to this class. On the one hand, \(\hbox {T}_r\) is solved through a reverse neighbors query, which returns its results in order. On the other hand, a range query returns all results for \(\hbox {T}_l\), which must then be sorted by X. The results of both operations are finally intersected, producing the join result set.

  • Joins C. Both patterns in the join leave as variables their non-joined nodes. Chain evaluation first solves the pattern containing the less frequent predicate (i.e., the one containing fewer (S,O) pairs), extracts all its pairs, and their distinct X \(_i\) components are then replaced in the other pattern (note that duplicates must be removed from the X \(_i\) list before the replacement). Then all the objects found in \(\hbox {T}_r\) for each X \(_i\) are matched with all the subjects found in \(\hbox {T}_l\) for the same X \(_i\). Alternatively, independent evaluation generates all the pairs from both operands, sorted by the ?X component in each case, and intersects the sorted lists.
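To make the chain strategy concrete, the sketch below instantiates it for the Joins B example, (?S,P \(_\mathtt{1}\),?X) (?X,P \(_\mathtt{2}\),O). For illustration only, the two predicates are rendered as plain Boolean matrices and reverse_neighbors() stands in for the corresponding \(\hbox {k}^2\)-tree column query; in \(\hbox {k}^2\)-triples, the same steps run over the compressed trees.

#include <stdio.h>

#define N 8    /* toy matrix side, for illustration only */

/* Plain 0/1 matrices standing in for the k^2-trees of P1 and P2 */
static int M1[N][N], M2[N][N];

/* Reverse neighbors: all rows holding a 1 in column 'col'; in
   k^2-triples, this is a single k^2-tree column traversal. */
static int reverse_neighbors(int M[N][N], int col, int *out)
{
    int n = 0;
    for (int r = 0; r < N; r++)
        if (M[r][col])
            out[n++] = r;
    return n;
}

/* Chain evaluation of (?S,P1,?X)(?X,P2,O): solve T_r first, then
   replace each binding X_i of ?X into T_l. */
static void chain_join_B(int o)
{
    int x[N], s[N];
    int nx = reverse_neighbors(M2, o, x);         /* bindings X_i for ?X */
    for (int i = 0; i < nx; i++) {
        int ns = reverse_neighbors(M1, x[i], s);  /* solve (?S,P1,X_i)   */
        for (int j = 0; j < ns; j++)
            printf("?S=%d ?X=%d\n", s[j], x[i]);  /* one join answer     */
    }
}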

Joins with one variable predicate These classes comprise a triple pattern providing its predicate, and another that leaves it variable. In this case, interactive evaluation traverses, in parallel, the \(\hbox {k}^2\)-tree associated with the given predicate, and the \(preds\) different \(\hbox {k}^2\)-trees involved in the solution of the other triple pattern. In each recursive step, only a subset of the \(preds\) \(\hbox {k}^2\)-trees stays active for the corresponding submatrix (this pruning step is sketched right after the following list). Chain and independent evaluation strategies are also possible depending on the number of variable nodes involved in each operand:

  • Joins D. This class, like Joins A, provides the two non-joined nodes but includes a variable predicate (say, that of \(\hbox {T}_r\)). In this case, chain evaluation first solves \(\hbox {T}_l\), obtains all the values X \(_i\) for ?X, and finally replaces them in \(\hbox {T}_r\) for its solution, which becomes a set of access to single cell queries. Independent evaluation is also practical. First, \(\hbox {T}_l\) is efficiently solved with a direct neighbors query and its results are retrieved in order. Second, \(\hbox {T}_r\) performs \(preds\) reverse neighbors queries to obtain the result set for (?X,?P,O), which must be adaptively sorted (for grouping the X \(_j\) values) before the final intersection. Note that not only can the op index be used to restrict the predicates of \(\hbox {T}_r\) to those related to O, but we can also restrict them, using sp, to those related to the union of all the X \(_i\) values.

  • Joins E. This class is split into two subclasses according to the pattern that contains the unbounded predicate and the variable non-joined node.

    • E.1. Chain evaluation can choose between two strategies, depending on which one starts with the smaller set of candidates. On the one hand, \(\hbox {T}_r\), which contains the unbounded predicate and provides the non-variable node, can be solved first and its results adaptively sorted to remove duplicates. These results X \(_j\) for ?X are then replaced in \(\hbox {T}_l\) for its final solution using reverse neighbors queries. On the other hand, we could collect all the (S,X \(_i\)) pairs from P \(_1\), remove duplicates in X \(_i\), and run access to single cell queries on all the qualifying \(\hbox {k}^2\)-trees for \(\hbox {T}_r\). Independent evaluation is also possible, much as done for Join D operations.

    • E.2. Chain evaluation first solves \(\hbox {T}_r\), and all its bindings X \(_j\) for ?X are then used for replacement in \(\hbox {T}_l\) (which is full of variables in this case) using \(preds\) reverse neighbors queries.

  • Joins F. In this class, \(\hbox {T}_l\) only provides the predicate, and \(\hbox {T}_r\) is full of variables. Chain evaluation first solves \(\hbox {T}_l\) and filters out duplicate X \(_i\) values, which are finally used for solving \(\hbox {T}_r\). This last step is restricted using the index sp.
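The pruning step mentioned before the list can be sketched as follows, again over a hypothetical pointer-based node type used only for illustration (the real structure is a compact bitmap): at each level of the parallel traversal, only the trees whose current submatrix contains some 1-bit survive into the next recursive call.

#include <stddef.h>

#define K 2

/* Hypothetical pointer-based k^2-tree node, for illustration only */
typedef struct node {
    int bit;                   /* 1 iff the submatrix contains some 1-bit */
    struct node *child[K * K]; /* children in row-major order (NULL if 0) */
} node;

/* One step of interactive evaluation with a variable predicate: from
   the k^2-trees still active at the current submatrix, keep only those
   whose child (i,j) is nonempty. Returns the size of the new active
   list, written to 'out'; a 0-bit drops the tree from this branch. */
static size_t filter_active(const node *const *active, size_t n,
                            int i, int j, const node **out)
{
    size_t kept = 0;
    for (size_t t = 0; t < n; t++) {
        const node *c = active[t]->child[i * K + j];
        if (c != NULL && c->bit)
            out[kept++] = c;
    }
    return kept;
}

The traversal recurses on a child submatrix only while its active list remains nonempty, which is what keeps the parallel traversal over \(preds\) trees affordable in practice.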

Joins with two variable predicates The triple patterns in these classes leave their predicates as variables. This means that interactive evaluation traverses in parallel all the different \(\hbox {k}^2\)-trees involved in the solution of each pattern. Chain and independent evaluation can proceed as follows.

  • Joins G. This class provides the non-joined nodes and leaves the predicates as variables. Chain evaluation first solves \(\hbox {T}_l\), cleans its bindings for ?X of duplicates, and finally replaces them in \(\hbox {T}_r\) for its solution. Independent evaluation is also suitable: It retrieves the results for each pattern sorted by their ?X component and then intersects the sorted lists (this merge-based intersection is sketched after the list).

  • Joins H. In this case, \(\hbox {T}_l\) is full of variables and \(\hbox {T}_r\) binds the non-joined node. Chain evaluation first solves \(\hbox {T}_r\), and its results, cleaned of duplicates, are used for solving \(\hbox {T}_l\).
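The final intersection used by independent evaluation (e.g., in Joins A, C, D, and G) is a plain merge of two sorted lists of ?X bindings. A minimal sketch, assuming both lists are already sorted and duplicate-free:

#include <stddef.h>

/* Merge-based intersection of two sorted, duplicate-free binding
   lists; each match is a value of ?X present in both join operands. */
static size_t intersect_sorted(const int *a, size_t na,
                               const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else {
            out[n++] = a[i];          /* common binding for ?X */
            i++;
            j++;
        }
    }
    return n;                         /* number of join results */
}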

Table 1 Summary of solution of joins in \(\hbox {k}^2\)-triples (\(^*\) means that removing duplicates is required for solving joins)

Table 1 summarizes all presented choices for each class of join. The first column indicates the class of join and the second illustrates a representative of the corresponding class. Column chain evaluation describes how this join strategy is carried out, that is, \(\hbox {T}_l \rightarrow \hbox {T}_r\) means that \(\hbox {T}_l\) is first executed and its results are used for solving \(\hbox {T}_r\), and vice versa. Column independent evaluation indicates the classes where this strategy can be efficiently used. Finally, column interactive evaluation indicates the \(\hbox {k}^2\)-tree operations interactively performed for solving each triple pattern in the join. We indicate with “\(\times preds\)” the cases where interactive operations involve unbounded predicates.

7 Experimentation

This section studies the performance of \(\hbox {k}^2\)-triples on a heterogeneous experimental setup comprising real-world RDF datasets from different areas of knowledge. We study both compression effectiveness and querying performance and compare the results against a representative set of techniques from the state of the art.

7.1 Experimental setup

We run experiments on an AMD-Phenom™-II X4 955@3.2 GHz, quad-core (4 cores–4 siblings: 1 thread per core), 8GB DDR2@800MHz, running Ubuntu 9.10. We built two prototypes:

  • \(\hbox {k}^2\)-triples, the vertical partitioning on \(\hbox {k}^2\)-trees without the sp and op indexes.

  • \(\hbox {k}^2\)-triples \(^+\), which enhances the basic vertical partitioning model with the indexes sp and op.

Both prototypes were developed in C and compiled using gcc (version 4.4.1) with optimization -O9.

RDF stores We compare our results with respect to three representative techniques in the state of the art (Sect. 2):

  • A vertical partitioning solution following the approach of [3]. We implement it over MonetDB (MonetDB Database Server v1.6, Jul2012-SP2) because it achieves better performance than the original C-Store based solution [44].

  • A memory-based system implemented over Hexastore.Footnote 7

  • A highly efficient store: RDF3X,Footnote 8 which was recently reported as the fastest RDF store [26].

All these techniques were tested following the configurations and parameterization provided in their original sources.

RDF datasets We design a heterogeneous RDF testbed to evaluate \(\hbox {k}^2\)-triples over different data distributions, showing that our approach is competitive in a general scenario. We choose four datasets based on their number of triples, topic coverage, availability and, if possible, previous use in benchmarking:

  • jamendo Footnote 9 is a repository of Creative Commons licensed music.

  • dblp Footnote 10 provides information on Computer Science journals and proceedings.

  • geonames Footnote 11 is a geographical database covering all countries and containing a large number of place names.

  • dbpedia Footnote 12 is the semantic evolution of Wikipedia, an encyclopedic dataset. dbpedia is considered the “nucleus for a Web of Data” [7].

Table 2 Statistical dataset description

The main statistics of these datasets are given in Table 2. Note that a preprocessing phase is applied to all datasets. First, we represent all the datasets in N-Triples [20], one of the most basic formats, containing one statement per line. If the original dataset is not in N-Triples, it is converted to this raw format with the Any23 toolFootnote 13 (version: any23-0.6.1). Finally, the dataset file is lexicographically sorted and duplicate triples are discarded. Table 2 shows the size in N-Triples and the number of triples, different predicates, subjects, and objects of each dataset after this preprocessing.

As can be seen, we consider datasets of different sizes; we include jamendo as a small dataset comprising 1 million triples. This allows \((i)\) testing our proposal also at small sizes and \((ii)\) comparing it with other solutions indexing uncompressed data in memory. In turn, we choose large datasets of incremental size to show the overall scalability of the proposal, managing from 46 million triples in dblp up to more than 232 million triples in dbpedia.

As stated, \(\hbox {k}^2\)-triples represents subjects and objects as rows and columns of the underlying matrix. We show the number of different subjects and objects in the last columns of Table 2. As one could expect, the number of different subjects is significantly lower than the number of objects; subjects describe the resources in RDF, and they usually appear in multiple triples, whereas objects hold the values of the descriptions, which may be unique within the whole dataset (e.g., a concrete timestamp, a textual description, or an ID field).

Table 2 also reflects the fact that the number of predicates is generally low, including three datasets with 26, 27, and 28 different predicates. However, since queries with unbounded predicate are poorly solved using traditional solutions based on vertical partitioning, we choose dbpedia as an extreme case in which the number of predicates grows to the order of thousands due to the variability of the represented information. This allows us to analyze how \(\hbox {k}^2\)-triples performs when the number of predicates increases. This is the worst case for queries with unbounded predicate, on which VP-based solutions lack scalability.

Queries We design experiments focused on demonstrating the retrieval ability of all the RDF stores included in our setup. First, we run triple pattern queries to analyze basic lookup performance. These results feed the join experiments which, in turn, predict the core performance of BGP solution in SPARQL.

We design a testbedFootnote 14 of randomly generated queries covering the entire spectrum of triple patterns and joins. For each dataset, we consider 500 random triple patterns of each type. Note that in all datasets, except for dbpedia, the triple pattern (?S,P,?O) is limited by the number of different predicates.

Join tests are generated by following the aforementioned classification (A–H, as shown in Fig. 7 for subject–object joins), and for each class, we obtain specific subject–object (SO), subject–subject (SS), and object–object (OO) joins. We generate 500 random queries of each join and perform a big-small classification according to the number of intermediate results: For each join, we take the product of the number of results for the first triple pattern and the number of results for the second one. Given the mean of this product, we randomly choose 25 queries with a number of intermediate results over the mean (joins big) and another 25 queries with fewer results than the mean (joins small).
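This labeling can be sketched as follows; the field names are illustrative and not taken from the actual testbed generator.

#include <stddef.h>

/* One candidate join query of the testbed */
typedef struct {
    long res_left, res_right; /* result counts of the two triple patterns */
    double score;             /* intermediate-result estimate (product)   */
    int big;                  /* 1 = "joins big", 0 = "joins small"       */
} join_query;

/* Scores every query by the product of its pattern result counts and
   labels it against the mean score over the whole set. */
static void classify_big_small(join_query *q, size_t n)
{
    double mean = 0.0;
    for (size_t i = 0; i < n; i++) {
        q[i].score = (double)q[i].res_left * (double)q[i].res_right;
        mean += q[i].score;
    }
    mean /= (double)n;
    for (size_t i = 0; i < n; i++)
        q[i].big = q[i].score > mean;
}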

We design two evaluation scenarios to analyze how I/O transactions penalize the on-disk RDF stores included in our setup. The warm evaluation is designed to let queries benefit from data already available in main memory; it is implemented by taking the mean solution time of six consecutive repetitions of each query. On the other hand, the cold evaluation illustrates a real scenario in which queries are independently performed. All the reported times were averaged over five independent executions, measuring elapsed time.
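The warm measurement, for instance, can be sketched as follows; run_query() is a hypothetical stand-in for each store's query entry point.

#include <time.h>

#define WARM_REPS 6    /* consecutive repetitions in the warm scenario */

extern void run_query(const char *q);  /* hypothetical query entry point */

/* Mean elapsed time (in ms) of six consecutive runs of the same query,
   so that later runs find their data already cached in main memory. */
static double warm_time_ms(const char *q)
{
    struct timespec t0, t1;
    double total = 0.0;
    for (int r = 0; r < WARM_REPS; r++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_query(q);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total += (t1.tv_sec - t0.tv_sec) * 1e3
               + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }
    return total / WARM_REPS;
}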

7.2 Compression results

We first focus on compression performance to measure the ability of \(\hbox {k}^2\)-triples to work in severely reduced space. This comparison involves on-disk representations, MonetDB and RDF3X, and memory-based ones, Hexastore and our two \(\hbox {k}^2\)-triples based approaches. For the memory-based ones, we consider the space required for operating the representations in main memory. Table 3 summarizes the space requirements for all stores and datasets in the current setup.

Table 3 Space requirements (all sizes are expressed in MB)

Among previous RDF stores, MonetDB is the most compact one. This is an expected result given the compressibility of column-oriented representations [2]. MonetDB demands roughly 4 times less space than RDF3X for the smallest datasets, matching the theoretically expected difference according to the features of each underlying model. This difference is greater for dbpedia: In this case, MonetDB uses \(\approx 5.4\) times less space than RDF3X. On the other hand, Hexastore yields an oversized representation for jamendo and cannot index the other datasets in our configuration.

Nevertheless, \(\hbox {k}^2\)-triples requires much less space on all the datasets. It sharply outperforms the other systems, taking advantage of its compact data structures. This result can be analyzed from three complementary perspectives:

  • \(\hbox {k}^2\)-triples is more effective than column-oriented compression for vertically partitioned representations. The comparison between our approach and MonetDB shows that \(\hbox {k}^2\)-triples requires several times less space than the column-oriented database. The space used by MonetDB for the largest datasets is 2–5.5 times that of \(\hbox {k}^2\)-triples and 1.5–4.5 times that of \(\hbox {k}^2\)-triples \(^+\). Besides, we also provide indexed access by object within this smaller space.

  • \(\hbox {k}^2\)-triples allows many more triples to be managed in main memory. If we divide the number of triples in jamendo (\(1{,}049{,}644\)) by the space required for their memory-based representation in Hexastore (\(1{,}371.25\) MB), we obtain that it represents roughly \(765\) triples/MB. The same analysis for our approaches shows that \(\hbox {k}^2\)-triples manages almost \(1.5\) million triples/MB, and \(\hbox {k}^2\)-triples \(^+\) represents more than \(800{,}000\) triples/MB. Although this rate strongly depends on the dataset, its lowest values (reported for dbpedia) are \(\approx 200{,}000\) triples/MB. This means that \(\hbox {k}^2\)-triples increases by more than two orders of magnitude the number of triples that can be managed in main memory with respect to Hexastore, thanks to its compression ability.

  • \(\hbox {k}^2\)-triples provides full RDF indexing in a space significantly smaller than that used by systems based on sextuple indexing. This difference also depends on the dataset; for instance, RDF3X uses roughly 8–10 times the space required by our techniques for representing dbpedia.

\(\hbox {K}^2\)-triples achieves compression ratios between 10 and 40 bits per triple (bpt) in the datasets analyzed in this experiment (excluding the tiny dataset jamendo, where a ratio of 5 bpt is obtained). However, meaningful differences are observed depending on the features of the dataset. In dbpedia, we need almost 34 bpt, as opposed to the 12–15 bpt in geonames and dblp. The main reason for this difference seems to be the high number of predicates used in dbpedia (\(39{,}672\)) in contrast to the other datasets. Many of these \(\hbox {k}^2\)-trees are sparsely populated (about 57 % of the predicates contain fewer than 10 edges, and roughly 81 % fewer than 100 edges), yielding poor compression ratios per edge due to the overhead of storing a full \(\hbox {k}^2\)-tree for only a few edges.
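For reference, the bpt ratio is simply the total space divided by the number of triples; reading the sizes of Table 3 in MB, \(\mathrm{bpt} = (8 \times 2^{20} \times \text{size in MB}) / \text{number of triples}\). As an illustrative (hypothetical) instance, a dataset of 100 million triples represented in 250 MB would yield \(\approx 21\) bpt.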

Finally, we focus on the additional space required by \(\hbox {k}^2\)-triples \(^+\) over the original \(\hbox {k}^2\)-triples representation. Leaving aside jamendo, whose size is tiny in comparison with the other datasets, this extra cost ranges from \(\approx \)20 % for dblp to \(\approx \)26.5 % for dbpedia. Thus, the use of the additional sp and op indexes incurs an acceptable space overhead, considering that our representation remains the most compressed one even after adding these new indexes. This space overhead depends on two main factors. One is the number of different subjects and objects in the dataset. An individual entry representing its corresponding predicate list is maintained for each subject and object, so the cost of storing the indexes sp and op is expected to be high for datasets where each element appears in few triples. The other factor is that many different predicates will probably increase the number of different predicate lists. As a result, the size of the predicate list vocabulary grows, resulting in a higher space overhead. This factor can be observed in the indexes sp and op for dbpedia, where the cost per element (subject or object) is about 24 bits, in contrast to the other datasets, where the cost is about 6 bits per element. Nevertheless, as explained below, the sp and op indexes are mainly useful for datasets involving a large number of predicates.

7.3 Query performance

This section focuses on query time performance. We report figures for the most prominent experiments; the remaining ones are given in the appendices.

Triple patterns These experiments measure the capabilities of all the stores for RDF retrieval through triple pattern solution. Triple patterns are the atomic SPARQL queries and are heavily used in practice [5].

Figure 9 compares these times for jamendo (left) and dbpedia (right) in the warm scenario, which is the most favorable for on-disk systems. The x-axis lists all the possible triple patternsFootnote 15 and groups the results for each system; solution times (in milliseconds) are reported on the y-axis (logarithmic scale).

Fig. 9 Solution time (in milliseconds) for triple patterns in jamendo and dbpedia (warm scenario)

The comparison for jamendo includes Hexastore. As can be seen, it is never the best choice and it only outperforms MonetDB in patterns with unbounded predicate. According to these results, we discard it because of its lack of competitiveness in the current setup. In contrast, \(\hbox {k}^2\)-triples \(^+\) turns out to be the most efficient choice; only MonetDB slightly outperforms it for (?,P,?), in all collections but dbpedia. Thus, in general, our approach reports the best overall performance for RDF retrieval. This can be analyzed in more detail:

  • Our approach overcomes the main vertical partitioning drawback and provides high performance for solving patterns with unbounded predicate. We study this on dbpedia because, for these queries, scalability is most seriously compromised by the large number of predicates. \(\hbox {k}^2\)-triples \(^+\) leads the comparison, whereas RDF3X is close for (S,?,?), falls behind for (?,?,O), and is more than 2 orders of magnitude slower for (S,?,O). As expected, a larger improvement is achieved with respect to our original \(\hbox {k}^2\)-triples (between 1 and 3 orders of magnitude), and the improvement is even more significant in comparison with MonetDB: The difference ranges from roughly 5 orders of magnitude in (S,?,?) to 8 orders in (S,?,O).

  • MonetDB excels over the other systems in solving the pattern (?,P,?), but this comparison changes in dbpedia. This is because predicates tend to be uniformly used in the other datasets, whereas in dbpedia, some predicates are overused and the remaining ones are scarcely used. Thus, the number of results to be obtained varies strongly, and this affects the performance. Table 4 summarizes solution times for predicates returning small and big result sets. As can be seen, \(\hbox {k}^2\)-triples \(^+\) dominates for less used predicates, whereas MonetDB is better when more results are retrieved. Thus, the optimized column-oriented representation provides the fastest solution when the predicate is used in numerous triples, whereas \(\hbox {k}^2\)-triples \(^+\) outperforms it for more restrictive predicates.

Finally, it is worth noting that \(\hbox {k}^2\)-triples \(^+\) obtains a competitive advantage over the original \(\hbox {k}^2\)-triples for datasets involving many predicates. For instance, in the case of dbpedia (containing a high number of predicates), the performance of the query pattern (S,?,?) is improved by 2 orders of magnitude. This difference owes to the fact that, in \(\hbox {k}^2\)-triples, \(39{,}672\) row queries have to be performed in order to answer that pattern. Considering that the average number of different predicates describing a subject (that is, its predicate list size) is about 6 in this dataset, the number of \(\hbox {k}^2\)-trees that have to be queried is significantly reduced in \(\hbox {k}^2\)-triples \(^+\). In the other datasets, however, the maximum number of predicates checked in \(\hbox {k}^2\)-triples is already below 30. As a result, the improvement in those datasets is lower and both techniques achieve comparable performance, yet \(\hbox {k}^2\)-triples uses slightly less space. Nevertheless, we will use \(\hbox {k}^2\)-triples \(^+\) in all the remaining experiments.

Joins After studying triple pattern performance, the next stage focuses on join solution. The following experiments \((i)\) analyze how our three join evaluation algorithms (chain, independent, and interactive) perform, and \((ii)\) compare them with RDF3X and MonetDB. All these experiments are performed in the warm scenario in order to avoid penalizing on-disk solutions.

Table 4 Solution time (in milliseconds) for the pattern (?,P,?) on dbpedia (warm scenario)

Figure 10 summarizes join results for dbpedia. This figure comprises 9 plots, one for each class of join according to the classification described in Fig. 7. Each plot comprises three subsets of joins on the x-axis: subject–object (SO), subject–subject (SS), and object–object (OO). The left group considers joins generating a small number of intermediate results, whereas the right group gives equivalent results for joins involving big intermediate result sets. Solution times (in milliseconds) are reported on the y-axis (logarithmic scale); note that times over \(10^7\) milliseconds are discarded in all the experiments. We analyze the results below:

Fig. 10 Solution time (in milliseconds) for joins in dbpedia (warm scenario)

  • \(\hbox {k}^2\)-triples \(^+\) is the fastest technique for solving joins in which the values of the two non-joined nodes are provided (classes A, D, and G). This is mainly because all these classes are solved using, exclusively, direct and reverse neighbors queries, which are very efficient in practice. Both chain and interactive evaluation algorithms dominate Join A: They report at least one order of magnitude of improvement with respect to RDF3X and MonetDB. Chain evaluation is slightly faster in Join D, improving upon RDF3X by more than one order of magnitude (except for OO big). Note that, in this case, MonetDB is no longer competitive since it pays the penalty of solving a pattern with unbounded predicate. Finally, interactive is the fastest choice for Join G, although chain overcomes it for OO joins. While \(\hbox {k}^2\)-triples \(^+\) is faster than RDF3X in all cases, it is worth noting that the differences are reduced due to the need to solve two patterns with unbounded predicate. Still, we remain competitive, while the performance of the vertical partitioning in MonetDB collapses (no times are drawn in this class).

  • The second column comprises joins that leave one non-joined node variable and fix the other one (classes B, E, and H). \(\hbox {k}^2\)-triples \(^+\) and RDF3X share the lead in these experiments, whereas MonetDB remains competitive only in Join B, although it is never the best choice. On the one hand, the results for Joins B and E.1 lead to similar conclusions. \(\hbox {k}^2\)-triples \(^+\) is the best choice for joins generating small intermediate result sets: chain is fastest for OO, and interactive for SO and SS. RDF3X overcomes \(\hbox {k}^2\)-triples \(^+\) when big intermediate result sets are obtained, although our chain evaluation obtains the best performance for OO joins. On the other hand, Joins E.2 and H lead to similar conclusions as well. In this case, RDF3X always achieves the best times, except for OO joins, in which chain evaluation is again the most efficient choice. Here, interactive evaluation is less competitive because it performs multiple range queries.

  • The third column comprises joins in which both triple patterns leave their non-joined nodes as variables (classes C and F). In Join C, RDF3X is the best choice for SO and OO joins, whereas MonetDB wins for SS. Note that our approach remains competitive for SS and SO, but its performance is significantly degraded for OO. In Join F, our chain evaluation competes with RDF3X for the best times, overcoming it for SS small. However, this turns out to be the most costly query; note that no technique finishes on OO joins involving big intermediate results.

Summarizing, \(\hbox {k}^2\)-triples \(^+\) excels when triple patterns provide values for the non-joined nodes, and it scales well when predicates are left as variables. Thus, in general terms, a query optimizer using \(\hbox {k}^2\)-triples \(^+\) must favor first joins A, D, or G; then joins B, E, and H; and finally joins C and F. In any case, joins involving small intermediate result sets are always preferable over those generating big intermediate results.

These findings also apply, in general terms, to the remaining datasets in our setup. As we show in Appendix 2, all of them yield broadly comparable figures. It is worth noting that \(\hbox {k}^2\)-triples \(^+\) overcomes the other techniques for the smallest dataset (jamendo), dominating the comparison in most cases and coming very close to RDF3X in Joins C and F. Another interesting aspect is that MonetDB is the system profiting most from a reduced number of predicates; it reports the best performance in some particular cases.

8 Conclusions and future work

This paper introduces a specific compressed index for RDF called \(\hbox {k}^2\)-triples. Based on the well-known vertical partitioning model, this technique represents (subject, object) pairs in very sparse binary matrices, which are effectively indexed in compressed space using the recent \(\hbox {k}^2\)-tree structures. This modeling achieves the most compressed representations with respect to a state-of-the-art baseline and also provides the most efficient performance for solving triple patterns with fixed predicate. To overcome the lack of scalability arising for patterns involving unbounded predicates, two additional indexes are also represented in compressed space. This simple enhancement makes our technique dominant for most RDF retrieval tasks. Moreover, we report competitive results for join solution, outperforming the state of the art for some classes of join and remaining competitive in most of the others.

Our future work focuses on building a full-fledged RDF store on top of the current \(\hbox {k}^2\)-triples approach. This means, on the one hand, designing a specific query optimizer able to leverage the current retrieval features and the reported performance for join operations. This would allow BGPs to be efficiently solved and would set the basis for providing a full SPARQL solution. On the other hand, we will work on a dynamic RDF store that allows insertion, deletion, and updating of triples over the current data partitioning scheme. These objectives are strongly stimulated by the recent advances reported on dynamic \(\hbox {k}^2\)-trees [14].