RSP4J: An API for RDF Stream Processing

Tommasini, Riccardo; Bonte, Pieter; Ongenae, Femke; Della Valle, Emanuele

doi:10.1007/978-3-030-77385-4_34


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12731)

Included in the following conference series:

European Semantic Web Conference


Abstract

Over the last decade, the RDF Stream Processing (RSP) community has proposed several models and languages for continuously querying and reasoning over RDF streams. Each has its own semantics, which makes them hard to compare. The variety of approaches has fostered both empirical and theoretical research and led to the design of RSPQL, i.e., a unifying model for RSP. However, an RSP API for development under RSPQL semantics was still missing. The RSP community would benefit from such an API because it can foster comparable and reproducible research by providing programming abstractions based on RSPQL semantics. Moreover, it can encourage further development and in-use research. Finally, it can stimulate practical activities such as tutorials, lectures, and challenges, e.g., during the Stream Reasoning Workshop.

In this paper, we present RSP4J, a flexible API for the development of RSP engines and applications under RSPQL semantics. RSP4J offers all the abstractions required for fast prototyping of RSP engines under the proposed RSPQL semantics. Users can configure it to reproduce the variety of RSP engine behaviors in a comparable software environment. To promote systematic and comparative research, RSP4J is open-source and provides a canonical citation, permanent web identifiers, and a comprehensive user guide for developers.



1 Introduction

The advent of the Internet of Things and social media has unveiled the streaming nature of information [9]. Data analysis should not only consider huge amounts of data from various complex domains, it should also be executed rapidly, before the data are no longer valuable or representative. Stream Reasoning (SR) is the research area that combines Stream Processing and Semantic Web technologies to make sense, in real-time, of vast, heterogeneous and noisy data streams [13].

Since 2008, the SR community’s contributions include data models, query languages, algorithms, and benchmarks for RDF Stream Processing (RSP). As an extension of the Semantic Web stack, the value of RSP emerges in application domains where Data Variety and Data Velocity appear together [10], e.g., Smart Cities, e-Health and news analysis.

RSP approaches extend RDF and SPARQL to represent and process data streams. The community rapidly reached consensus around the use of RDF Streams as the data model. On the other hand, a variety of RSP languages emerged over time, e.g., C-SPARQL, CQELS-QL, SPARQL\(_{stream}\), and Strider-QL. Such languages are extensions of SPARQL that support some form of continuous semantics. RSP languages are usually paired with working prototypes that helped prove the feasibility of the approach and study its efficiency. Such a variety of languages and systems enriches the state of the art, but it may be hindering adoption and, thus, slowing down technological progress. Indeed, the diversity in the literature often prevents the identification of a clear winner and the establishment of best practices, and it calls for comparative research.

As in other communities, e.g., OWL reasoning [20] and Big Data Systems [1], comparative research on and benchmarking of RSP engines is extremely hard. In fact, the semantics of different RSP languages do not completely overlap [12]. Moreover, the development of RSP engines, which are time-based systems, implies a number of design decisions that are often hidden in the code [7]. Such decisions, which fall into the notion of operational semantics^{Footnote 1}, hamper performance comparison, making it hard to reproduce the same behavior in two different systems and, thus, to generalise the conclusions.

In summary, the lack of standardization and shared design principles is obstructing the growth of these communities. Indeed, as prototyping efforts remain isolated, the costs of developing and maintaining prototypes remain on the shoulders of individual researchers. Nevertheless, the problem did not remain unnoticed. The OWL reasoning community worked on shared APIs to standardise the evaluation of OWL reasoners [16], fostering a number of initiatives like the OWL Reasoning Evaluation (ORE) challenges [20]. The Big Data Systems community witnessed the publication of a number of surveys and unification projects. In particular, Apache Beam is an attempt to unify the APIs for stream and batch Big Data processing [17].

The RSP community is also working actively on solving this issue, focusing on (i) designing best practices [24, 27], (ii) disseminating the approaches [15], and (iii) developing benchmarks that take correctness and execution semantics into account [3, 18]. A recent important result is RSPQL [14], a reference model that unifies existing RSP dialects and the execution semantics of existing RSP engines. Although RSPQL is a first step towards a community standard, existing prototypes still do not follow shared design principles.

In line with the OWL APIs and the Apache Beam initiative, an API based on RSPQL would reduce the maintenance cost of existing engines, foster the adoption of RSP engines, and open new research opportunities in Stream Reasoning.

In this paper, we present RSP4J, a configurable RSPQL API and engine that builds on the lessons learned from developing the existing prototypes and aims to bring RSP research to the next level. We believe RSP4J can foster fast prototyping and empirical and comparative research, and can ease the dissemination of RSP via teaching. To this end, RSP4J includes (i) all the necessary abstractions to develop RSP engines under the proposed RSPQL semantics and (ii) an implementation, i.e., YASPER, based on Apache Commons RDF^{Footnote 2}, with the goal of showcasing the API’s potential. Moreover, RSP4J can reproduce the variety of RSP engines in a comparable software environment.

In summary, the goals of RSP4J are (i) fostering the design and development of RSP engines under fixed RSPQL semantics, (ii) unifying the existing prototypes and their results, (iii) providing a framework for fair comparison of results, and (iv) presenting a high-level API that RSP developers can easily adopt. RSP4J is open-source and is maintained on GitHub^{Footnote 3}. It has a canonical citation and a permanent URL^{Footnote 4}. Moreover, it comes with actively maintained documentation and a Ready2GoPack that makes it more accessible to new members of the RSP community. RSP4J was already used in a number of tutorials and lectures, i.e., ISWC17, ICWE18, ESWC19, TheWebConf19, RW18/20.

The remainder of the paper is organized as follows: Sect. 2 discusses the potential impact of RSP4J in terms of use cases, and afterwards presents the requirement analysis. Section 3 presents the background, concepts and definitions used throughout the rest of the paper, while Sect. 4 outlines the architecture of RSP4J and its modules, and shows how it satisfies the requirements. In Sect. 5, we present the related work. Finally, Sect. 6 concludes the paper and summarizes the most important contributions of RSP4J as a resource.

2 Impact: Use Cases and Requirements for an RSPQL API

In this section, we discuss the potential impact of RSP4J as a resource. To this end, we present different use cases that concern state-of-the-art prototypes for Stream Reasoning and RDF Stream Processing. We highlight the challenges that such use cases unveil, and we elicit a set of requirements for RSP4J in order to address such challenges. Table 1 summarizes the relationship between the challenges (C\(_i\)) and requirements (R\(_j\)).

2.1 Use Cases

Fast Prototyping. In 2008, the first Stream Reasoner prototype came out [31]. Since then, the SR community has designed a number of working prototypes [5, 8, 19, 23], with the intent of proving the feasibility of the vision. E-health, smart cities, and financial transactions are examples of use cases where such prototypes were successfully used. Nevertheless, the effort of designing and engineering good prototypes is extremely high, and their maintenance is often unsustainable. In fact, prototypes are often designed with a minimal set of requirements and without shared design principles. In such scenarios, (C1) adding new operators, (C2) adding new types of data sources to consume, or (C3) experimenting with new optimisation techniques requires huge manual effort or is almost impossible.

Comparable Research and Benchmarking. Aside from developing proofs of concept, the SR/RSP communities have focused a lot on Comparative Research (CR) [24, 27] and benchmarking [3, 18, 22, 26, 32]. CR studies the differences and similarities across SR/RSP approaches. Stream Reasoners and RSP engines can only be compared when they employ the same semantics. Thus, a fair comparison demands a deep theoretical comprehension of the approaches, a proper formulation of the task to solve, and an adequate experimentation environment [28]. Consequently, it is currently hard to (C4) reproduce the behavior of existing approaches in a comparable way. Moreover, experimentation is limited by (C5) the lack of parametric solutions, i.e., configurable operators that allow matching engine behavior. On the other hand, research on benchmarking aims at pushing technological progress by guaranteeing a fair assessment. While some of the challenges are shared with CR [27, 28], benchmarking is empirical research. To this end, monitoring both (C6) the execution of continuous queries and (C7) the engine behavior at run-time is of paramount importance. Unfortunately, not all the existing prototypes provide such entry points, and only black-box analysis is possible, e.g., it is impossible to measure the performance of each of the engine’s internal operators.

Dissemination. Although SR research is in its infancy, a lot has been done on the teaching side. As prototypes and approaches reached maturity, several tutorials and lectures were delivered at major venues, including ICWE, ESWC, ISWC, RW, and TheWebConf [10, 11, 15]. These tutorials were often practical and aimed at engaging with their audience using simple yet meaningful applications. Nevertheless, existing prototypes were not designed for teaching purposes. Thus, they lack important features: (C8) they do not make it possible to inspect the engine behavior, and (C9) they are not designed to ease understanding at various levels of abstraction. Indeed, prototypes often (C10) neglect full compliance with the underlying theoretical framework for practical reasons. Although this approach often benefits performance, it makes the learning curve steeper.

2.2 Requirements

Table 1. Challenges vs Requirements


Now we present the requirements that an RSPQL API should satisfy. We elicit the requirements from the challenges presented above. Although the requirements could be generalized for any RSP engine and Stream Reasoner, we restrict our focus to Window-based RDF Stream Processing Engines, i.e., those covered by the RSPQL specification.

R\(_1\) Extensible Architecture. An RSP API should allow the easy addition of data sources (C2) and operators (C1), and the design of optimization techniques (C3). Moreover, an RSP API should allow experimentation by allowing the addition of execution parameters (C5), and should ease the extension of engine capabilities (C7).
R\(_2\) Declarative Access. An RSP API should be accessible in a declarative and configurable manner (C4). It should allow querying according to a formal semantics, e.g., RSPQL (C10), and should allow controlling the engine and the query lifecycles (C8).
R\(_3\) Programming Abstractions. An RSP API should provide programming abstractions that allow interacting with the engine at various levels of abstraction (C9), that are based on a theoretical framework (C10), and that provide a blueprint to make sense of the engine behavior (C8).
R\(_4\) Experimentation. An RSP API should be suitable for experimentation and, thus, should foster comparative research. To this end, it should allow experimentation with optimization techniques (C3) and enable experiments to be executed using alternative configurations (C5). Last but not least, the reproducibility of state-of-the-art solutions should be a priority to enable replication studies (C4).
R\(_5\) Observability. An RSP API should be observable by design, enabling the collection of metrics at different levels, i.e., stream level, operator level, query level (C6), and engine level (C7). Observability should be independent from architectural changes (C1 and C2), and ease the study of optimizations (C3).

3 Background

In this section, we summarize the knowledge necessary to understand the main concepts of RSP4J. RSP4J is based on RSPQL, which in turn relies on the Continuous Query Language (CQL) [4] for its operation structure, SPARQL 1.1 semantics for RDF querying, and the SECRET model [7] for its operational semantics. Notably, we assume some knowledge of RDF and SPARQL semantics^{Footnote 5}.

Definition 1

A data stream \(\mathcal {S}\) is an infinite sequence of tuples \(\langle d_i,t_e,t_p\rangle \), where \(d_i\) is a data item and \(t_e\) and \(t_p\) are the event-time and processing-time timestamps, respectively. An RDF Stream is a data stream in which each data item \(d_i\) is an RDF object.

In the literature, there are many definitions of data stream, with a general agreement on considering them as unbounded sequences of time-ordered data. Different notions of time are relevant for different applications. The most important ones are the time at which a data item reaches the data system (processing time), and the time at which a data item was produced (event time) [2]. In RSP, streams are represented as RDF objects, as stated by Definition 1 [14].

Operationally, stream processing requires a special class of queries that run under continuous semantics (vid. Definition 2). In practice, continuous queries consume one or more infinite inputs and produce an infinite output [25]. Arasu et al. [4] proposed a query model for processing relational streams based on three families of operators, as depicted in Fig. 1. RSPQL extends these operator families to work on RDF Streams.

Definition 2

Under continuous semantics, the result of a query is the set of results that would be returned if the query were executed at every instant in time.

Stream-to-Relation (S2R) (vid. Fig. 1 a) is a family of operators that bridges the world of streams with the world of relational data processing. These operators chunk the streams into finite portions. A typical operator of this kind is the Time Window operator. In RSPQL, a time-based window operator is defined as in Definition 3.

Definition 3

The time-based window operator \(\mathbb {W}\) is a triple \((\alpha ,\beta ,t^0)\) that defines a series of windows of width \(\alpha \) that slide by \(\beta \), starting at \(t^0\).
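To make Definition 3 concrete, the following minimal sketch enumerates the windows of \(\mathbb {W}=(\alpha ,\beta ,t^0)\) that contain a given time instant. It assumes half-open windows \([o, o+\alpha )\) that open at \(o = t^0 + i\beta \); this interval convention, the class name `TimeWindowSketch`, and its method are assumptions made for illustration and are not part of RSP4J.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not an RSP4J class: the window operator W = (alpha, beta, t0). */
final class TimeWindowSketch {
    private final long alpha; // window width
    private final long beta;  // slide between consecutive window openings
    private final long t0;    // opening instant of the first window

    TimeWindowSketch(long alpha, long beta, long t0) {
        if (alpha <= 0 || beta <= 0) {
            throw new IllegalArgumentException("alpha and beta must be positive");
        }
        this.alpha = alpha;
        this.beta = beta;
        this.t0 = t0;
    }

    /**
     * Returns the {open, close} bounds of every window that contains instant t,
     * ordered by opening time. Windows are assumed half-open: [open, close).
     */
    List<long[]> windowsContaining(long t) {
        List<long[]> result = new ArrayList<>();
        if (t < t0) {
            return result; // no window has opened yet
        }
        long newest = (t - t0) / beta; // index of the last window opened at or before t
        for (long i = newest; i >= 0; i--) {
            long open = t0 + i * beta;
            if (open + alpha <= t) {
                break; // this window, and every older one, has already closed
            }
            result.add(0, new long[] {open, open + alpha});
        }
        return result;
    }

    public static void main(String[] args) {
        for (long[] w : new TimeWindowSketch(10, 5, 0).windowsContaining(12)) {
            System.out.println("[" + w[0] + ", " + w[1] + ")");
        }
    }
}
```

With width 10, slide 5 and \(t^0 = 0\), the instant 12 falls in the two windows [5, 15) and [10, 20), which is what `main` prints.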

Relation-to-Relation (R2R) (Fig. 1 b) is a family of operators that can be executed over the finite stream portions. In the context of RSPQL, R2R operators are SPARQL 1.1 operators evaluated under continuous semantics.

To clarify this intuition, Dell’Aglio et al. introduce the notions of Time-Varying Graph and RSPQL dataset [14]. A Time-Varying Graph is the result of applying a window operator \(\mathbb {W}\) to an RDF Stream \(\mathcal {S}\) (vid. Definition 4), while the RSPQL dataset (SDS) is an extension of the SPARQL dataset for continuous querying (vid. Definition 5).

Definition 4

A Time-Varying Graph is a function that takes a time instant as input and produces as output an RDF Graph, which is called instantaneous.

Given a window operator \(\mathbb {W}\) and an RDF Stream \(\mathcal {S}\), the Time-Varying Graph \(TVG_{\mathbb {W},\mathcal {S}}\) is defined at every time instant at which \(\mathbb {W}\) is defined.

In practice, for any given time instant t, \(\mathbb {W}\) identifies a subportion of the RDF Stream \(\mathcal {S}\) containing various RDF Graphs. The Time-Varying Graph function returns the union (coalescing) of all the RDF Graphs in the current window^{Footnote 6}.
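The coalescing step can be illustrated with a short sketch. It is only an illustration under simplifying assumptions: triples are plain strings, the window is the half-open interval [open, close), and the names `TimeVaryingGraphSketch`, `TimestampedGraph` and `instantaneousGraph` are hypothetical rather than RSP4J types.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration, not an RSP4J class. Triples are plain strings for brevity. */
final class TimeVaryingGraphSketch {

    /** One stream element: an RDF graph (a set of triples) and its timestamp. */
    static final class TimestampedGraph {
        final long timestamp;
        final Set<String> triples;

        TimestampedGraph(long timestamp, Set<String> triples) {
            this.timestamp = timestamp;
            this.triples = triples;
        }
    }

    /**
     * Coalesces into one instantaneous graph every stream element whose
     * timestamp lies in the window [open, close).
     */
    static Set<String> instantaneousGraph(List<TimestampedGraph> stream, long open, long close) {
        Set<String> union = new LinkedHashSet<>();
        for (TimestampedGraph g : stream) {
            if (g.timestamp >= open && g.timestamp < close) {
                union.addAll(g.triples); // set union removes duplicate triples
            }
        }
        return union;
    }
}
```

Because the result is a set, a triple that occurs in several graphs of the same window appears only once in the instantaneous graph.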

Definition 5

An RSPQL dataset SDS extends the SPARQL dataset^{Footnote 7} as follows: an optional default graph A\(_0\), n \((n\ge 0)\) named Time-Varying Graphs, and m \((m\ge 0)\) named sliding windows over k \((k \le m)\) data streams.

An RSPQL query is continuously evaluated against an SDS by an RSP engine. The evaluation of an RSPQL query outputs an instantaneous multiset of solution mappings for each evaluation time instant. The RSP engine’s operational semantics determines the set ET of evaluation time instants.

Finally, Relation-to-Stream (R2S) (Fig. 1 c) is a family of operators that goes back from finite data to the world of infinite streams. RSPQL includes three R2S operators: (i) the RStream, which emits the current solution mappings; (ii) the IStream, which emits the difference between the current solution mappings and the previous ones; and (iii) the DStream, which emits the difference between the previous solution mappings and the current ones.
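The three R2S operators can be sketched as set differences between two consecutive evaluations. This is a simplified illustration: RSPQL works on multisets of solution mappings, whereas the sketch uses plain sets, and the class `R2SSketch` with its methods is hypothetical and not part of RSP4J.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical illustration, not an RSP4J class. */
final class R2SSketch {

    /** RStream: emit the current solution mappings as they are. */
    static <T> Set<T> rstream(Set<T> previous, Set<T> current) {
        return new LinkedHashSet<>(current);
    }

    /** IStream: emit what is in the current evaluation but was not in the previous one. */
    static <T> Set<T> istream(Set<T> previous, Set<T> current) {
        Set<T> out = new LinkedHashSet<>(current);
        out.removeAll(previous);
        return out;
    }

    /** DStream: emit what was in the previous evaluation but is no longer current. */
    static <T> Set<T> dstream(Set<T> previous, Set<T> current) {
        Set<T> out = new LinkedHashSet<>(previous);
        out.removeAll(current);
        return out;
    }
}
```

For a previous result {a, b} and a current result {b, c}, the RStream emits {b, c}, the IStream emits {c}, and the DStream emits {a}.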

When developing Stream Processing Engines to evaluate continuous queries, a number of design decisions might impact query correctness. Such decisions, which are usually hidden in the query engine implementation, define the so-called operational semantics (also known as execution semantics). Botan et al. [7], with their SECRET model, identified a set of four primitives that formalise the operational semantics of window-based stream processing engines. RSPQL incorporates these primitives and applies them to existing RSP engines: (i) Scope is a function that maps an event-time instant \(t_e\) to the temporal interval where the computation occurs. (ii) Content is a function that maps a processing-time instant \(t_p\) to the subset of stream elements included in the interval identified by the scope function. (iii) Report is a dimension that characterizes under which conditions the stream processor emits the window content. SECRET defines four reporting dimensions: (CC) Content Change: the engine reports when the content of the current window changes; (WC) Window Close: the engine reports when the current window closes; (NC) Non-empty Content: the engine reports when the current window is not empty; (P) Periodic: the engine reports periodically. (iv) Tick is a dimension that explains what triggers the report evaluations. Possible Ticks are time-driven, tuple-driven, or batch-driven.
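As a rough illustration of the Report dimension, the four strategies can be written as predicates over a snapshot of the active window. The sketch below is not the SECRET formalisation itself: the types `ReportSketch`, `WindowState` and `Report` are hypothetical, and the rule that an engine reports only when all of its configured strategies hold is an assumption made for the example.

```java
import java.util.Set;

/** Hypothetical illustration of SECRET's Report dimension; not an RSP4J class. */
final class ReportSketch {

    /** Snapshot of the active window at the instant a tick triggers an evaluation. */
    static final class WindowState {
        final long now;               // current time instant
        final long close;             // closing instant of the active window
        final int contentSize;        // number of stream elements in the window
        final boolean contentChanged; // whether the content differs from the last check

        WindowState(long now, long close, int contentSize, boolean contentChanged) {
            this.now = now;
            this.close = close;
            this.contentSize = contentSize;
            this.contentChanged = contentChanged;
        }
    }

    enum Report {
        CONTENT_CHANGE {
            @Override boolean holds(WindowState s, long period) { return s.contentChanged; }
        },
        WINDOW_CLOSE {
            @Override boolean holds(WindowState s, long period) { return s.now >= s.close; }
        },
        NON_EMPTY_CONTENT {
            @Override boolean holds(WindowState s, long period) { return s.contentSize > 0; }
        },
        PERIODIC {
            @Override boolean holds(WindowState s, long period) {
                return period > 0 && s.now % period == 0;
            }
        };

        abstract boolean holds(WindowState s, long period);
    }

    /** Assumption: the engine reports only when every configured strategy holds. */
    static boolean shouldReport(Set<Report> strategies, WindowState s, long period) {
        if (strategies.isEmpty()) {
            return false;
        }
        for (Report r : strategies) {
            if (!r.holds(s, period)) {
                return false;
            }
        }
        return true;
    }
}
```

Under this assumption, an engine configured with Window Close and Non-empty Content stays silent when a window closes without any content, while an engine configured with Window Close alone reports an empty result.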

4 RSP4J

In this section, we present RSP4J’s architecture and its components, and we show how it satisfies the requirements (cf Sect. 2). Figure 2 shows RSP4J’s core modules, i.e., (a) Querying, (b) Streams, (c) Operators, (d) the SDS, and (e) the Engine with Execution Semantics. To provide concrete examples of RSP4J, we will use Yet Another Stream Processing Engine for RDF (YASPER). YASPER is a strawman proposal^{Footnote 8} designed for teaching purposes in the context of [15].

4.1 Querying

The query module contains the elements for writing RSPQL programs in a declarative way (R\(_2\)). The syntax is based on the proposal by the RSP community. At this stage of development, RSP4J accepts SELECT and CONSTRUCT queries written in RSPQL syntax (e.g. Listing 1.1)^{Footnote 9}. Although RSPQL [14] does not discuss how to handle multi-streams, RSP4J does, allowing its users to fully replicate the behavior of existing systems (cf R\(_4\)).

Moreover, RSP4J includes the ContinuousQuery interface that aims at making the syntax extensible (cf R\(_1\)). Indeed, RSP4J users can bypass the syntax module and programmatically define extensions in the query language.

4.2 Streams

The Streams module allows users to provide their own implementation of a data stream. It consists of two interfaces inspired by VoCaLS [30]: WebStream and WebDataStream. WebStream represents the stream as a Web resource, while WebDataStream represents the stream as a data source. Figure 3 provides an overview of the relationships across these classes and interfaces. The WebStream does not include any particular logic. It is identified by an HTTP URI, so it can be dereferenced and then consumed through an available endpoint [30]. Listing 1.11 shows the implementation of the WebDataStream interface, which exposes two methods: put and addConsumer. The former allows producers to inject timestamped data items of type E; the latter connects the stream to interested consumers, e.g., window operators or super-streams. The interface is generic, and it allows RSP4J’s users to utilize multiple RDF Stream representations, i.e., either RDF Graphs or Triples, or even non-RDF Web Streams. A WebDataStream might also include some metadata relevant for the processing, i.e., links to ontologies, SHACL schemas, or alternative endpoints.
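The put/addConsumer contract described above can be approximated by the following sketch. It is not the RSP4J source: the interface `DataStreamSketch`, the class `InMemoryStreamSketch`, the use of `BiConsumer` for consumers, and the exact method signatures are all assumptions chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

/** Hypothetical stand-in for a data stream abstraction; names and signatures are assumptions. */
interface DataStreamSketch<E> {
    /** Injects a data item, together with its timestamp, into the stream. */
    void put(E item, long timestamp);

    /** Registers a consumer that is notified of every item put afterwards. */
    void addConsumer(BiConsumer<E, Long> consumer);
}

/** In-memory stream, identified by a URI, that forwards each item to its consumers. */
final class InMemoryStreamSketch<E> implements DataStreamSketch<E> {
    private final String uri;
    private final List<BiConsumer<E, Long>> consumers = new ArrayList<>();

    InMemoryStreamSketch(String uri) {
        this.uri = uri;
    }

    String uri() {
        return uri;
    }

    @Override
    public void put(E item, long timestamp) {
        for (BiConsumer<E, Long> consumer : consumers) {
            consumer.accept(item, timestamp);
        }
    }

    @Override
    public void addConsumer(BiConsumer<E, Long> consumer) {
        consumers.add(consumer);
    }
}
```

Keeping the element type generic means the same stream class can carry whole RDF graphs, single triples, or non-RDF payloads, which mirrors the flexibility the paragraph describes. In this sketch, items put before a consumer registers are simply not delivered to it.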

4.3 Operators

RSP4J core includes separate interfaces for all the RSPQL families of operators: StreamToRelation, RelationToRelation, and RelationToStream. These abstractions act both as lower-level APIs for RSP4J’s users (cf R\(_3\)) and as a suitable entry point for extensions and optimizations (cf R\(_1\)). Moreover, each operator lifecycle can be monitored independently (cf R\(_5\)).

The Stream To Relation operator family bridges the world of RDF Streams to the world of finite RDF Data. RSPQL defines a Time-Based Sliding Window operator for processing RDF Streams. When applied to an RDF stream, RSPQL’s S2R operator returns a function called Time-Varying Graph that, given a time instant t, materializes an Instantaneous (finite) RDF Graph.

To represent such behavior, RSP4J includes two interfaces, i.e., the StreamToRelationFactory interface and the StreamToRelation (S2R) operator. The former, exemplified in Listing 1.4, is used to instantiate the latter. It exposes the apply method, which takes a generic I as input and returns a Time-Varying object O, decoupling the type of the input stream content from the output Time-Varying object. Listing 1.5 shows part of an implementation of the StreamToRelationOperatorFactory that instantiates C-SPARQL’s Time-Based Sliding Window.

The StreamToRelation operator is responsible for applying the windowing algorithm. In RSP4J, it is a special kind of Consumer that receives the data from the streams, cf Listing 1.3. Listing 1.5 shows that the factory instantiates a CSPARQLS2ROp, which is a StreamToRelationOperator, and registers it as a stream consumer. Then, it obtains a Time-Varying object from the ContinuousQueryExecution context. We explain the details about O when discussing the RSPQL Dataset SDS.

Figure 4 shows the UML class diagram of the S2R package. CSPARQLWindowOperator is a StreamToRelation operator, which creates a Time-Varying Graph if applied to an RDF Stream.
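The idea of an S2R operator that acts as a stream consumer and exposes the content of the active window can be sketched as follows. This is not the C-SPARQL operator shipped with RSP4J: the class `SlidingWindowOpSketch`, its methods, the half-open window convention [open, open + alpha), and the choice of the window with the oldest closing time are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical S2R operator sketch; not the C-SPARQL operator shipped with RSP4J. */
final class SlidingWindowOpSketch<I> {
    private final long alpha; // window width
    private final long beta;  // window slide
    private final long t0;    // opening instant of the first window
    private final List<I> items = new ArrayList<>();
    private final List<Long> timestamps = new ArrayList<>();

    SlidingWindowOpSketch(long alpha, long beta, long t0) {
        if (alpha <= 0 || beta <= 0) {
            throw new IllegalArgumentException("alpha and beta must be positive");
        }
        this.alpha = alpha;
        this.beta = beta;
        this.t0 = t0;
    }

    /** Consumer side: called by the stream for every incoming item. */
    void accept(I item, long timestamp) {
        items.add(item);
        timestamps.add(timestamp);
    }

    /**
     * Content at instant t: the items received so far that fall in the window
     * with the oldest closing time among those containing t.
     */
    List<I> content(long t) {
        List<I> out = new ArrayList<>();
        if (t < t0) {
            return out; // no window has opened yet
        }
        // Smallest window index i such that t < t0 + i * beta + alpha.
        long index = Math.max(0, Math.floorDiv(t - t0 - alpha, beta) + 1);
        long open = t0 + index * beta;
        if (open > t) {
            return out; // alpha < beta and t falls in a gap between windows
        }
        for (int k = 0; k < items.size(); k++) {
            long ts = timestamps.get(k);
            if (ts >= open && ts <= t) {
                out.add(items.get(k));
            }
        }
        return out;
    }
}
```

A real operator would also evict items that can no longer fall in any open window; the sketch keeps everything in memory to stay short.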

In RSPQL, the Relation To Relation operator family corresponds to SPARQL 1.1 algebraic expressions evaluated over a given time instant. The evaluation of the Basic Graph Pattern produces a time-varying sequence of solution mappings, which can be consumed by SPARQL 1.1 operators.

Listing 1.6 shows the RSP4J interface that covers this functionality. Similarly to the S2R operators, the interface is generic to let RSP4J’s users decide the internal representation of the query solution, e.g., the bindings.

The RelationToStream operator family allows going back from the world of Solution Mappings to RDF Streams. According to RSPQL, the evaluation of an R2S operator takes as input a sequence of time-varying solution mappings. In RSP4J, we generalized this idea as shown in Listing 1.7, i.e., we also allow the user to provide the solution mappings incrementally, as soon as they are produced.

4.4 SDS and Time-Varying Graphs

As in SPARQL, the query specification and the SDS creation are closely related. An RSPQL dataset SDS is an extension of the SPARQL dataset to support continuous semantics. As indicated in Sect. 3, the SDS is time-dependent as it contains Time-Varying Graphs. RSP4J includes both abstractions, i.e., the SDS and the Time-Varying Graphs (cf R\(_3\)).

Listing 1.8 shows RSP4J’s SDS interface. The generic parameter is inherited from the generic nature of RSP4J’s Time-Varying objects. The consolidate method consolidates the SDS content by recursively consolidating every Time-Varying Object it contains.

Listing 1.9 shows a Time-Varying Graph that is the result of applying the Window Operator to an RDF Stream. The method materialize consolidates the content at a given time instant ts. To this end, it exploits the StreamToRelationOperator interface, freezing and polling the active window content. The coalesce method ensures that only one graph, among those selected during the windowing operation, is returned. According to RSPQL, such a graph corresponds to the union of the RDF graphs in the window.
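The interplay between consolidating the SDS and materializing its Time-Varying objects can be sketched as below. The sketch only mirrors the method names mentioned in the text (materialize and consolidate); the types `TimeVaryingSketch`, `FunctionBackedTimeVarying` and `SdsSketch`, their signatures, and the `snapshot` helper are hypothetical and do not reproduce the RSP4J interfaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongFunction;

/** Hypothetical time-varying object: its value is fixed by materialize(ts) and read with get(). */
interface TimeVaryingSketch<T> {
    void materialize(long ts);

    T get();
}

/** Time-varying object backed by a function from time instants to content. */
final class FunctionBackedTimeVarying<T> implements TimeVaryingSketch<T> {
    private final LongFunction<T> contentAt;
    private T current; // null until the first materialization

    FunctionBackedTimeVarying(LongFunction<T> contentAt) {
        this.contentAt = contentAt;
    }

    @Override
    public void materialize(long ts) {
        current = contentAt.apply(ts); // freeze the content at instant ts
    }

    @Override
    public T get() {
        return current;
    }
}

/** Hypothetical dataset holding time-varying objects; consolidate() materializes each of them. */
final class SdsSketch<T> {
    private final List<TimeVaryingSketch<T>> members = new ArrayList<>();

    void add(TimeVaryingSketch<T> member) {
        members.add(member);
    }

    void consolidate(long ts) {
        for (TimeVaryingSketch<T> member : members) {
            member.materialize(ts);
        }
    }

    /** Returns the instantaneous content of every member, in insertion order. */
    List<T> snapshot() {
        List<T> out = new ArrayList<>();
        for (TimeVaryingSketch<T> member : members) {
            out.add(member.get());
        }
        return out;
    }
}
```

Separating materialize from get lets a query read a stable snapshot while the underlying window keeps receiving data.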

As time progresses, the SDS is reactively consolidated into a set of (named) Instantaneous Graphs^{Footnote 10} at the time t at which a Time-Varying Graph is updated.

Therefore, RSP4J includes the SDSManager and SDSConfiguration interfaces. The former controls the creation and detection of the SDS and the interactions with it; ideally, this represents a starting point for federated query answering and/or multi-query optimisation. The latter makes the execution parametric, e.g., for enabling different approaches to window management or alternative output serializations such as JSON-LD or Turtle.

4.5 Engine, Query Execution, and Execution Semantics

This module includes the abstractions to control and monitor the engine and the query lifecycle (cf R\(_4\) and R\(_5\)). Moreover, we explain RSP4J’s parametric execution semantics (R\(_4\)).

The Engine interface allows controlling RSP4J’s capabilities, e.g., query registration and cancellation. It is based on the VoCaLS service feature idea [30]. Each engine can implement different interfaces, each of which corresponds to a particular feature. By querying the implemented interfaces, it is possible to list all the features exposed by the engine of choice, e.g., stream registration, RSPQL support, or formatting the results in JSON-LD.

RSP4J can reproduce the execution semantics of common RSP engines by configuring SECRET’s primitives. Tick is represented as an enumeration, i.e., tuple-based, time-based, and batch. Scope is a parameter accessible through the Time interface. Time controls the time progress w.r.t. the stream consumption. It is initialized with the system initial timestamp at configuration time. It keeps track of the evaluation timestamps ET and exposes the time progress to the user for both event time and processing time^{Footnote 11}. Report is represented as a collection of ReportingStrategies. RSP4J core includes RSPQL’s reporting policies, i.e., On-Content-Change, Non-Empty-Content, Periodic, and On-Window-Close. Last but not least, the Content interface represents the data items in the active window. It is generic and exposes the coalesce method, which allows alternative implementations of the Time-Varying Graph function.

The ContinuousQueryExecution interface represents the ever-lasting computation required by continuous queries. It allows monitoring and controlling the query life-cycle. Moreover, the interface includes getters that make the SDS and the operators involved in querying observable (R\(_5\)).

5 Related Work

In this section, we present the work related to RSP4J. We present the most popular RSP engines, and how they differ in terms of RSPQL semantics, complicating fair comparison.

The C-SPARQL Engine [5] is an RSP engine that adopts a black box approach by pipelining a DSMS system with a SPARQL engine. The DSMS is used to execute the S2R operators and the execution semantics, while the SPARQL engine performs the evaluation of the queries implemented as the R2R operator. C-SPARQL supports the Window Close and Non-empty Content reporting policies while employing the RStream as its R2S operator.

The CQELS Engine [19] takes a white box approach, such that it has access to all the available operators, allowing it to optimize query evaluation. Compared to C-SPARQL, it supports the Content Change reporting strategy. Furthermore, CQELS supports the IStream R2S operator instead of the RStream.

Morph\(_{stream}\) [8] focuses on querying virtual RDF streams with SPARQL\(_{stream}\). Thus, compared to C-SPARQL and CQELS, it uses Ontology Based Data Access to virtually map raw data to RDF data. Similar to C-SPARQL it supports the Window Close and Non-empty content reporting policies. Morph\(_{stream}\) is the only engine that supports all R2S operators.

Strider [23] is a hybrid adaptive distributed RSP engine that optimizes the logical query plan according to the state of the data streams. It is built upon Spark Streaming and borrows most of its operators directly from Spark. Strider translates Strider-QL queries to Spark Streaming’s internal operators. It inherits the Window Close reporting policy from Spark Streaming, and supports the RStream as its R2S operator.

In addition to the rigid yet explicit characteristics that each engine has, they also have inherent subtle differences. For example, none of them allows the starting timestamp \(t^0\) to be defined as part of the time-based sliding window operator’s definition. This means that the starting time is supplied by the engine itself and, in the case of processing time, engines are bound to produce different results. Differences like \(t^0\) make it impossible to correctly compare the results produced by the various engines. RSP4J allows customizing the inner wiring of the engines in order to align their semantics and let them produce comparable results.
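A small worked example illustrates why \(t^0\) matters. It assumes half-open windows \([o, o+\alpha )\) that open at \(o = t^0 + i\beta \), which is a convention chosen for this example. Take \(\alpha = 10\), \(\beta = 5\) and the evaluation instant \(t = 12\):

\[
\begin{aligned}
t^0 = 0 &: \quad \text{the windows containing } t \text{ are } [5,15) \text{ and } [10,20),\\
t^0 = 3 &: \quad \text{the windows containing } t \text{ are } [3,13) \text{ and } [8,18).
\end{aligned}
\]

A stream element with timestamp 4 therefore belongs to the oldest active window when \(t^0 = 3\), but to none of the active windows when \(t^0 = 0\). Two engines that differ only in \(t^0\) can thus return different answers to the same query over the same stream.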

6 Conclusion, Discussion, and Roadmap

In this paper, we presented RSP4J, a flexible API for RSP development, adhering to the semantics of RSPQL, and YASPER, i.e., a strawman RSP engine implementation designed for teaching purposes in the context of RW [15].

RSP4J aims at solving three use cases: fast prototyping, benchmarking and comparative research, and dissemination via teaching. Thus, we designed it to fulfill a set of requirements: (I) an extensible architecture; (II) declarative access through a uniform query language according to the RSPQL semantics; (III) the necessary programming abstractions; (IV) experimentation and fair comparison; and (V) observability by design. In Sect. 4, we explained how each RSP4J module addresses a subset of the requirements. Unlike the state-of-the-art RSP prototypes, which only satisfy requirement R\(_2\) by providing declarative access, RSP4J fulfills all the requirements (cf Sect. 4). Moreover, two RSP engines already bind to RSP4J: (i) YASPER^{Footnote 12}, which is a strawman implementation based on Apache Commons RDF^{Footnote 2}, and (ii) C-SPARQL 2.0^{Footnote 13}, a new version of the C-SPARQL engine [5].

Roadmap. RSP4J’s future work includes a number of initiatives. We plan to bind even more engines, i.e., Morph\(_{stream}\) and CQELS, and to run a reproducibility challenge in the context of the upcoming Stream Reasoning Workshop. In the mid term, we would like to abstract the RSP4J specification and provide access in other languages, e.g., Python. In the long term, we would like to include abstractions to control the stream publication lifecycle [29]. Moreover, we would like to investigate how to combine RSP4J with other stream reasoning frameworks [6].

Notes

1. Also known as execution semantics.
2. http://commons.apache.org/proper/commons-rdf/.
3. https://github.com/streamreasoning/rsp4j.
4. https://w3id.org/rsp4j.
5. For a comprehensive analysis we suggest [21].
6. The current window identified by \(\mathbb {W}\) with the oldest closing time instant at t.
7. https://www.w3.org/TR/rdf-sparql-query/#specifyingDataset.
8. https://en.wikipedia.org/wiki/Straw_man_proposal.
9. The RSP W3C Community Group has started working towards a common syntax and semantics for RSP (https://github.com/streamreasoning/RSP-QL).
10. Slowly evolving RDF graphs are represented as (named) Time-Varying Graphs too.
11. RSPQL determines the set of evaluation time instants ET with respect to the reporting policy and the input data. Instead, RSP4J advances time as it receives data, i.e., by consuming the streams. Thus, RSP4J's ET is built progressively. While RSPQL's ET is deterministic, RSP4J's ET might not be deterministic in the case of distributed computations.
12. https://github.com/streamreasoning/rsp4j/tree/master/yasper.
13. https://github.com/streamreasoning/csparql2.

References

1. Affetti, L., Tommasini, R., Margara, A., Cugola, G., Della Valle, E.: Defining the execution semantics of stream processing engines. J. Big Data 4, 12 (2017)
2. Akidau, T., et al.: The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing (2015)
3. Ali, M.I., Gao, F., Mileo, A.: CityBench: a configurable benchmark to evaluate RSP engines using smart city datasets. In: Arenas, M., et al. (eds.) ISWC 2015, Part II. LNCS, vol. 9367, pp. 374–389. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25010-6_25
4. Arasu, A., Babu, S., Widom, J.: The CQL continuous query language: semantic foundations and query execution. VLDB J. 15(2), 121–142 (2006)
5. Barbieri, D.F., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: C-SPARQL: a continuous query language for RDF data streams. Int. J. Semant. Comput. 4(1), 3–25 (2010)
6. Beck, H., Dao-Tran, M., Eiter, T., Fink, M.: LARS: a logic-based framework for analyzing reasoning over streams. In: Bonet, B., Koenig, S. (eds.) Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 25–30 Jan 2015, Austin, Texas, USA, pp. 1431–1438. AAAI Press (2015)
7. Botan, I., Derakhshan, R., Dindar, N., Haas, L.M., Miller, R.J., Tatbul, N.: SECRET: a model for analysis of the execution semantics of stream processing systems. PVLDB 3(1), 232–243 (2010)
8. Calbimonte, J.P., Jeung, H., Corcho, O., Aberer, K.: Enabling query technologies for the semantic sensor web. Int. J. Semant. Web Inf. Syst. (IJSWIS) 8(1), 43–63 (2012)
9. Della Valle, E., Ceri, S., van Harmelen, F., Fensel, D.: It’s a streaming world! reasoning upon rapidly changing information. IEEE Intell. Syst. 24(6), 83–89 (2009)
10. Della Valle, E., Dell’Aglio, D., Margara, A.: Taming velocity and variety simultaneously in big data with stream reasoning. In: DEBS, pp. 394–401. ACM (2016)
11. Della Valle, E., Tommasini, R., Balduini, M.: Engineering of web stream processing applications. In: d’Amato, C., Theobald, M. (eds.) Reasoning Web 2018. LNCS, vol. 11078, pp. 223–226. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00338-8_8
12. Dell’Aglio, D., Calbimonte, J.-P., Balduini, M., Corcho, O., Della Valle, E.: On correctness in RDF stream processor benchmarking. In: Alani, H., et al. (eds.) ISWC 2013, Part II. LNCS, vol. 8219, pp. 326–342. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41338-4_21
13. Dell’Aglio, D., Della Valle, E., van Harmelen, F., Bernstein, A.: Stream reasoning: a survey and outlook. Data Sci. 1(1–2), 59–83 (2017)
14. Dell’Aglio, D., Della Valle, E., Calbimonte, J., Corcho, Ó.: RSP-QL semantics: a unifying query model to explain heterogeneity of RDF stream processing systems. Int. J. Semant. Web Inf. Syst. 10(4), 17–44 (2014)
15. Falzone, E., Tommasini, R., Della Valle, E.: Stream reasoning: from theory to practice. In: Manna, M., Pieris, A. (eds.) Reasoning Web 2020. LNCS, vol. 12258, pp. 85–108. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-60067-9_4
16. Horridge, M., Bechhofer, S.: The OWL API: a Java API for OWL ontologies. Semant. Web 2(1), 11–21 (2011)
17. Karau, H.: Unifying the open big data world: the possibilities\(^{\ast}\) of Apache BEAM. In: 2017 IEEE International Conference on Big Data, BigData 2017, Boston, MA, USA, 11–14 Dec 2017, p. 3981. IEEE Computer Society (2017)
18. Kolchin, M., Wetz, P., Kiesling, E., Tjoa, A.M.: YABench: a comprehensive framework for RDF stream processor correctness and performance assessment. In: Bozzon, A., Cudre-Maroux, P., Pautasso, C. (eds.) ICWE 2016. LNCS, vol. 9671, pp. 280–298. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-38791-8_16
19. Le-Phuoc, D., Dao-Tran, M., Xavier Parreira, J., Hauswirth, M.: A native and adaptive approach for unified processing of linked streams and linked data. In: Aroyo, L., et al. (eds.) ISWC 2011, Part I. LNCS, vol. 7031, pp. 370–388. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25073-6_24
20. Parsia, B., Matentzoglu, N., Gonçalves, R.S., Glimm, B., Steigmiller, A.: The OWL reasoner evaluation (ORE) 2015 competition report. J. Autom. Reasoning 59(4), 455–482 (2017)
21. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. (TODS) 34(3), 1–45 (2009)
22. Le-Phuoc, D., Dao-Tran, M., Pham, M.-D., Boncz, P., Eiter, T., Fink, M.: Linked stream data processing engines: facts and figures. In: Cudré-Mauroux, P., et al. (eds.) ISWC 2012, Part II. LNCS, vol. 7650, pp. 300–312. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35173-0_20
23. Ren, X., Curé, O.: Strider: a hybrid adaptive distributed RDF stream processing engine. In: d’Amato, C., et al. (eds.) ISWC 2017, Part I. LNCS, vol. 10587, pp. 559–576. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68288-4_33
24. Scharrenbach, T., Urbani, J., Margara, A., Della Valle, E., Bernstein, A.: Seven commandments for benchmarking semantic flow processing systems. In: Cimiano, P., Corcho, O., Presutti, V., Hollink, L., Rudolph, S. (eds.) ESWC 2013. LNCS, vol. 7882, pp. 305–319. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38288-8_21
25. Terry, D.B., Goldberg, D., Nichols, D.A., Oki, B.M.: Continuous queries over append-only databases. In: Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, 2–5 June 1992, pp. 321–330. ACM Press (1992)
26. Tommasini, R., Balduini, M., Della Valle, E.: Towards a benchmark for expressive stream reasoning. In: Joint Proceedings of RSP and QuWeDa Workshops co-located with 14th ESWC 2017, vol. 1870, pp. 26–36 (2017)
27. Tommasini, R., Della Valle, E., Balduini, M., Dell’Aglio, D.: Heaven: a framework for systematic comparative research approach for RSP engines. In: Sack, H., Blomqvist, E., d’Aquin, M., Ghidini, C., Ponzetto, S.P., Lange, C. (eds.) ESWC 2016. LNCS, vol. 9678, pp. 250–265. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-34129-3_16
28. Tommasini, R., Della Valle, E., Mauri, A., Brambilla, M.: RSPLab: RDF stream processing benchmarking made easy. In: d’Amato, C., et al. (eds.) ISWC 2017, Part II. LNCS, vol. 10588, pp. 202–209. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68204-4_21
29. Tommasini, R., Ragab, M., Falcetta, A., Valle, E.D., Sakr, S.: A first step towards a streaming linked data life-cycle. In: Pan, J.Z., et al. (eds.) ISWC 2020, Part II. LNCS, vol. 12507, pp. 634–650. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-62466-8_39
30. Tommasini, R., et al.: VoCaLS: vocabulary and catalog of linked streams. In: Vrandečić, D., et al. (eds.) ISWC 2018, Part II. LNCS, vol. 11137, pp. 256–272. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00668-6_16
31. Walavalkar, O., Joshi, A., Finin, T., Yesha, Y., et al.: Streaming knowledge bases. In: Proceedings of the Fourth International Workshop on Scalable Semantic Web Knowledge Base Systems (2008)
32. Zhang, Y., Duc, P.M., Corcho, O., Calbimonte, J.-P.: SRBench: a streaming RDF/SPARQL benchmark. In: Cudré-Mauroux, P., et al. (eds.) ISWC 2012, Part I. LNCS, vol. 7649, pp. 641–657. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35176-1_40

Acknowledgment

Dr. Tommasini acknowledges support from the European Social Fund via IT Academy program, and from the European Regional Development Funds via the Mobilitas Plus programme (grant MOBTT75). Moreover, the authors would like to acknowledge the support of Robin Keskisärkkä and Daniele Dell’Aglio in earlier versions of this work.

Author information

Authors and Affiliations

Data System Group, University of Tartu, Tartu, Estonia
Riccardo Tommasini
DEIB, Politecnico di Milano, Milan, Italy
Emanuele Della Valle
Ghent University - imec, Ghent, Belgium
Pieter Bonte & Femke Ongenae

Corresponding author

Correspondence to Riccardo Tommasini.

Editor information

Editors and Affiliations

Ghent University, Ghent, Belgium
Ruben Verborgh
Aalborg University, Aalborg, Denmark
Katja Hose
University of Mannheim, Mannheim, Germany
Heiko Paulheim
ERCIM, Sophia Antipolis, France
Pierre-Antoine Champin
University of Siegen, Siegen, Germany
Maria Maleshkova
Universidad Politécnica de Madrid, Boadilla del Monte, Spain
Oscar Corcho
eBay Inc., San Jose, CA, USA
Petar Ristoski
FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, Eggenstein-Leopoldshafen, Germany
Mehwish Alam

About this paper

Cite this paper

Tommasini, R., Bonte, P., Ongenae, F., Della Valle, E. (2021). RSP4J: An API for RDF Stream Processing. In: Verborgh, R., et al. The Semantic Web. ESWC 2021. Lecture Notes in Computer Science, vol 12731. Springer, Cham. https://doi.org/10.1007/978-3-030-77385-4_34

DOI: https://doi.org/10.1007/978-3-030-77385-4_34
Published: 31 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77384-7
Online ISBN: 978-3-030-77385-4
eBook Packages: Computer Science (R0)
