
11.1 Motivation

Programming Semantic Web applications requires first learning a complex application programming interface of a Semantic Web framework such as Jena (Wilkinson et al. 2003). Furthermore, syntax errors in data files and in queries are only detected when the corresponding program instructions are executed and the Semantic Web framework reports error messages. Developing stable Semantic Web programs therefore requires extensive tests, which must cover as many branches of the program execution and as much of the possible input data as possible. With our contribution, we address the following types of errors by a tool for embedding Semantic Web languages into programming languages, which uses a static program analysis to avoid or detect these types of errors and therefore supports the development of more stable programs: Queries with semantic errors do not generate error messages, but lead to unexpected behaviour of the Semantic Web application. Note that queries with semantic errors often return the empty result set for every input, that is, these queries are unsatisfiable, which is a hint of a semantic error in the query. Furthermore, query results often contain data of a certain data type, for example, numeric values, which have to be further processed in data-type-dependent operations such as the summation of numeric values. Therefore, a cast to a specific programming-language-dependent type is necessary. If an erroneous query returns values of other types than expected, cast errors might occur at runtime.

So far, suitable tools for embedding Semantic Web data and query languages into existing programming languages, which go beyond the simple use of application programming interfaces and take advantage of an additional program analysis at compile time, have been missing. All static program analyses for detecting errors in the embedded languages can be performed at compile time, before the application is actually executed, by starting the precompiler, offering the greatest possible convenience to the programmer: without embedded languages, some of these checks either require additional tools and copying and pasting code fragments, or require executing the program itself. A static program analysis can detect errors that may otherwise only be detected after running a huge number of test cases, as the static program analysis considers every branch in the application code. Our tool, which we call the Semantic Web Objects system (SWOBE), embeds

  • The Semantic Web data language RDF/XML

  • The query language SPARQL

  • The update language SPARUL

into the Java programming language. SWOBE supports the development of more stable Semantic Web applications by

  • Providing transparent usage of Semantic Web data and query languages without requiring users to have a deep knowledge of application programming interfaces of Semantic Web frameworks

  • Checking the syntax of the embedded languages for the detection of syntax errors already at compile time

  • A static type check of embedded data constructs for guaranteeing type safety

  • A satisfiability test of embedded queries for the detection of semantic errors in the embedded queries already at compile time

  • A determination of the types of query results for guaranteeing type safety and thus avoiding cast errors.

A demonstration of the SWOBE precompiler is available online at Groppe and Neumann (2008), the example SWOBE programs of which cover embedding of RDF/XML constructs [see Assistant.swb, Student.swb, University.swb and Professor.swb at Groppe and Neumann (2008)], SPARQL queries (see TestStudent.swb and QueryTest.swb), and SPARUL queries (see Assistant.swb, Student.swb, University.swb, Benchmark.swb, Professor.swb, and UpdateTest.swb).

11.3 Embedding Semantic Web Languages into Java

We first provide an overview of SWOBE and demonstrate the features of SWOBE by an example. Afterward, we explain the ideas and concepts of SWOBE in detail in the following subsections. We refer the interested reader to the specifications of RDF/XML (Beckett 2004), SPARQL (Prud’hommeaux and Seaborne 2008), and SPARUL (Seaborne and Manjunath 2008) for an introduction to the embedded data, query, and update languages.

Figure 11.1 contains an example SWOBE program, which uses the RDF format to describe information about students and the courses they take [see lines (16–32) of Fig. 11.1]. The type of the embedded RDF data is defined in lines (6–11) of Fig. 11.1. An embedded SPARQL query [see lines (33–39) of Fig. 11.1] asks for the telephone numbers of those students who take at least one course. Additionally, the names and the short names of the courses taken by each student are contained in the query result. Afterward, the telephone numbers of the students and the names or short names of the courses are stored in arrays by iterating through the query result [see lines (41–48) of Fig. 11.1].

Fig. 11.1
figure 1_11

A SWOBE example program. The boldface part contains specific SWOBE expressions

Figure 11.2 depicts the architecture of the SWOBE precompiler.

Fig. 11.2
figure 2_11

The architecture of the SWOBE precompiler

The SWOBE precompiler first parses the SWOBE program according to the Java 1.6 grammar with the extension of embedded RDF/XML constructs [e.g., lines (16–21) and lines (22–32) of Fig. 11.1], prefix declarations [e.g., lines (2–5) of Fig. 11.1], and SPARQL/-UL queries [e.g., the SPARQL query in lines (33–39) of Fig. 11.1]. Syntax errors of embedded RDF/XML constructs and SPARQL/-UL queries are already detected and reported in this phase at compile time; when using plain Java 1.6 with Semantic Web application programming interfaces, such errors are only detected at runtime, possibly only after running extensive tests.

We assume S to be the type of the right side of an assignment of RDF data to a variable (lines (16–32) of Fig. 11.1) and T to be the variable type. The type system of the SWOBE precompiler then checks whether or not S conforms to T, that is, if S is a subtype of T.

The satisfiability tester of the SWOBE precompiler afterward checks whether the result of the embedded SPARQL/-UL queries (e.g., lines (33–39) of Fig. 11.1 for a SPARQL query) is empty for any input based on the type of the input data.

The SWOBE precompiler then determines the Java types of the results of the embedded SPARQL queries and uses these Java types for generating special iterators for query results. In the example of Fig. 11.1, the SWOBE precompiler generates an iterator for the result of the SPARQL query in lines (33–39). The iterator contains the special methods int getY() and String getZ() for accessing the results of the variables Y and Z, respectively. Note that the SWOBE precompiler determines the result type of Y to be int and of Z to be String based on the type Students of the input data #students# of the SPARQL query.
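To illustrate the shape of such generated iterators, the following is a minimal sketch in plain Java of the kind of iterator class a precompiler could emit for a query binding an int variable Y and a String variable Z. All names (QueryResultIterator, Row) are hypothetical; SWOBE's actual generated classes additionally wrap the Jena API.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of a generated iterator for a query result with
// variables ?Y (int) and ?Z (String). Illustrative only, not SWOBE's output.
class QueryResultIterator {
    // One row of the query result: the bound values for ?Y and ?Z.
    static final class Row {
        private final int y;
        private final String z;
        Row(int y, String z) { this.y = y; this.z = z; }
        int getY() { return y; }      // statically typed accessor for ?Y
        String getZ() { return z; }   // statically typed accessor for ?Z
    }

    private final Iterator<Row> rows;
    QueryResultIterator(List<Row> rows) { this.rows = rows.iterator(); }
    boolean hasNext() { return rows.hasNext(); }
    Row next() { return rows.next(); }
}
```

Because the accessors are statically typed, misuse of a query result (e.g., treating ?Y as a String) becomes a Java compile-time error instead of a runtime cast error.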

Finally, the SWOBE precompiler transforms the SWOBE program into Java classes. The generated Java classes use the application programming interface (API) of an existing Semantic Web framework. Our SWOBE precompiler currently supports the API of the widely used Jena Semantic Web framework (Wilkinson et al. 2003), but can easily be tailored to support the APIs of other Semantic Web frameworks. The generated Java classes are the main class corresponding to the SWOBE program, iterator classes for query results such as the one for the query in Fig. 11.1, and helper classes whose methods are called by the main class.

11.3.1 The Type System

RDFS/OWL ontologies are designed to handle incomplete information. This design decision leads to the following two phenomena:

  1. OWL and RDFS ontologies do not provide any constraints for entities that are not typed. For example, the triple (s, rdf:type, c) types the entity s to be of class c. If an ontology is given that imposes constraints on members of the class c, such as a maximal cardinality of one for a property color of entities of the class c, then the triples (s, color, blue) and (s, color, red) are inconsistent with this ontology. However, no entity, neither _:b1 nor _:b2, is typed in the triple set {(_:b1, uni:name, “UniversityOfLübeck”), (_:b1, uni:institute, _:b2), (_:b2, uni:name, “IFIS”)}, such that no ontology can impose constraints on the triples of this triple set. Thus, this triple set conforms to any ontology.

  2. Even if an entity is typed, a given ontology does not impose any constraints on properties and objects that are not listed in the ontology. Thus, a fact (s, p, o) is still consistent with a given RDFS/OWL ontology even when (s, rdf:type, c) holds and the RDFS/OWL ontology contains no constraints about the object o or predicate p for members of the class c.

However, if the types S and T were described by ontologies, then a check whether a type S is a subtype of another type T would not consider the triples not described by an ontology according to phenomena (1) and (2). In this case, we could only state that S is a subtype of T except for triples according to phenomena (1) and (2). Additionally, the satisfiability test of embedded SPARQL queries based on a given type for the input triples would only detect queries that are unsatisfiable for triples other than those of phenomena (1) and (2); we cannot guarantee the exclusion of triples according to phenomena (1) and (2). Furthermore, the determination of the query result types based on a given type for the input triples fails to consider the triples according to phenomena (1) and (2).

Therefore, we propose a type system that avoids the two aforementioned phenomena of RDFS/OWL ontologies. Note that our type system still supports incomplete information by allowing arbitrary triples where this is explicitly stated.

Our developed language for defining the types of embedded RDF data conforms to the EBNF rules of Fig. 11.3.

Fig. 11.3
figure 3_11

EBNF rules for defining types of embedded RDF data

We can define the types of triple sets with this language. If the type is ANY, then there are no restrictions on the triple set. Types for single triples consist of three basic types for the subject, predicate, and object of a triple. A basic type is a concrete URI, a literal, or an XML Schema data type. We can exclude values from basic types; for example, string \ “Fritz” allows all strings except “Fritz”. Furthermore, if A and B are types for triple sets, then A | B, A ∪ B, A*, A+, A? and (A) are also types for triple sets: A | B allows triples of type A or B. A triple set V conforms to a type A ∪ B if triple sets V1 and V2 exist such that V1 ∩ V2 = {}, V = V1 ∪ V2, V1 conforms to type A, and V2 conforms to type B. Arbitrary repetitions can be expressed by using A* for including zero repetitions, A+ for at least one repetition, and A? for zero or one repetition. Bracketed expressions (A) allow specifying explicit priorities between the operators of a type, for example, (A ∪ B)*. References to named types may be used in a type for reusing already defined types.
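The type constructors above can be modeled as a small algebraic data type. The following Java sketch is a reconstruction under the assumption that the frequency operators follow regular-expression semantics; it also includes a nullability check in the spirit of the function isNullable of Fig. 11.6 (whether a type allows the empty triple set). It is illustrative, not SWOBE's actual implementation.

```java
// Minimal model of the triple-set type language: ANY, single-triple types,
// alternation A | B, disjoint union A ∪ B, and frequency operators *, +, ?.
abstract class TripleSetType {
    abstract boolean isNullable(); // does the type allow the empty triple set?
}
class Any extends TripleSetType {
    boolean isNullable() { return true; } // no restrictions, empty set allowed
}
class TripleType extends TripleSetType {
    final String s, p, o; // basic types of subject, predicate, object
    TripleType(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    boolean isNullable() { return false; } // a single triple is never empty
}
class Or extends TripleSetType { // A | B: conforms to A or to B
    final TripleSetType a, b;
    Or(TripleSetType a, TripleSetType b) { this.a = a; this.b = b; }
    boolean isNullable() { return a.isNullable() || b.isNullable(); }
}
class Union extends TripleSetType { // A ∪ B: disjoint parts for A and for B
    final TripleSetType a, b;
    Union(TripleSetType a, TripleSetType b) { this.a = a; this.b = b; }
    boolean isNullable() { return a.isNullable() && b.isNullable(); }
}
class Star extends TripleSetType { // A*: zero or more repetitions
    final TripleSetType a;
    Star(TripleSetType a) { this.a = a; }
    boolean isNullable() { return true; }
}
class Plus extends TripleSetType { // A+: at least one repetition
    final TripleSetType a;
    Plus(TripleSetType a) { this.a = a; }
    boolean isNullable() { return a.isNullable(); }
}
class Opt extends TripleSetType { // A?: zero or one occurrence
    final TripleSetType a;
    Opt(TripleSetType a) { this.a = a; }
    boolean isNullable() { return true; }
}
```

For instance, a type `new Plus(new TripleType(...))` does not allow the empty triple set, while wrapping the same type in `Star` or `Opt` does.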

We further allow types for predicate-object lists, the triples of which share the same subject, and for object lists, the triples of which share the same subject and the same predicate. For simplicity of presentation, we do not present these extensions here: they complicate the algorithms for subtype tests due to the additional cases to be considered, but do not provide new insights into subtype tests.

In the example of Fig. 11.1, the type definition Course [see lines (6–7) of Fig. 11.1] describes the RDF data about a course at a university. The type definition Student [see lines (8–10) of Fig. 11.1] describes the RDF data about a student in a university, and the type definition Students [see line (11) of Fig. 11.1] describes a group of students.

11.3.2 Subtype Test

Whenever a variable is assigned RDF data [e.g., lines (22–32) of Fig. 11.1], we can determine the type of this assigned RDF data. The type S of the assigned RDF data [e.g., the assigned RDF data in lines (22–32) of Fig. 11.1] is the union of the types of the generated triples and the types of the embedded variables containing RDF data [e.g., #course in line (24)]. The subtype test is used for checking whether or not the type S of the assigned RDF data [e.g., the assigned data in lines (22–32) of Fig. 11.1] conforms to an expected type T [e.g., the type rdf<Students> of the variable students in line (11) of Fig. 11.1] for the content of the assigned variable. In general, the subtype test checks whether or not a type S is a subtype of another type T; that is, whether or not all possible input data of type S are also of type T.

We first simplify the type definitions S and T according to the formulas presented in Fig. 11.4, such that superfluous brackets are eliminated and subexpressions of the form A θ1 θ2, where θ1, θ2 ∈ {+, *, ?}, are transformed into subexpressions A θ3 with a single frequency operator θ3.
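Assuming regular-expression semantics for the frequency operators, the collapse of two nested operators A θ1 θ2 into a single operator A θ3 can be sketched as follows; the exact rules of Fig. 11.4 may differ in presentation.

```java
class FrequencySimplifier {
    // Collapse two nested frequency operators into one, assuming
    // regular-expression semantics: the result is '+' only if both operators
    // are '+' (at least one repetition is preserved), '?' only if both are '?'
    // (still at most one occurrence), and '*' for every other combination,
    // since e.g. (A+)? and (A?)+ both allow zero or more repetitions.
    static char collapse(char inner, char outer) {
        if (inner == '+' && outer == '+') return '+';
        if (inner == '?' && outer == '?') return '?';
        return '*';
    }
}
```

For example, (A+)? collapses to A*, because the outer ? permits zero occurrences while the inner + permits arbitrarily many.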

Fig. 11.4
figure 4_11

Simplifying type definitions, where A is a type definition

The algorithm checkSubType(S, T) (see Fig. 11.5) performs the task of checking if S is a subtype of T. In the special case that the type S describes an empty triple set [see line (2) of Fig. 11.5], it is tested whether or not T allows the empty triple set by using the function isNullable (see Fig. 11.6). If T allows any input, then any type is a subtype of T [see line (3) of Fig. 11.5]. If T does not allow any input but S does, then S cannot be a subtype of T [see line (4) of Fig. 11.5]. Otherwise, we first transform the types S and T into tree representations tree(S) and tree(T) by using a function tree; see Fig. 11.7 for the recursive definition of the function tree and Fig. 11.8 for an example of its result.

Fig. 11.5
figure 5_11

Main Algorithm for the test if S is a subtype of T

Fig. 11.6
figure 6_11

Algorithm isNullable, where A and B are type definitions, s, p, and o the types of the subject, predicate, and object of a triple

Fig. 11.7
figure 7_11

Transforming a type definition into a tree representation. T1 and T2 represent type definitions and (s, p, o) the type of a triple

Fig. 11.8
figure 8_11

Homomorphism from a type S representing the type of the assigned data in lines (22–32) of Fig. 11.1 on the right side of this figure to a type T representing the type rdf<Students> of the variable students in line (11) of Fig. 11.1 on the left side of this figure

Checking if S is a subtype of T can be reformulated into the problem of finding a homomorphism [see line (5) of Fig. 11.5] from the tree representation tree(S) of S to the tree representation tree(T) of T. We first describe an optimized algorithm for quickly finding such a homomorphism and afterward describe the constraints of the homomorphism in detail for the subtype relation between S and T. We use the tree representation of T and S and a corresponding homomorphism in the example of Fig. 11.8.

A subtype relation is already proved as soon as one homomorphism between the two types is found.

We propose to search for a homomorphism between two types in two phases. The first phase determines all single candidate mappings from subexpressions of S to subexpressions of T. The candidate mappings can be determined quickly, in time O(|S|*|T|), where |V| denotes the length of a type V, that is, the number of subexpressions of V. The algorithm visits the tree representation of S bottom-up in order to find suitable subexpressions in T. While visiting S bottom-up, the algorithm considers the mappings already found from the current node’s children in the tree representation of S to nodes in the tree representation of T.

Afterward, we determine suitable subsets of the candidate mappings, which are checked by the algorithm homomorphism (see Fig. 11.9) as explained in the next paragraph. This second phase is designed to quickly exclude unsuitable subsets of the candidate mappings; furthermore, we first check those subsets of candidate mappings that are most promising to form a homomorphism. As a subexpression in S must not be mapped to two different subexpressions in T, we exclude such subsets of the candidate mappings. In order to exclude further subsets of candidate mappings early, we afterward consider candidate mappings for a subexpression in S that is mapped to a minimal number of subexpressions in T in the remaining set of candidate mappings. In order to abort the search for a homomorphism in unsuitable subsets of the candidate mappings as early as possible, we do not consider a possible subset of candidate mappings further if a mapping (s, t) is in the subset C of candidate mappings and there exist a child s1 of s and a mapping (s1, t1) ∈ C such that t ≠ t1 and t1 is not a subexpression of t. In this case, the constraints imposed by the homomorphism of the subtype relation can no longer be fulfilled.

Fig. 11.9
figure 9_11

Function homomorphism for checking if m describes a homomorphism from S to T

The function homomorphism (see Fig. 11.9) expects the types T and S and a mapping m as input. The function first checks if each subexpression of S is mapped to only one subexpression of T [see line (2) of Fig. 11.9], as otherwise the mapping would be ambiguous, with some exceptions: One exception is a subexpression with a frequency operator [see line (4) of Fig. 11.9]: if s is a subtype of t, then s is also a subtype of t θ, where θ ∈ {+, *, ?}. Another exception is a subexpression with an or-operator: if s is a subtype of t, then s is also a subtype of t | t’ [see line (4) of Fig. 11.9]. The last exception is a subexpression with a union-operator with at least one operand that allows the empty expression [see line (5) of Fig. 11.9]: if s is a subtype of t, then s can also be a subtype of t ∪ t’ if isNullable(t) or isNullable(t’) holds, and we later check whether or not s is really a subtype of t ∪ t’. Afterward, the function homomorphism checks if the type S is mapped to the type T and if all subexpressions SExpr(S) (see Fig. 11.10) of S are mapped to subexpressions of T, fulfilling further constraints checked in the function isMappingOfHomomorphism for each single mapping entry [see lines (7–8) of Fig. 11.9].

Fig. 11.10
figure 10_11

Function SExpr, where A, A1, …, An are type definitions and (s, p, o) is the type of a triple

The function isMappingOfHomomorphism checks if the given mapping m from type S to type T describes a part of a homomorphism from S to T. If S is composed of two subtypes S1 and S2 in an or-relation S1 | S2, then there should exist mappings from S1 and S2 to T [see line (2) of Fig. 11.11] for a subtype relation. If T is composed of two subtypes T1 and T2 in an or-relation T1 | T2, then there should exist at least one mapping from S to T1 or T2 [see line (3) of Fig. 11.11] for a subtype relation.

Fig. 11.11
figure 11_11

Function isMappingOfHomomorphism for checking if m describes a homomorphism from s to t

We present examples of subtype tests between types containing union operations in Fig. 11.12. Example (a) in Fig. 11.12 is the simplest case: S and T are each the union of two subtypes, and each union-operand of S must be a subtype of a distinct union-operand of T. The examples (b)–(d) in Fig. 11.12 become more complicated, since the union-operator of T can be arbitrarily repeated (it is in the scope of a * or a + operator). In example (b), several union-operands of S must be subtypes of the same union-operand of T. If T consists of a union of types that can be arbitrarily repeated [see e.g., (c)] and the union-operator in S has more operands than the union-operator in T, then several pairwise disjoint decompositions of the union-operands of S must be subtypes of T. Furthermore, if an operand of the union-operator of T can be arbitrarily repeated as in example (d), then several union-operands of S can be subtypes of this arbitrarily repeatable union-operand of T. If an operand of the union-operator of T allows the empty triple set as in example (d), then no union-operand of S needs to be a subtype of this union-operand of T. In order to deal with all these cases, we check the following condition: if S and T are composed of types in a union-relation [line (4) of Fig. 11.11], then there should exist a pairwise disjoint decomposition of the operands of the union operator [line (6) of Fig. 11.11], such that the number of decompositions is the same as the number of union-operands of T or, in the case that the subexpression of the union-operator can be arbitrarily repeated [line (5) of Fig. 11.11], that is, it is in the scope of a * or a + operator (see Fig. 11.13), a multiple thereof. Furthermore, all these decompositions must be mapped to the corresponding union-operands of T according to the number of repetitions [line (7) of Fig. 11.11], and all union-operands of T should have a candidate mapping or should allow the empty triple set [line (8) of Fig. 11.11].

Fig. 11.12
figure 12_11

Different examples for the subtype test when checking union operations. Here, t1, …, tp are subexpressions of T, s1, …, sn are subexpressions of S, S1’, …, Sk’ are sets of the decomposition, si is a subtype of t((i-1) mod p)+1, and s0 is a subtype of t1

Fig. 11.13
figure 13_11

Function repetition(t, T) for the determination whether or not the subexpression t is part of a subexpression of the type T, which can be arbitrarily repeated

If the types S and T describe constraints for single triples, then each triple element, that is, the subject, predicate, and object, of S must be a subtype of the corresponding triple element of T [lines (9) and (10) of Fig. 11.11]; for example, xsd:long is a subtype of xsd:decimal according to the type hierarchy of XML Schema data types (see Peterson et al. 2009). In the case that T contains an operator +, *, or ? [lines (11–14) of Fig. 11.11], we exclude those candidate mappings whose frequency is in conflict with a subtype relation [line (12) of Fig. 11.11]. A mapping from S to T is not in conflict with a subtype relation if both have the same frequency (see Fig. 11.14 for the computation of the frequency of a type) or if the frequency of S is lower than the frequency of T, that is, freq(S) <f freq(T), where the transitive relation <f is given by ONE <f ? <f + <f *. Afterward, we check if the corresponding subexpressions are in the mapping m [lines (13) and (14) of Fig. 11.11].
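The frequency compatibility check just described can be sketched as a rank comparison over the transitive order ONE <f ? <f + <f *; the encoding below is illustrative.

```java
class FrequencyOrder {
    // Rank encoding of the transitive order ONE <f ? <f + <f * from the text.
    // "ONE" stands for exactly one occurrence (no frequency operator).
    private static int rank(String f) {
        switch (f) {
            case "ONE": return 0;
            case "?":   return 1;
            case "+":   return 2;
            case "*":   return 3;
            default: throw new IllegalArgumentException("unknown frequency: " + f);
        }
    }

    // A mapping from S to T is not in conflict with the subtype relation
    // if both have the same frequency or freq(S) <f freq(T).
    static boolean compatible(String freqS, String freqT) {
        return rank(freqS) <= rank(freqT);
    }
}
```

For example, a subexpression occurring exactly once (ONE) may be mapped to a *-repeated subexpression of T, but not vice versa.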

Fig. 11.14
figure 14_11

Function freq, where A and B are type definitions, s is the type of the subject, p of the predicate, and o of the object of a triple

11.3.3 Satisfiability Test of Embedded SPARQL and SPARUL Queries

Erroneous queries often return the empty result set for any input, that is, these queries are unsatisfiable. Therefore, an unsatisfiable query is a hint of errors in the query. Satisfiability tests of queries can (1) warn the user of errors in queries and help debug SWOBE programs, thus leading to more stable programs, and (2) allow unsatisfiable queries to be precomputed to the empty result at compile time, thus avoiding runtime processing and speeding up program execution.

Note that SPARUL extends the syntax and semantics of SPARQL by update queries, such that the approaches described below apply to both SPARQL and SPARUL queries. For checking the satisfiability of embedded queries, we first transform abbreviations of SPARQL/-UL constructs, that is, predicate-object lists, object lists, collections, and the a operator, into their equivalent long forms (see Groppe et al. 2009d). We replace blank nodes by variables that are not used elsewhere in the SPARQL/-UL query, according to Gutierrez et al. (2004). After this step, each triple pattern of the SPARQL/-UL query has the form e1 e2 e3., where each ei is an IRI, a literal (including string and numeric constants), or a variable.
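The expansion of abbreviations can be sketched as follows for predicate-object lists and the a keyword (which abbreviates rdf:type). This sketch operates on pre-tokenized data; the real precompiler performs the transformation on the parse tree.

```java
import java.util.ArrayList;
import java.util.List;

class AbbreviationExpander {
    static final String RDF_TYPE =
        "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>";

    // Expand a predicate-object list for one subject into individual triple
    // patterns, replacing the SPARQL keyword "a" by rdf:type.
    // Each entry of predObjPairs is a {predicate, object} pair.
    static List<String[]> expand(String subject, String[][] predObjPairs) {
        List<String[]> triples = new ArrayList<>();
        for (String[] po : predObjPairs) {
            String pred = po[0].equals("a") ? RDF_TYPE : po[0];
            triples.add(new String[] { subject, pred, po[1] });
        }
        return triples;
    }
}
```

For example, the abbreviated pattern `?s a uni:Student ; uni:name ?n .` expands into the two triple patterns `?s rdf:type uni:Student .` and `?s uni:name ?n .`.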

We determine the type D of the input data of the embedded query by a static program analysis. Figure 11.15 presents the satisfiability test and the determination of the query result types for the embedded SPARQL query in lines (33–39) of Fig. 11.1; the determined type for the variable ?Y is integer and the determined type for the variable ?Z is string. A triple pattern e1 e2 e3 is satisfiable if the type D of the input data contains types of triples that intersect with e1 e2 e3. We can determine all possible types of the variables in the triple pattern by checking all types of triples in D that intersect with the triple pattern e1 e2 e3. Thus, we can use types(e1 e2 e3.) to determine the types of the variables in triple patterns (see Fig. 11.16). If the set of variable types is empty, then this triple pattern is unsatisfiable.
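The intersection of a triple pattern with a triple type, together with the collection of variable types, can be sketched as follows. As a simplification, constants are compared with the elements of the triple type by string equality; the full function intersect of Fig. 11.16 also checks whether a constant belongs to a given data type.

```java
import java.util.HashMap;
import java.util.Map;

class TriplePatternTypes {
    // Check whether a triple pattern (elements are variables "?x" or
    // constants) intersects a triple type (elements are concrete values or
    // data type names). Returns the variable-to-type bindings, or null if
    // the pattern does not intersect the triple type.
    static Map<String, String> match(String[] pattern, String[] tripleType) {
        Map<String, String> bindings = new HashMap<>();
        for (int i = 0; i < 3; i++) {
            if (pattern[i].startsWith("?")) {
                // A variable intersects any element; record its possible type.
                bindings.put(pattern[i], tripleType[i]);
            } else if (!pattern[i].equals(tripleType[i])) {
                return null; // constant does not intersect the triple type
            }
        }
        return bindings;
    }
}
```

If no triple type of the input data type D yields a non-null result for a triple pattern, the set of variable types stays empty and the pattern is reported as unsatisfiable.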

Fig. 11.15
figure 15_11

Example of the satisfiability test and the determination of the query result types for the embedded SPARQL query in lines (33–39) of Fig. 11.1

Fig. 11.16
figure 16_11

The function types determining the types of variables in a SPARQL query, and the function intersect checking if elements of a triple type and a triple pattern intersect. A and B are SPARQL subexpressions, and ei is the type of an element of a triple type or of a triple pattern

The satisfiability of queries can be determined by the function types of Fig. 11.16. Note that sat(Expr, types(A)) is a satisfiability tester for Boolean expressions Expr under the data type constraints types(A) of the variables in Expr. Such a satisfiability tester sat(Expr, types(A)) has a high computational complexity (see Cook 1971). However, the results obtained without using such a satisfiability tester for FILTER expressions are typically quite good, such that the application of such a satisfiability tester can be avoided to speed up computation.

If the result of the function types contains the empty set for the types of at least one variable, then the SPARQL/-UL query is unsatisfiable and we can warn the user.

11.3.4 Determination of the Query Result Types

We have already determined the possible types of the result of an embedded SPARQL query when testing satisfiability. For returning the result through an iterator, we have to determine a supertype of these possible types. This supertype is then the return type of the iterator method.
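Determining a common supertype can be sketched as a least-common-ancestor walk in the type hierarchy. The hierarchy fragment below is an illustrative excerpt of the XML Schema numeric tower with a hypothetical common root anySimpleType; the full system uses the complete XML Schema hierarchy.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class ResultTypeLub {
    // Fragment of the XML Schema type hierarchy (child -> parent).
    // "anySimpleType" serves as the common root; the fragment is illustrative.
    static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("int", "long");
        PARENT.put("long", "integer");
        PARENT.put("integer", "decimal");
        PARENT.put("decimal", "anySimpleType");
        PARENT.put("string", "anySimpleType");
    }

    // Least common supertype of a and b: collect the ancestor chain of a,
    // then return the first ancestor of b that occurs in it.
    static String lub(String a, String b) {
        Set<String> ancestorsOfA = new LinkedHashSet<>();
        for (String t = a; t != null; t = PARENT.get(t)) ancestorsOfA.add(t);
        for (String t = b; t != null; t = PARENT.get(t))
            if (ancestorsOfA.contains(t)) return t;
        return "anySimpleType"; // fall back to the common root
    }
}
```

For example, if a variable can be bound to values of type int in one triple pattern and of type decimal in another, the iterator's return type must cover decimal.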

We present in Fig. 11.15 the determination of the query result types for the embedded SPARQL query in lines (33–39) of Fig. 11.1. The type for the variable ?Y in the query result is integer and the type for the variable ?Z is string.

Once we have determined the return type, we can generate code for a query result iterator with this return type, such that the type system of Java guarantees type safety for all usages of the result. In the example of Fig. 11.15, the Java type for the variable ?Y is int and the Java type of the variable ?Z is String.
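The final step from determined XML Schema result types to Java types can be sketched as a simple lookup table. The fragment below is illustrative and makes an assumption the text also makes for the example (mapping xsd:integer to int); it is not SWOBE's actual table.

```java
import java.util.HashMap;
import java.util.Map;

class XsdToJava {
    // Illustrative fragment of a mapping from XML Schema data types to the
    // Java types of generated iterator accessors.
    static final Map<String, String> MAPPING = new HashMap<>();
    static {
        MAPPING.put("xsd:integer", "int");   // as in the example of Fig. 11.15
        MAPPING.put("xsd:int", "int");
        MAPPING.put("xsd:long", "long");
        MAPPING.put("xsd:double", "double");
        MAPPING.put("xsd:boolean", "boolean");
        MAPPING.put("xsd:string", "String");
    }

    // Unknown types fall back to String, i.e., the lexical form of the value.
    static String javaTypeOf(String xsdType) {
        return MAPPING.getOrDefault(xsdType, "String");
    }
}
```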

11.4 Summary and Conclusions

We have proposed an approach for supporting the development of more stable Semantic Web applications by embedding the Semantic Web languages RDF/XML, SPARQL, and SPARUL into the Java programming language.

Our Semantic Web Objects system (SWOBE) uses a static program analysis in order to guarantee type safety, detect unsatisfiable SPARQL/-UL queries, and determine the types of query results at compile time. In this way, we avoid runtime errors and unexpected behavior of the Semantic Web application. Our implementation of the SWOBE system shows the advantages of our approach as a programming tool.