1 Introduction

RDF (Resource Description Framework) is a metadata model recommended by the W3C (World Wide Web Consortium) [1]. As the de-facto standard for metadata management, RDF is widely applied, and a large amount of RDF data is now available, so efficiently managing massive RDF data is increasingly important [2]. It should be noted that, as a metadata model, RDF is applied to model data semantics in an environment of open information extraction (OIE). A typical example is the construction of a knowledge graph, a type of knowledge base represented with RDF, from diverse data sources [3]. In an open environment (e.g., the Web), data describing the same entity may be out-of-date, erroneous, or conflicting across sources [4]. Integrating different data sources can therefore result in imperfect (e.g., inconsistent, imprecise, or uncertain) information, even when the original data are accurate. The traditional RDF model always assumes that RDF data are crisp and reliable, and thus falls short in representing and handling uncertain semantics.

Information in the real world is subject to imperfection. Modeling fuzzy data has been extensively studied in diverse data models, for example, relational databases [5], object-oriented databases [6], XML (eXtensible Markup Language) [7], description logics [8], and ontologies [9]. Readers can refer to [10, 11] for surveys of modeling fuzzy data in database models and conceptual data models, respectively; a survey on modeling fuzzy data in the XML data model is presented in [12]. In addition, some efforts have been devoted to fuzzy RDF data modeling and management. These efforts mainly concentrate on the following aspects: fuzzy RDF models (e.g., [13,14,15,16]), fuzzy RDF reasoning (e.g., [17, 18]), fuzzy RDF persistence in databases (e.g., [19,20,21]), and fuzzy RDF querying [22, 23] (including query languages [24,25,26], subgraph matching [27,28,29,30], and others [4]).

Aggregate functions perform calculations on a set of values and return a single summarized value. Aggregations are important for various applications, such as data mining, decision-making, and data analytics. In the context of databases and XML [31,32,33], aggregate operations have been added to SQL, the standard query language for relational databases, and to XQuery, the standard query language recommended by the W3C for XML. Such aggregate operations allow large volumes of data to be summed up and are very useful in many scenarios, such as information retrieval and decision-making. For the RDF data model, SPARQL (SPARQL Protocol and RDF Query Language), recommended by the W3C, is the predominant RDF query language. More importantly, the latest SPARQL 1.1 specification explicitly supports aggregation operations in RDF data querying [34]. Aggregate functions such as Min, Max, Avg, Sum, and Count are very important for value computation and RDF data management.

Aiming to make aggregate functions handle uncertain values, aggregate queries over uncertain data have gained a great deal of attention in the context of uncertain databases [35,36,37,38,39] and uncertain XML data [40,41,42]. Compared with fuzzy XML, fuzzy RDF has a richer syntactic framework and has recently been used as the carrier of uncertain knowledge graphs and graph-structured data. It is therefore essential to support aggregation queries over fuzzy RDF for uncertain data analysis. However, to the best of our knowledge, there are few efforts on aggregate queries over fuzzy RDF data, and there are no aggregate functions for fuzzy RDF data that can handle fuzzy values. This paper tries to fill this gap. We investigate fuzzy RDF queries with SPARQL that contain fuzzy aggregation functions. This paper introduces a fuzzy RDF graph model and then presents a fuzzy SPARQL language for flexible RDF querying. On this basis, we discuss queries with aggregate functions over fuzzy RDF, explore the processing of fuzzy values in an aggregate query, and propose three variants of the typical aggregate functions (Min, Max, Avg, Sum, and Count) over fuzzy RDF: lower bound aggregation, upper bound aggregation, and expectation aggregation. We formally define these variant aggregate functions from two perspectives: full aggregation and group aggregation. In particular, we propose approaches for implementing these variant aggregate functions in the context of fuzzy RDF graphs based on two strategies: translation and stored procedures.

The remainder of the paper is organized as follows. Section 2 presents related work. We give preliminaries on the aggregate operations of SPARQL and the fuzzy RDF data model in Sect. 3. In Sect. 4, we propose three variants of the five aggregate functions for SPARQL and investigate how to process these aggregate operations over the fuzzy RDF model. Section 5 provides an experimental evaluation of fuzzy aggregate queries and shows that the approach proposed in this paper can effectively implement aggregate queries over fuzzy RDF. Section 6 concludes the paper.

2 Related work

This paper concentrates on queries with aggregate functions over fuzzy RDF data, which are closely related to aggregate operations over uncertain data, querying with SPARQL, and subgraph matching based on the fuzzy RDF model.

2.1 Aggregate operations over uncertain data

A significant problem in uncertain databases is answering aggregate queries. The aggregation problem is considered in [35] using an imprecise probabilistic relational database model, in which imprecision and uncertainty are represented by partial probabilities and probability distributions, respectively. By utilizing the Kullback–Leibler information divergence between the aggregated probability distribution and the individual tuple values, aggregation operators over probabilistic relational databases can uncover uncertain attribute properties. Focusing on the probabilistic relational database model with the Dempster–Shafer representation of evidence, a mechanism based on the aggregation of evidence is provided in [36], which can resolve inconsistencies. In order to provide information about the properties and patterns of the data, the aggregation operators are developed according to an attribute-driven approach. In [37], aggregation in uncertain and probabilistic databases is investigated in a database management system called Trio, which can store and query uncertain data and lineage data. To solve the problem of exponentially sized results of "exact" aggregation in uncertain databases, three alternatives are provided in [37]: a lower bound on the aggregate value, an upper bound on the value, and the expected value. These variants return a single value rather than a set of possible values, and they can be applied to full-table aggregation queries as well as grouped aggregation queries. The formal definitions, semantics, and approaches for single-table aggregation queries are provided in [37]. To solve the problem of exponentially sized aggregation results, probability-based entropy is introduced in [39] as a kind of aggregate function. On this basis, probability-based aggregation and probabilistic entropy functions are used to enhance aggregation in uncertain databases.
In [38], aggregate queries are evaluated in a fuzzy relational database framework under possibilistic certainty. In this kind of uncertain database model, a notion called necessity is applied, which qualifies the certainty that an ill-known datum takes a given value or belongs to a given subset. For the aggregation operators avg, sum, min, and max, three cases are considered in [38]: a precise aggregate over uncertain data, an imprecise aggregate over crisp data, and an imprecise aggregate over uncertain data. As a result, an aggregate may be an uncertain value, a disjunction of uncertain values/intervals, or a fuzzy set.

Few efforts have been made to deal with aggregate queries in the uncertain XML model. In [41, 42], a model unifying the various (discrete) semi-structured probabilistic models is applied. Based on this probabilistic XML model, algorithms are proposed for computing the distribution of the aggregation values and the probabilistic moments (e.g., expectation and variance) of this distribution. The discrete model is further extended to a continuous one so that continuous data values can be considered, and algorithms to compute distribution functions and moments for various classes of continuous distributions of data values are also presented.

2.2 Fuzzy RDF

We can identify three types of fuzzy RDF models. Let (s, p, o) be an RDF triple, where s is a subject, p is a predicate, and o is an object. First, in the fuzzy RDF model proposed in [13, 16], each triple is annotated with a membership degree in [0, 1] and has the form (s, p, o, μ), where μ ∈ [0, 1] is a membership degree. Second, in the fuzzy RDF model proposed in [14, 15], the three components of a triple are annotated with separate membership degrees in [0, 1], and a fuzzy RDF triple has the form (μs/s, μp/p, μo/o). Here, μs ∈ [0, 1] denotes the membership degree of a fuzzy subject s to the fuzzy RDF graph; μp ∈ [0, 1] indicates a fuzzy predicate p related to the fuzzy occurrence of the corresponding subject; μo ∈ [0, 1] represents the degree of a fuzzy object o to the appointed property. The syntax and semantics of this fuzzy RDF model are presented in [14]. The fuzzy RDF syntax is mainly based on fuzzy RDF/XML syntax, and the fuzzy RDF semantics comprises fuzzy interpretation and fuzzy entailment. For the fuzzy RDF graph, three types of algebraic operations are defined in [15]. Third, in the fuzzy RDF model proposed in [19], only the object component of a triple is annotated with a membership degree, and a triple has the form (s, p, (o, µ)).

To query fuzzy RDF data with SPARQL, fuzzy conditions in the FILTER clause, fuzzy regular graph patterns that include fuzzy regular expressions over paths, and fuzzy quantifiers are introduced in [24]. The resulting extended SPARQL language is called FURQL (Fuzzy RDF Query Language), which uses fuzzy extensions of the SPARQL graph patterns to express fuzzy preferences on the entities of a fuzzy RDF graph (through fuzzy conditions) and on the structure of the graph (through fuzzy regular expressions). Building on FURQL, the issue of integrating fuzzy quantified queries of the type "Q B X are A" is investigated in [26], and a query processing strategy based on the derivation of non-quantified fuzzy queries is proposed. In [25], FURQL is enhanced with a user profile to support the personalization of fuzzy SPARQL queries over fuzzy RDF databases. To increase the discrimination power of the interaction, a model is proposed in [4] that allows complex queries to be formulated gradually by browsing and exploring fuzzy RDF data.

There have been some efforts on aggregate queries over uncertain data. However, effective aggregate queries over fuzzy RDF remain an open question. Some work is devoted to fuzzy RDF queries with SPARQL, and as a result, a fuzzy SPARQL syntax is formally defined. On this basis, aiming to implement aggregation queries over fuzzy RDF, an extension of SPARQL to query fuzzy RDF data, called FSA-SPARQL, is presented in [43]. FSA-SPARQL involves different fuzzy connectives, fuzzy operators such as VERY, MORE_OR_LESS, CLOSE_TO, AT_MOST, and AT_LEAST, as well as fuzzy aggregators such as MEAN, WSUM, WMIN, and WMAX. Such extensions can express fuzzy queries that use fuzzy connectives and (aggregation) operators and implement aggregation queries based on standard SPARQL over fuzzy RDF. Note that fuzzy aggregate query languages like FSA-SPARQL do not fully consider the impact of fuzzy values on an aggregate query: they only use fuzzy values as a threshold to filter query results and cannot directly describe the fuzzy characteristics of data in the query results. This paper investigates a more extensible approach, which can aggregate the fuzzy values contained in query results and also analyze the upper and lower bounds and expectations of the fuzzy values, so that different requirements of aggregation queries over fuzzy RDF data can be met.

3 Preliminaries

3.1 RDF model and SPARQL language

RDF is a metadata model for representing and exchanging semantic information. Using metadata about information sources, the RDF model can provide a description of the relationships among resources in terms of attributes and values that are uniquely identified. The universe of the RDF data model is defined as a set of resources. A resource can be anything with a universal resource identifier (URI), described with some RDF statements in the form of triples. We can identify two types of RDF syntax: RDF abstract syntax [1] and RDF/XML syntax [44], where the RDF/XML syntax encodes RDF in XML. The abstract syntax of the RDF model is a set of triples. This paper pays attention to the abstract representation of the RDF model with graphs. Formally, let \(\mathcal{V}\) be a vocabulary partitioned into three pairwise disjoint sets \(\mathcal{U}\), \(\mathcal{B}\) and \(\mathcal{L}\). Here \(\mathcal{U}\), \(\mathcal{B}\) and \(\mathcal{L}\) denote a set of URI references (or URIrefs), a set of blank nodes, and a set of literals, respectively. Moreover, \(\mathcal{L}\) is further partitioned into two disjoint sets: \(\mathcal{L}_\mathcal{P}\) (a set of plain literals) and \(\mathcal{L}_\mathcal{T}\) (a set of typed literals). Hence the sets \(\mathcal{U}\), \(\mathcal{B}\), \(\mathcal{L}_\mathcal{P}\), and \(\mathcal{L}_\mathcal{T}\) are pairwise disjoint, i.e., the intersection of any two of them is Φ.

Definition

(RDF triple): Let t be an RDF triple; then t ∈ (\(\mathcal{U}\) ∪ \(\mathcal{B}\)) × \(\mathcal{U}\) × \(\mathcal{V}\). The first component is called the subject, the second the predicate (or property), and the third the object.

A triple (s, p, o) declares a statement, and its interpretation is that s has a property p with the value o. A set of RDF triples forms an RDF model, which is represented by an RDF graph.

Definition

(RDF graph): An RDF graph is a labeled and directed graph that is represented as G = (N, E, Σ, L), where N is a finite set of nodes, E ⊂ N × N is a set of directed edges, Σ is a set of labels, and L: N ∪ E → Σ is a function that assigns labels to nodes and edges, respectively.

A set of RDF triples forms an RDF graph, in which the nodes represent subjects or objects of triples, and the oriented edges from subject nodes to object nodes correspond to the predicates of triples.

SPARQL is the standard query language for RDF data. Like SQL for relational databases, a SPARQL query has the basic form SELECT-FROM-WHERE. Here, the SELECT clause lists the variables returned in the result; the FROM clause specifies the dataset, given as a set of IRIs of RDF documents; the WHERE clause gives the query condition. Note that the query condition in SPARQL is described by a graph pattern. A basic graph pattern, the simplest graph pattern, consists of a set of triple patterns; a triple pattern is a triple with one or more variables inside. From basic graph patterns, we can further construct combinatorial graph patterns, optional graph patterns, and multiple graph patterns as more complex query conditions.

SPARQL 1.1 specification supports aggregate functions, including Min, Max, Avg, Sum, and Count. The formal representations of these aggregate functions are:

Min/Max/Avg/Sum ([ALL|DISTINCT] expression) and

Count ([ALL|DISTINCT] identification-variable)/Count (*).

The DISTINCT clause is used to eliminate duplicate values; ALL is the default option, which does not eliminate duplicates. Other than Count, all the aggregate functions ignore null values. Two return types of these aggregate functions can be identified: basic type and wrapper type. Min and Max can operate on data types such as number, string, or date/time, and their return values have the corresponding data types. Avg and Sum can only operate on numeric data types, and their return values are also numeric. In addition, Count can operate on any data type and returns a value of the integer data type. Note that Min, Max, Avg, and Sum return a null value when applied to an empty set, whereas Count returns 0 for an empty set.
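The empty-set behavior described above can be mimicked with a few helper functions; these names are illustrative sketches, not part of any SPARQL engine:

```python
# Sketch of SPARQL's empty-set semantics: Min/Max/Avg/Sum yield NULL
# (modeled here as None) on an empty multiset, while Count yields 0.
def sparql_count(values):
    return len(values)

def sparql_sum(values):
    return sum(values) if values else None

def sparql_avg(values):
    return sum(values) / len(values) if values else None

def sparql_min(values):
    return min(values) if values else None
```

For example, `sparql_sum([])` is `None` (NULL), while `sparql_count([])` is `0`, matching the distinction drawn above.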

3.2 Fuzzy sets and fuzzy RDF graph

Fuzzy sets, originally introduced by Zadeh [45, 46], are an important means of computational intelligence [47]. Let F be a fuzzy set in a universe of discourse U, defined by a membership function μF: U → [0, 1], where μF (u) (u ∈ U) denotes the degree to which u belongs to F. Then F can be formally represented as follows.

$$ F\, = \,\left\{ {\left( {u_{1} ,\mu_{F} \left( {u_{1} } \right)} \right),\left( {u_{2} ,\mu_{F} \left( {u_{2} } \right)} \right), \ldots ,\left( {u_{n} ,\mu_{F} \left( {u_{n} } \right)} \right)} \right\} $$

Like a classical set C = {u1, u2, …, un}, F is a set of elements. However, each element of F is a pair of values (ui, μF (ui)) instead of a single value ui in C (ui ∈ U). This means that ui may belong to F, but not with certainty; a membership degree is therefore necessary, explicitly represented by μF (ui) (0 ≤ μF (ui) ≤ 1). In C, ui must belong to C, and its membership degree to C is exactly 1. Clearly, when all membership degrees in F are exactly 1, F reduces to a conventional set such as C.
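As a minimal sketch, a fuzzy set can be represented as a mapping from elements to membership degrees; the element names u1, u2, u3 below are hypothetical:

```python
# A fuzzy set as a mapping from elements to membership degrees in [0, 1].
F = {"u1": 0.3, "u2": 0.7, "u3": 1.0}

def membership(fuzzy_set, u):
    """mu_F(u): the degree to which u belongs to F (0 if u is absent)."""
    return fuzzy_set.get(u, 0.0)

def is_crisp(fuzzy_set):
    """F reduces to a classical set when every degree is exactly 1."""
    return all(mu == 1.0 for mu in fuzzy_set.values())

# A classical set is the special case where all degrees are 1.
C = {"u1": 1.0, "u2": 1.0}
```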

To represent and handle fuzzy RDF data, a fuzzy RDF data model is proposed in [28], in which the object component of a triple may be annotated with a degree in [0, 1], modeling the fuzziness of objects in RDF triples. A triple of the form (s, p, (o, µ)) is called a fuzzy RDF triple (statement), in which s, p, and o are the subject, predicate, and fuzzy object, respectively; µ ∈ [0, 1] denotes the membership degree. A set of fuzzy RDF triples can be represented by a fuzzy RDF graph.

Definition

(Fuzzy RDF graph) [28]: A fuzzy RDF graph is a labeled and directed graph denoted as GF = (N, E, Σ, L, μ). Here N is a finite set of nodes, E ⊂ N × N is a set of directed edges, Σ is a set of labels, L: N ∪ E → Σ is a function assigning labels to nodes and edges, respectively, and μ: E → [0, 1] is a membership function over the edges.

Let GF be a fuzzy RDF graph, v a node of GF, and L(v) the label of v. In a fuzzy RDF triple, L(v) can be the label of a subject/object. We identify two types of nodes: subject nodes and object nodes. For a subject node vi and an object node vj in GF, (vi, vj) ∈ E denotes a directed edge from vi to vj, which has a label L(vi, vj); L(vi, vj) is actually the predicate of a fuzzy RDF triple. For an object node vj with label L(vj), L(vj) may be associated with a membership degree µj, indicating the degree to which vj has the label L(vj) with respect to L(vi, vj). To represent this degree clearly, in the fuzzy RDF graph of [28], the membership degree is assigned to L(vi, vj) instead of L(vj). Note that a crisp RDF graph can be regarded as a special case of a fuzzy RDF graph: a fuzzy RDF graph turns into a crisp one if, for all (vi, vj) ∈ E, we have μ: E → {0, 1} instead of μ: E → [0, 1].

A sample of the fuzzy RDF data model presented in [28] is shown in Fig. 1, which includes 21 fuzzy triples and 13 predicates. In the fuzzy RDF graph, a fuzzy predicate is denoted by a notation “/” and a membership degree within (0, 1]. This expression connects a subject and an object, representing a fuzzy triple with a membership degree. Such an expression can be shortened to only a predicate when its membership degree is 1.

Fig. 1 Fuzzy RDF triples and fuzzy graph [28]

4 Aggregate functions over fuzzy RDF data

In standard SPARQL, there are five aggregate functions: Min, Max, Avg, Sum, and Count. To deal with fuzzy RDF data with SPARQL, we follow the approach of [37] and provide three variants of each aggregate function: the lower bound of aggregation, the upper bound of aggregation, and the expected aggregation. We further specify these variants from two perspectives, full aggregation and group aggregation, depending on whether they operate on all or part of the triples of the given fuzzy RDF dataset; the GROUP BY clause of SPARQL describes the partial triples. We propose processing approaches for the five aggregate functions and their variants according to three strategies: enumeration, translation, and stored procedures.

4.1 Identifying variants of aggregate functions

4.1.1 Variants of full aggregation

For each of the five aggregate functions Min, Max, Avg, Sum, and Count, three variants are introduced: the lower bound of aggregation, the upper bound of aggregation, and the expected aggregation. In terms of the returned values of these three variants, we have lower bound of aggregation ≤ expected aggregation ≤ upper bound of aggregation. Let QA be an aggregation query on a set of fuzzy RDF triples U.

First, we have five aggregate functions for the lower bound of aggregation, denoted by L-Count, L-Sum, L-Avg, L-Min, and L-Max, respectively. Each returns a single value: the minimum aggregate value of QA over the non-empty possible instances. Note that for Min, Max, Avg, and Sum, an option of QA can be empty only if the aggregated element is of a numeric data type. For empty instances, Count returns 0, which means that such instances are not taken into account. When the only option in QA is empty, the lower bound of aggregation returns NULL.

Example 1.

Figure 2 presents an example of fuzzy RDF, and we use a SPARQL query to retrieve the lower bound of the average squirrel length. Suppose we have the following query on the fuzzy RDF shown in Fig. 2.

$$ \begin{aligned} & {\text{SELECT}}\;L {\text{-}} Avg\left( {?length} \right){\text{ WHERE}}\{ \\ & \quad ?{\text{animal species squirrel}}. \\ & \quad ?{\text{animal length }}?length. \\ & \} \\ \end{aligned} $$
Fig. 2 An example of fuzzy RDF

Finally, we have a result of 18 because the length of animal1 is unusable, the length of animal2 is 16, and the length of animal3 is 20.
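The arithmetic of this example can be reproduced with a small sketch. The per-animal option lists below are assumptions reconstructed from Examples 1 and 2 (animal1's length binding may be absent; animal2 has options 16 and 18), not values read directly from Fig. 2:

```python
# L-Avg sketch: drop indefinite bindings and take each definite binding's
# minimum option, then average. Per Example 1, animal1's length is
# unusable, animal2 contributes 16, and animal3 contributes 20.
def l_avg(bindings):
    mins = [min(options) for definite, options in bindings if definite]
    return sum(mins) / len(mins) if mins else None  # NULL on an empty set

bindings = [
    (False, [20]),      # animal1: indefinite, excluded from the lower bound
    (True,  [16, 18]),  # animal2: definite, minimum option is 16
    (True,  [20]),      # animal3: definite
]
```

With these assumptions, `l_avg(bindings)` gives (16 + 20) / 2 = 18, matching the example.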

Similarly, we have five aggregate functions for the upper bound of aggregation, denoted by U-Count, U-Sum, U-Avg, U-Min, and U-Max, respectively. Each returns a single value: the maximum aggregate value of QA over the non-empty possible instances.

Example 2.

Figure 2 presents an example of fuzzy RDF, and we apply a SPARQL query to retrieve the upper bound of the average squirrel length. Suppose we have the following query over the fuzzy RDF shown in Fig. 2.

$$ \begin{aligned} & \quad {\text{SELECT}}\;U {\text{-}} Avg\left( {?length} \right){\text{ WHERE}}\{ \\ & \quad ?{\text{animal species squirrel}}. \\ & \quad ?{\text{animal length }}?length. \\ & \} \\ \end{aligned} $$

Then we have a result of 19.33 because the length of animal1 is 20, the length of animal2 is 18, and the length of animal3 is 20.
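The upper bound keeps every binding, including the indefinite one, at its maximum option; the option lists below are the same assumptions reconstructed from Examples 1 and 2, not values read directly from Fig. 2:

```python
# U-Avg sketch: include all bindings at their maximum option, then average.
# Per Example 2: animal1 = 20, animal2 = 18, animal3 = 20.
def u_avg(bindings):
    maxes = [max(options) for _, options in bindings]
    return sum(maxes) / len(maxes) if maxes else None

bindings = [
    (False, [20]),      # animal1: indefinite, but included at its value
    (True,  [16, 18]),  # animal2: maximum option is 18
    (True,  [20]),      # animal3
]
```

This gives (20 + 18 + 20) / 3 ≈ 19.33, matching the example.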

Finally, we have five aggregate functions for the expected aggregation, denoted by E-Count, E-Sum, E-Avg, E-Min, and E-Max, respectively. The value of an expected aggregation is a weighted average of all non-empty options in the corresponding exhaustive results of QA. Here the weights come from the membership degrees of the corresponding fuzzy RDF triples.

Example 3.

For the fuzzy RDF in Fig. 2, suppose we have the following query.

$$ \begin{aligned} & \quad {\text{SELECT}}\;E {\text{-}} Avg\left( {?length} \right){\text{ WHERE}}\{ \\ & \quad ?{\text{animal species squirrel}}. \\ & \quad ?{\text{animal length }}?length. \\ & \} \\ \end{aligned} $$

Then we have a result of 19.16, which is calculated as 19.33 × 0.72 + 18.67 × 0.18 + 19 × 0.08 + 18 × 0.02 = 19.16.
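The weighted sum above amounts to enumerating the possible worlds. In the sketch below, the membership degrees (0.9 for animal1's length being present; 0.8 and 0.2 for animal2 being 18 or 16) are back-derived from the example's weights (e.g., 0.72 = 0.9 × 0.8); they are assumptions consistent with the arithmetic, not values read from Fig. 2:

```python
from itertools import product

# E-Avg sketch: enumerate possible worlds and weight each world's Avg by
# the product of the per-binding degrees. None marks an absent binding.
worlds = [
    [(20, 0.9), (None, 0.1)],  # animal1: present with degree 0.9, else absent
    [(18, 0.8), (16, 0.2)],    # animal2: two options
    [(20, 1.0)],               # animal3: crisp
]

def e_avg(worlds):
    expected = 0.0
    for combo in product(*worlds):
        values = [v for v, _ in combo if v is not None]
        weight = 1.0
        for _, mu in combo:
            weight *= mu
        if values:  # empty instances contribute nothing to Avg
            expected += weight * (sum(values) / len(values))
    return expected
```

Running this reproduces the four worlds of the example (averages 19.33, 18.67, 19, and 18 with weights 0.72, 0.18, 0.08, and 0.02) and yields 19.16.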

4.1.2 Variants of group aggregation

Now we investigate the five aggregate functions operating on the triples described by the GROUP BY clause of SPARQL. A GROUP BY clause contains a list of grouping attributes, denoted G. A group aggregation query produces one element for each value g of G that occurs in the returned results before the aggregation operation. Let QA be an aggregation query. Suppose that G contains a single attribute and QA contains a single aggregate function, and let U be a set of fuzzy RDF triples. Then a group aggregation generates at least one result for each g in G that occurs in a possible instance. Logically speaking, its result is the result of the corresponding full aggregation applied to the selection σ(G=g) (U), further corrected for possible empty instances. At this point, the notation Φ is used rather than 0 to represent the option in which group G = g does not appear in any possible instance.

Now, we illustrate two variants of group aggregation with examples: the lower bound of aggregation and the upper bound of aggregation. Suppose we have the following query on the fuzzy RDF shown in Fig. 2.

$$ \begin{aligned} & \quad {\text{SELECT}}\;L {\text{-}} Avg\left( {?length} \right),\;U {\text{-}} Avg\left( {?length} \right){\text{ WHERE}}\{ \\ & \quad ?{\text{animal species squirrel}}. \\ & \quad ?{\text{animal length }}?length. \\ & \quad ?{\text{animal color }}?color. \\ & \} \\ & \quad {\text{GROUP BY }}?color \\ \end{aligned} $$

This SPARQL query obtains the lower and upper average lengths of squirrels according to their colors. Then we have the result shown in Table 1.

Table 1 Lower bound and upper bound of group Avg aggregation
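Since the concrete entries of Table 1 depend on Fig. 2, the grouping logic can be sketched with hypothetical colors and lengths (all values below are invented for illustration):

```python
from collections import defaultdict

# Group-aggregation sketch: partition bindings by the GROUP BY attribute
# (here ?color), then apply the lower/upper bound per group.
rows = [
    # (color, definite?, length options) -- hypothetical data
    ("gray", True,  [16, 18]),
    ("gray", False, [20]),
    ("red",  True,  [20]),
]

def group_bounds(rows):
    groups = defaultdict(list)
    for color, definite, options in rows:
        groups[color].append((definite, options))
    result = {}
    for color, bindings in groups.items():
        lo = [min(o) for d, o in bindings if d]       # definite minima only
        hi = [max(o) for _, o in bindings]            # all maxima
        result[color] = (sum(lo) / len(lo) if lo else None,  # L-Avg
                         sum(hi) / len(hi))                  # U-Avg
    return result
```

For the "gray" group this yields the pair (16.0, 19.0): the lower bound drops the indefinite binding, while the upper bound includes it.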

Similarly, we illustrate the expected group aggregation with examples. Suppose we have the following query on the fuzzy RDF shown in Fig. 2.

$$ \begin{aligned} & \quad {\text{SELECT E}} {\text{-}} {\text{Avg }}\left( {?{\text{length}}} \right){\text{ WHERE}}\{ \\ & \quad ?{\text{animal species squirrel}}. \\ & \quad ?{\text{animal length }}?{\text{length}}. \\ & \quad ?{\text{animal color }}?{\text{color}}. \\ & \} \\ & \quad {\text{GROUP BY }}?{\text{color}} \\ \end{aligned} $$

This SPARQL query obtains the expected average lengths of squirrels according to their colors. Then we have the result shown in Table 2.

Table 2 Expected group Avg aggregation

4.2 Defining variants of aggregate functions

We have three variants for each of the five aggregate functions Min, Max, Avg, Sum, and Count: the lower bound of aggregation, the upper bound of aggregation, and the expected aggregation. Together with the five original aggregate functions, evaluated by enumeration, we have twenty aggregate functions in total. Each aggregate function can be further identified as full aggregation or group aggregation, where group aggregation is a direct extension of full aggregation. For the processing of these twenty aggregate functions, we have two basic strategies: translation and stored procedures. The relationships between the aggregate functions and the processing strategies are shown in Table 3.

Table 3 Aggregate functions and their processing

It is shown in Table 3 that translation can be applied to ten aggregate functions, and the stored procedure can be applied to nine. Here, translation means that a query with a variant aggregate function can be translated into a semantics-preserving query with standard aggregate functions. Note that neither translation nor the stored procedure can directly deal with the E-Avg aggregate function; we approximate it by dividing E-Sum by E-Count.
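Under one plausible reading of E-Sum and E-Count (expectations computed by linearity over independent bindings), the approximation can be sketched as follows. The data reuses the degrees assumed when reconstructing Example 3 (0.9 for animal1's presence; 0.8/0.2 for animal2's options), which are assumptions, not values from Fig. 2:

```python
# E-Avg approximation as E-Sum / E-Count. By linearity of expectation,
# both pieces are single weighted sums over the bindings.
bindings = [
    # (weighted options, probability that the binding is present)
    ([(20, 0.9)],            0.9),  # animal1
    ([(18, 0.8), (16, 0.2)], 1.0),  # animal2
    ([(20, 1.0)],            1.0),  # animal3
]

e_sum = sum(v * mu for options, _ in bindings for v, mu in options)  # 55.6
e_count = sum(p for _, p in bindings)                                # 2.9
approx_e_avg = e_sum / e_count
```

Here the quotient is 55.6 / 2.9 ≈ 19.17, close to, but not identical with, the exact E-Avg of 19.16 obtained by enumeration in Example 3, which is why dividing E-Sum by E-Count is only an approximation.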

4.2.1 Aggregate functions by enumeration

For a fuzzy RDF triple, say (s, p, (o, μ)), the aggregate function Count by enumeration is evaluated as 1 if s has the value o on p. For this purpose, we apply SELECT ?var WHERE {?var attribute value} to get the corresponding triples. Then, we can get the number of triples that satisfy the binary matching {p = attribute, o = value}. When the number is 1, the membership degree is μ; when the number is 0, the membership degree is 1 − μ. Finally, for all fuzzy RDF triples, the sum of these numbers and the product of the corresponding membership degrees are the value of Count and its membership degree, respectively.
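The product-of-degrees computation just described can be sketched as a distribution over Count values, assuming the triples are independent (the degrees below are illustrative):

```python
from itertools import product

# Count by enumeration: each fuzzy triple matches the pattern with degree
# mu and fails to match with degree 1 - mu; multiplying the per-triple
# degrees over every outcome yields a distribution over Count values.
def count_distribution(mus):
    dist = {}
    outcomes = [[(1, mu), (0, 1 - mu)] for mu in mus]
    for outcome in product(*outcomes):
        k = sum(hit for hit, _ in outcome)
        w = 1.0
        for _, mu in outcome:
            w *= mu
        dist[k] = dist.get(k, 0.0) + w
    return dist
```

For instance, two matching triples with degrees 0.9 and 1.0 give Count = 2 with degree 0.9 and Count = 1 with degree 0.1.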

The aggregate functions Min and Max by enumeration are relevant to the existence of options in the corresponding triples. For triples of the form (s, p, (o, μ)), we would get a maximum o, meaning that there is no other triple whose object is greater than o. The first step is to obtain the membership degree μ of o. We also need to calculate the membership degrees of the other values being less than o. The membership degree that o is the maximum is then the product of these membership degrees.

For the aggregate functions Sum and Avg by enumeration, we still need to calculate the number of options for each triple that meets the binary matching above. The basic idea of Sum and Avg by enumeration is to materialize all possible triples from the original fuzzy triples and form an intermediate fuzzy RDF graph, which is then used to evaluate Sum and Avg. When there are many options, it is challenging to evaluate Sum and Avg by enumeration efficiently. For this reason, in the following section, we introduce the aggregate functions by translation.
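Materializing the possible instances for Sum can be sketched as follows; the option lists and degrees are illustrative assumptions, and the number of combinations doubles with every uncertain triple, which is exactly why enumeration becomes impractical:

```python
from itertools import product

# Sum by enumeration: materialize every possible instance of the fuzzy
# triples and record each instance's Sum together with its degree.
def sum_distribution(worlds):
    dist = {}
    for combo in product(*worlds):
        total = sum(v for v, _ in combo if v is not None)
        w = 1.0
        for _, mu in combo:
            w *= mu
        dist[total] = dist.get(total, 0.0) + w
    return dist

worlds = [
    [(20, 0.9), (None, 0.1)],  # an uncertain triple: value present or absent
    [(18, 0.8), (16, 0.2)],    # a triple with two options
]
```

Two uncertain triples already yield four possible instances (sums 38, 36, 18, and 16); n such triples yield up to 2^n instances.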

4.2.2 Aggregate functions by translation

Different from the aggregate functions by enumeration, the aggregate functions by translation adopt a strategy of rewriting the original variants of aggregate functions. For the lower bound and upper bound variants, the translations choose the options that meet the requirements of each triple in the dataset; among all possible instances, those that attain the minimum/maximum aggregate value are then constructed. For the expected aggregation variant, the translation uses the linearity of expectation [48]: based on each option and its membership degree, its contribution to the expected aggregation over all possible instances can be calculated.

In the following, we call triples without the option Φ (i.e., o in (s, p, (o, μ)) must be non-empty) definite triples, and triples with the option Φ indefinite triples. For a concrete variant of the aggregate functions, we present how it is translated into a corresponding standard SPARQL query. We focus on the major aggregate functions and omit the similar ones.

  1. (1)

    L-Count

    L-Count counts the least possible number of occurrences, so its translation only calculates the number of definite triples. We illustrate the translation of L-Count as follows.

    $$ \begin{gathered} {\text{Query with variant }}L {\text{-}} Count: \;{\text{SELECT }}L {\text{-}} Count\left( {?{\text{var}}} \right){\text{ WHERE }}\left\{ {?{\text{var attribute }}?sth.} \right\} \hfill \\ {\text{Translation}}:\;{\text{SELECT }}Count\left( {{\text{DISTINCT }}?{\text{var1}}} \right) {\text{-}} Count\left( {{\text{DISTINCT }}?{\text{var2}}} \right){\text{ WHERE}} \hfill \\ \{ ?{\text{var1 attribute }}?sth. \, ?{\text{var2 attribute }}\Phi \} \hfill \\ \end{gathered} $$
  2. (2)

    U-Count

    U-Count counts the greatest possible number of occurrences, so its translation calculates the number of all triples. We illustrate the translation of U-Count as follows.

    $$ \begin{gathered} {\text{Query with variant }}U {\text{-}} Count: \;{\text{SELECT }}U {\text{-}} Count\left( {?{\text{var}}} \right){\text{ WHERE }}\left\{ {?{\text{var attribute }}?sth.} \right\} \hfill \\ {\text{Translation}}:\;{\text{SELECT}}(Count\left( {{\text{DISTINCT }}?{\text{var}}} \right){\text{ as }}?{\text{count}}){\text{ WHERE}}\left\{ {?{\text{var attribute }}?{\text{sth}}.} \right\} \hfill \\ \end{gathered} $$
  3. L-Min

    L-Min finds the fuzzy triple (s, p, (o, μ)) that has the minimum value o among all fuzzy triples; similarly, U-Max finds the fuzzy triple with the maximum value o. Their translations simply take the extreme over all matching values. In the following, we only illustrate the translation of L-Min.

    Query with variant L-Min:
    SELECT L-Min(?x) WHERE { ?var x ?x. }
    Translation:
    SELECT Min(?x) WHERE { ?var x ?x. }
  4. L-Max

    Two cases must be distinguished. If the dataset contains only indefinite triples, L-Max returns the minimum value o among all fuzzy triples, since in the worst case no larger value is present. If some definite triples exist, L-Max returns the maximum value o among the definite triples. U-Min can be handled similarly. We illustrate the translation of L-Max in Algorithm 1.

    Query with variant L-Max:
    SELECT L-Max(?x) WHERE { ?var x ?x. }
    Translation: we present the following algorithm (Algorithm 1) for the L-Max translation.
    [Algorithm 1: translation of L-Max]
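The L-Max logic described above can be sketched as follows (a simplification of Algorithm 1 using our representation; we assume each definite triple contributes its smallest option as its guaranteed value):

```python
def l_max(triples):
    """Lower bound of Max, following the two cases above."""
    definite_vals = [min(o for o, _ in opts)
                     for _, _, opts in triples
                     if all(o is not None for o, _ in opts)]
    if definite_vals:
        # some value certainly exists: take the maximum over the definite triples
        return max(definite_vals)
    # only indefinite triples: in the worst case the maximum is the smallest option
    return min(o for _, _, opts in triples for o, _ in opts if o is not None)
```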
  5. L-Sum

    To calculate the lower bound of the Sum aggregate, we sum the minimum o values of all definite triples of the form (s, p, (o, μ)). If there are only indefinite triples, L-Sum is the minimum o value among them. For this purpose, we need to construct the possible instance with this minimum value. We illustrate the translation of L-Sum in Algorithm 2.

    [Algorithm 2: translation of L-Sum]

    U-Sum is handled like L-Sum: the maximum o values of the definite triples of the form (s, p, (o, μ)) are summed. If there are no definite triples, U-Sum is the maximum o value among all indefinite triples.
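The L-Sum/U-Sum behavior described above can be sketched as follows (a simplification of Algorithm 2, with None standing for Φ; representation and names are ours):

```python
def l_sum(triples):
    """Lower bound of Sum: sum the minimum option of every definite triple;
    with no definite triples, the minimum option value of the indefinite ones."""
    definite = [opts for _, _, opts in triples
                if all(o is not None for o, _ in opts)]
    if definite:
        return sum(min(o for o, _ in opts) for opts in definite)
    return min(o for _, _, opts in triples for o, _ in opts if o is not None)

def u_sum(triples):
    """Upper bound of Sum, symmetric to l_sum with maxima instead of minima."""
    definite = [opts for _, _, opts in triples
                if all(o is not None for o, _ in opts)]
    if definite:
        return sum(max(o for o, _ in opts) for opts in definite)
    return max(o for _, _, opts in triples for o, _ in opts if o is not None)
```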

  6. E-Count

    The expected Count is a weighted sum of the counts over all possible instances, where the weights are the membership degrees μ of the corresponding possible instances. For each option of a triple, its membership degree equals the sum of the membership degrees of all possible instances containing that option.

    Query with variant E-Count:
    SELECT E-Count(?var) WHERE { ?var attribute ?sth. }
    Translation:
    SELECT Sum(?degree) WHERE { ?var attribute/?degree ?sth. }

    Here, ?degree is the membership degree attached to the corresponding attribute. After retrieving the triples of the form (s, p, (o, μ)) that satisfy the given condition, we sum their membership degrees μ to obtain E-Count.
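By linearity of expectation, E-Count reduces to a single pass over the membership degrees; a minimal sketch using our representation:

```python
def e_count(triples, attribute):
    """E-Count by linearity of expectation: every non-Φ option contributes its
    membership degree to the expected number of occurrences."""
    return sum(mu for _, p, opts in triples if p == attribute
               for o, mu in opts if o is not None)
```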

  7. E-Sum

    Like E-Count, we calculate the expected sum by summing all non-empty options weighted by their membership degrees. However, the possible empty instance should not simply be included in the expected sum, because without any compensation its sum would be counted as 0. In this paper, we compensate for the possible empty instance, whose possibility is the product of the Φ-option membership degrees of the indefinite triples, by scaling the weighted sum proportionally. We illustrate the translation of E-Sum in Algorithm 3.

    Query with variant E-Sum:
    SELECT E-Sum(?x) WHERE { ?y x ?x. }
    Translation: we present the following algorithm (Algorithm 3) for the E-Sum translation.
    [Algorithm 3: translation of E-Sum]

    In Algorithm 3, Q1 calculates the sum of the weighted non-empty options, and Q2 uses the form of E-Count to sum the membership degrees of the non-empty options. Finally, E-Sum is obtained by scaling the weighted sum proportionally.
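A sketch of E-Sum under one plausible reading of the proportional compensation above (this scaling is our interpretation, not a formula stated by the paper): the weighted sum of the non-empty options is divided by one minus the possibility of the all-empty instance, which conditions the expectation on a non-empty instance:

```python
def e_sum(triples):
    # Q1 in Algorithm 3: weighted sum of all non-empty options
    weighted = sum(o * mu for _, _, opts in triples
                   for o, mu in opts if o is not None)
    # possibility of the all-empty instance; zero once any definite triple exists
    p_empty = 1.0
    for _, _, opts in triples:
        p_empty *= sum(mu for o, mu in opts if o is None)
    if 0.0 < p_empty < 1.0:
        weighted /= (1.0 - p_empty)   # scale to compensate for the empty instance
    return weighted
```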

  8. U-Max with grouping

      As shown above, without grouping, U-Max is the maximum value of o over all triples of the form (s, p, (o, μ)). As a natural extension, U-Max with grouping chooses the maximum value within each group.

      Query with variant U-Max:
      SELECT U-Max(?x) WHERE { ?y x ?x. ?y z ?z. } GROUP BY ?z
      Translation:
      SELECT Max(?x) WHERE { ?y x ?x. ?y z ?z. } GROUP BY ?z
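U-Max with grouping can be sketched as a per-group maximum over the non-Φ option values (representation and names are ours):

```python
from collections import defaultdict

def u_max_grouped(rows):
    """U-Max with grouping: rows are (group_key, options) pairs; within each
    group, take the largest non-Φ option value."""
    best = defaultdict(lambda: float("-inf"))
    for key, opts in rows:
        for o, _ in opts:
            if o is not None:
                best[key] = max(best[key], o)
    return dict(best)
```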
  9. U-Sum with grouping

      First, consider U-Sum without grouping. As with L-Sum, we sum the maximum o values of the definite triples of the form (s, p, (o, μ)); if there are no definite triples, U-Sum is the maximum o value among all indefinite triples. For U-Sum with grouping, we apply this processing within each group. Note that U-Sum is determined only when the grouping element appears in definite triples and has a single option. We illustrate the translation of U-Sum with grouping in Algorithm 4.

      Query with variant U-Sum:
      SELECT U-Sum(?x) WHERE { ?y x ?x. ?y z ?z. } GROUP BY ?z
      Translation: we present the following algorithm (Algorithm 4) for the U-Sum translation.
      [Algorithm 4: translation of U-Sum with grouping]

Other grouping aggregations can be handled like U-Max and U-Sum with grouping above. Here we need to check whether the grouping element values actually occur in the triples; otherwise, extra null values would be included in the calculation and the returned results would deviate.

4.2.3 Aggregate functions with stored procedures

In this section, we discuss how to deal with L-Avg/U-Avg and E-Min/E-Max using stored procedures, from two perspectives: full aggregation and group aggregation.

  1. L-Avg/U-Avg full aggregation

The full aggregation of L-Avg constructs the possible instance with the minimum Avg value. Here we apply an incremental approach to building possible instances. We first calculate the "definite" average for the definite triples, i.e., the average of their minimum object values (queries Q1 and Q2 in Algorithm 5 below). Second, for the indefinite triples with membership degrees, we consider the minimum option value of each indefinite triple and determine whether including it reduces the average; to this end, these values are sorted in ascending order (query Q3 in Algorithm 5 below). The process stops as soon as the average stops falling. The full aggregation of U-Avg is very similar to that of L-Avg. We present the algorithm for L-Avg full aggregation as follows.

[Algorithm 5: L-Avg full aggregation]
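The incremental core of the L-Avg full aggregation can be sketched as follows, assuming the definite average (queries Q1 and Q2) and the sorted minimum options of the indefinite triples (query Q3) are already available; names are ours:

```python
def l_avg(definite_vals, indefinite_mins):
    """Greedy core of the L-Avg full aggregation: start from the average of the
    definite minimum values, then scan the indefinite triples' minimum options
    in ascending order, including a value only while it lowers the average."""
    total, n = float(sum(definite_vals)), len(definite_vals)
    for v in sorted(indefinite_mins):
        if n == 0 or v < total / n:   # including v lowers (or creates) the average
            total, n = total + v, n + 1
        else:
            break                     # the average has stopped falling
    return total / n if n else None
```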
  2. E-Min/E-Max full aggregation

The full aggregation of E-Min considers all options of the elements involved in the aggregation, sorted in ascending order, until either a fuzzy triple (s, p, (o, μ)) without the option Φ (i.e., a definite triple) is found or all values have been traversed. This ensures that every possible instance contains at least one value within the scope considered by the aggregation; beyond this point, no further value can be the minimum in any possible instance. The full aggregation of E-Max is very similar, except that E-Max sorts the options in descending order. We present the algorithm for E-Min full aggregation as follows.

[Algorithm 6: E-Min full aggregation]

For each value v, Algorithm 6 computes the overall membership degree of the possible instances that contain an option with value v and no value lower than v. Weighting each v by this degree and averaging yields the expected Min. Note that a possible empty instance exists only when no definite triple is found; in that case, the expected Min must be offset. In Algorithm 6, after the first loop finishes, the variable cval holds the expected Min, not yet scaled to offset a possible empty instance. The array cum_degree accumulates the non-empty membership degrees of the involved triples. When the loop is over and a possible empty instance exists, the membership degrees of the Φ options are multiplied to obtain the possibility of this empty instance, and cval is scaled accordingly.
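As a cross-check of the expected-Min semantics, the following brute-force sketch (representation is ours) enumerates all possible instances directly; it is exponential in the number of triples, which is exactly the blow-up the linear scan of Algorithm 6 avoids. The possible empty instance is conditioned away by rescaling over the non-empty instances:

```python
from itertools import product

def e_min_bruteforce(triples):
    """Reference E-Min by possible-worlds enumeration."""
    num = den = 0.0
    for choice in product(*(opts for _, _, opts in triples)):
        degree = 1.0
        vals = []
        for o, mu in choice:
            degree *= mu
            if o is not None:
                vals.append(o)
        if vals:                      # skip the possible empty instance ...
            num += degree * min(vals)
            den += degree             # ... and rescale over the non-empty ones
    return num / den if den else None
```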

Algorithm 6 shows that the proposed approach can be implemented by linearly scanning the results of the ORDER BY query. Additional space is needed to accumulate the membership degrees until the covering triples are found; in the worst case, O(n) additional space is needed for n triples.

We can further devise algorithms, very similar to the full-aggregation algorithms above, for the group aggregation of L-Avg, U-Avg, E-Min, and E-Max. The only difference is that group aggregation operates on a given group of triples instead of the whole set. So, in this paper, we do not present the algorithms for the group aggregation of L-Avg, U-Avg, E-Min, and E-Max.

5 Experiments

All 20 variants of aggregate functions proposed in this paper have been implemented on top of the fuzzy RDF data model and the standard SPARQL language, aiming to make the standard aggregate functions of SPARQL 1.1 more flexible and capable of expressing different query intentions over fuzzy RDF. Different aggregate function variants have pros and cons in different application scenarios, and this section evaluates the feasibility and performance of the proposed variants through experiments. First, the experimental environment and datasets are introduced; then, the running time of the queries over the different datasets is analyzed and discussed.

The prototype runs on Windows 10 Professional with an Intel Core i7-9700 3.0 GHz CPU and 32 GB of RAM. Our experiments measured running time with a warm cache: each query is executed once, and then the average time of three successive executions of the same query is taken as its running time.

The datasets used in this experiment came from DBpedia, a part of Linked Data and one of the world's most extensive multi-domain ontologies. Since aggregate queries mostly operate on numerical data, our experiments use geographical coordinate data extracted from DBpedia, mainly composed of city, longitude, and latitude information. All the information in the original datasets is accurate; for the purpose of our experiments, fuzziness was randomly attached to the predicates of triples, and five fuzzy RDF datasets were constructed: dataset1, dataset2, dataset3, dataset4, and dataset5. The five datasets have different sizes, with dataset5 ten times larger than dataset1. The details of these five datasets are shown in Table 4.

Table 4 Description of datasets

The first group of experiments measured the running time of the aggregate function queries proposed in this paper on different datasets. We first executed five expected-variant aggregation queries over the five datasets. These queries are shown as follows.

Q1

SELECT E-MIN(?wgs_pos) WHERE {
<http://dbpedia.org/resource> <http://www.w3.org/2003/01/geo/wgs84_pos> ?wgs_pos.}

Q2

SELECT E-MAX(?wgs_pos) WHERE {
<http://dbpedia.org/resource> <http://www.w3.org/2003/01/geo/wgs84_pos> ?wgs_pos.}

Q3

SELECT E-COUNT(?wgs_pos) WHERE {
<http://dbpedia.org/resource> <http://www.w3.org/2003/01/geo/wgs84_pos> ?wgs_pos.}

Q4

SELECT E-SUM(?wgs_pos) WHERE {
<http://dbpedia.org/resource> <http://www.w3.org/2003/01/geo/wgs84_pos> ?wgs_pos.}

Q5

SELECT E-AVG(?wgs_pos) WHERE {
<http://dbpedia.org/resource> <http://www.w3.org/2003/01/geo/wgs84_pos> ?wgs_pos.}

The functions E-MIN, E-MAX, E-COUNT, E-SUM, and E-AVG in Q1 to Q5 are converted into SPARQL graph patterns for querying. The running times are shown in Fig. 3. The running time of every aggregation grows linearly and is positively correlated with the amount of data covered by the query. Since E-AVG is calculated from E-SUM and E-COUNT, its cost increases significantly as the amount of queried data grows.

Fig. 3

Running time of several expected aggregation functions over datasets (ms)

Second, we executed five lower-bound variant aggregation queries over the five datasets. These queries are shown as follows (the upper-bound aggregations behave similarly to the lower-bound aggregations and are not discussed again).

Q6

SELECT L-MIN(?wgs_pos) WHERE {
<http://dbpedia.org/resource> <http://www.w3.org/2003/01/geo/wgs84_pos> ?wgs_pos
}

Q7

SELECT L-MAX(?wgs_pos) WHERE {
<http://dbpedia.org/resource> <http://www.w3.org/2003/01/geo/wgs84_pos> ?wgs_pos
}

Q8

SELECT L-COUNT(?wgs_pos) WHERE {
<http://dbpedia.org/resource> <http://www.w3.org/2003/01/geo/wgs84_pos> ?wgs_pos
}

Q9

SELECT L-SUM(?wgs_pos) WHERE {
<http://dbpedia.org/resource> <http://www.w3.org/2003/01/geo/wgs84_pos> ?wgs_pos
}

Q10

SELECT L-AVG(?wgs_pos) WHERE {
<http://dbpedia.org/resource> <http://www.w3.org/2003/01/geo/wgs84_pos> ?wgs_pos
}

The functions L-MIN, L-MAX, L-COUNT, L-SUM, and L-AVG in Q6 to Q10 are converted into SPARQL graph patterns for querying. The running times are shown in Fig. 4. Again, the running time of every aggregation grows linearly and is positively correlated with the amount of data covered by the query, and the costs of L-SUM and L-AVG increase significantly as the amount of queried data grows.

Fig. 4

Running time of several lower bound aggregation functions over datasets (ms)

The second group of experiments measured the running time of the aggregate functions under different numbers of uncertain options. An uncertain option refers to an entity or property that the query's central node may point to when aggregate queries are applied to a fuzzy dataset. Keeping the number of triples in dataset1 fixed at 10,000, we adjusted the relationship density of its triples and customized the number of uncertain options for the aggregate queries. We then repeated the aggregation queries of the first group of experiments on the customized dataset1.

The expected and lower-bound variant aggregations in Q1 to Q10 are converted into SPARQL graph patterns and executed over the customized dataset1. The running times are shown in Figs. 5 and 6, respectively. As the figures show, the amount of uncertainty in the dataset has almost no effect on the running time. It can also be observed that when the number of uncertain options of the central node is at its lower bound of 1, E-SUM, L-MAX, and L-COUNT require even longer running times. This is because the number of triples in the dataset is held constant: when the relationship density between entities is low, more entities are needed to keep the number of triples constant. Because E-SUM and L-MAX use sub-query aggregation grouping over all RDF triples, and L-COUNT performs exact counting on triples, their running time increases significantly as the number of entities in the dataset increases.

Fig. 5

Running time of several expectation aggregations under different numbers of uncertain options (ms)

Fig. 6

Running time of several lower bound aggregations under different numbers of uncertain options (ms)

The third group of experiments measured the values of the U-SUM and L-SUM aggregations under different numbers of uncertain options. We applied U-SUM and L-SUM aggregation queries to the customized dataset1. The query results are shown in Fig. 7. As the number of uncertain options increases, the difference between the U-SUM and L-SUM values increases significantly. The experiment shows that U-SUM and L-SUM can serve as an index for measuring the uncertainty of datasets.

Fig. 7

Values of the U-SUM and L-SUM queries under different numbers of uncertain options (k)

In the last group of experiments, to evaluate the extra time consumed by the proposed aggregate functions compared with the classical ones, we compared the running time of the aggregate functions proposed in this paper with that of the corresponding standard SPARQL aggregate queries. The expected and lower-bound variant aggregation queries and the standard SPARQL aggregate queries were all executed on dataset1. The running times are shown in Fig. 8. The standard SPARQL aggregate queries take the shortest time because they do not need to handle the uncertainty of the dataset. The running time of the lower-bound aggregation queries is almost equal to that of the standard SPARQL aggregation queries. The expected aggregation queries take the longest, because they need to multiply the values by their membership degrees.

Fig. 8

Running-time comparison of the variant aggregation functions (ms)

6 Conclusions

The fuzzy RDF model has been proposed to represent and handle fuzzy metadata with semantics, which widely exist in many real-world applications. As part of the analysis and processing of massive uncertain RDF data for information retrieval and decision making, this paper focuses on aggregate functions over fuzzy RDF data. Based on the fuzzy RDF model, three variants of the five aggregate functions (Min, Max, Avg, Sum, and Count) of the RDF query language are proposed: the lower bound of aggregation, the upper bound of aggregation, and the expected aggregation. Together with the five aggregate functions, a total of twenty aggregate functions can be applied to fuzzy RDF data, and each can be used for whole aggregation or grouping aggregation. The paper further investigates how to evaluate these aggregate operations with two basic strategies, translation and stored procedures, and proposes formal approaches accordingly.

Currently, there are no benchmarks for fuzzy RDF data. In this paper, we present our approach to aggregate functions over the fuzzy RDF model and illustrate these functions with examples. We also verify the aggregation functions over a public dataset after fuzzification, showing that the approach is effective for handling fuzzy RDF data. Our approach scales with data size and can satisfy different forms of aggregation queries over uncertain data. The development of a fuzzy RDF benchmark would be an important step toward evaluating our approach on massive fuzzy RDF datasets. In the present paper, we consider SELECT clauses that contain only one aggregate function; in future work, we will investigate how to handle SELECT clauses containing several aggregate functions simultaneously. Also of great interest is the optimization of queries using multiple aggregate functions over fuzzy RDF data.