1 Introduction

In recent years, communities from different areas have published data in the ld cloud following the publication guidelines, providing the basis for creating and populating the Web of Data. Currently, there are approximately 74 billion triples across more than 1,000 datasets.

The increasing availability of these semi-structured and semantically enriched datasets has prompted the need for new tools able to explore, query, analyze and visualize them [5]. While several tools such as graph-based query builders, semantic browsers and exploration tools [24, 8] have emerged to aid the user in querying, browsing and exploring ld, these approaches have a limited ability to summarize, aggregate and display data in the form that a scientific or business user expects, such as tables and graphs. Moreover, they fall short when it comes to providing the user with an overview of the data that may be of interest from an analytical viewpoint.

ld constitutes a valuable source of knowledge worth exploiting with analytical tools. Business Intelligence (bi) uses the md model to view and analyze data in terms of dimensions and measures, which is arguably the most natural way to arrange data for analysis. bi has traditionally been applied to internal, corporate and structured data, which is extracted, transformed and loaded (etl) into a pre-defined and static md model. The relational implementation of the md data model is typically a star schema. The dynamic and semi-structured nature of ld poses several challenges to both potential analysts and current bi tools. On the one hand, exploring the datasets with the available browsers and tools to find md patterns is cumbersome, due to the semi-structured nature of the data and the lack of support for obtaining summaries of the data. On the other hand, since the datasets are dynamic, their structure may change or evolve, making the one-time md design approach unfeasible.

In this paper, we propose md analytical stars as the foundation to query ld. A md analytical star is a md star-shaped pattern at the concept level that encapsulates an interesting md analysis [14]. The star is focused on a subject of analysis and is composed of one or several measures (i.e., measurable attributes on which calculations can be made) and dimensions (i.e., the different analytical perspectives). These stars reflect relevant patterns in the dataset, as the measures and dimensions that compose them are calculated following a statistical approach [15]. To ease the composition of md analytical stars we have developed a web-based prototype tool that allows the user to compose them easily by selecting suggested dimensions and measures for a given subject of analysis. With this user-friendly query builder we free the user from the cumbersome task of browsing, exploring and building queries in specific languages to find interesting analytical patterns in large ld sets. Moreover, the tool is able to translate the user's graphical query to SPARQL, execute it and display the results using different charts and diagrams.

We summarize our contributions as follows:

  • We define the concept of md analytical star as a mapping of the md model to ld. That is, we identify the subject of analysis, dimensions and measures that compose a md analytical star in ld.

  • We introduce the notion of aggregation power for dimensions and measures, and estimate this score in order to filter candidate dimensions and measures accordingly.

  • We have developed a user-friendly web-based tool that allows users to query and visualize results from ld sets by dynamically building md analytical stars.

The structure of the paper is as follows. In Sect. 2 we review the literature related to the problem of analyzing ld. Section 3 presents the main foundations that underlie our approach. In Sect. 4 we present a model for md analytical stars over ld sources. Section 5 summarizes how dimensions and measures are calculated from ld. Section 6 explains the prototype implemented and shows a running example. Finally, Sect. 7 gives some conclusions and future work.

2 Related Work

We have performed a thorough review of the related literature and found that the majority of approaches support only querying, exploration and light-weight analytics over ld.

For querying ld, SPARQL has become the de-facto standard. However, directly querying a dataset through a SPARQL interface cannot be considered an end-user task, as it requires familiarity with both the SPARQL syntax and the structure of the underlying data. Graph-based query builders such as [3] can help users build triple patterns by using auto-completion to express queries. However, users do not always have explicit queries upfront, but need to explore the available data first in order to find out what information might be interesting to them. Sgvizler allows rendering the results of SPARQL queries as charts, maps, etc. However, it requires SPARQL knowledge and focuses only on the visualization part.

The review in [5] on visualization and exploration of ld concludes that most of the tools are designed for technical users and do not provide an overview or summary of the data.

ld browsers such as [2, 4, 19] are designed to display one entity at a time and do not support the user in aggregation tasks. Most of them use faceted filtering to better guide the user in exploration tasks. However, the user gets an overview of only a small part of the dataset. On the other hand, browsers such as [17, 18] provide a more powerful browsing environment, but are tailored to a specific application.

Graph-based tools such as RDF-Gravity, IsaViz or RelFinder [8] provide node-link visualizations of the datasets and the relationships between them. Although this approach can provide a better understanding of the data structure, graph visualization does not scale well to large datasets.

The CODE Query Wizard and Vis Wizard, developed under the CODE project, form a web-based visual analytics platform that enables non-expert users to easily perform exploration and lightweight analytic tasks on ld. Still, the user has to browse the data to find interesting analytical queries. Payola [12] is a framework that allows expert users to access a SPARQL endpoint, perform analyses using SPARQL queries and visualize the results using a library of visualizers.

We claim that existing tools for the exploration and analysis of ld provide little or no support for summaries that would give the user an idea of the structure of the dataset and of the parts that seem most interesting for analysis. Along this line, we have also looked into approaches that produce graph summaries over ld using techniques such as bisimulation and clustering [1, 10]. However, these graph summaries are produced without an analytical focus and therefore may not be useful for analysis purposes.

Recently, there have been some attempts to analyze ld that go beyond querying and browsing. [14] proposes md analysis over ld under the OWL formalism. However, most ld sets lack this semantic layer. Other approaches [6, 9] have proposed md analysis over ld relying on a previous manual annotation of the md elements (dimensions and measures) and on previously defined md vocabularies.

3 Background

In this section, we review the background concepts on which our approach is based.

3.1 Linked Data

ld is a set of common practices and general rules for contributing to the Web of Data [7]. The basic principles are that each entity should be assigned a unique uri identifier, the identifiers should be dereferenceable via http, and the entity representations should be interlinked to form a global ld cloud.

The most widely adopted standard to implement the Web of Data is RDF [13], which allows us to make statements about entities. It models data as triples with three components: subject, predicate and object. We consider only valid RDF triples over URIs (U), blank nodes (B) and literals (L). A set of triples can also be viewed as a graph, where vertices correspond to subjects and objects, while labeled edges represent the triples themselves. SPARQL [16] has become the standard for querying RDF data and is based on the specification of triple patterns.
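For illustration, the following minimal SPARQL query consists of a basic graph pattern of two triple patterns; the ex: namespace and its terms are purely hypothetical:

  PREFIX ex: <http://example.org/>

  # Select every entity located in something that is part of ex:Europe
  SELECT ?s ?region
  WHERE {
    ?s ex:locatedIn ?region .
    ?region ex:partOf ex:Europe .
  }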

In RDF there is no technical distinction between schema and instance data, even though it provides terminology to express class membership or categorization (rdf:type). RDFS extends RDF with vocabulary to create taxonomies of classes and properties and to set the domain and range of properties. OWL extends RDFS and allows for expressing further schema definitions in RDF. The formal semantics of RDFS and OWL enrich RDF with implicit information that can be derived by reasoning. Throughout the paper, we refer both to explicit triples and to implicit triples derived using some reasoning mechanism. We follow the OWL naming convention, referring to classes, properties and individuals, to homogenize terminology.

3.2 Multidimensional Models

The md model is the conceptual abstraction most widely used in bi. Observations or facts are analyzed in terms of dimensions and measures [11]. Facts focus on a subject of analysis (e.g., sales) and are described by a series of dimensions or analysis perspectives (e.g., location, time, product), which provide contextual information; they are aggregated in terms of a series of measures (e.g., average sales). As a result, analysts are able to explore and query the resulting data cube by applying olap operations. A typical query would be to display the evolution of the sales of personal care products during the current year by city.

bi has traditionally been applied to internal, corporate and structured data, which is extracted, transformed and loaded into a pre-defined and static md model. The relational implementation of the md data model is typically a star schema, where the fact table containing the summarized data is in the center and is connected to the different dimension tables by means of functional relations.

4 MD Analytical Stars

In this section we explain how we model md analytical stars from ld.

We formalize the representation of an RDF graph using graph notation.

Definition 1

(RDF graph) An RDF graph G is a labeled directed graph \(G=\langle V, E, \lambda \rangle \) where:

  • V is the set of nodes, let \(V^0\) denote the nodes in V having no outgoing edge, and let \(V^{>0} = V \backslash V^0\);

  • \(E \subseteq V \times V\) is the set of directed edges;

  • \(\lambda : V \cup E \rightarrow U \cup B \cup L\) is a labeling function such that \(\lambda _{|V}\) is injective, with \(\lambda _{|V^0} : V^0 \rightarrow U \cup B \cup L\) and \(\lambda _{|V^{>0}} : V^{>0} \rightarrow U \cup B\), and \(\lambda _{|E} : E \rightarrow U\).

Typical analyses usually involve investigating a set of particular facts according to relevant criteria (dimensions) and measurable attributes (measures). Here, we use the notion of basic graph pattern (bgp) queries, a well-known subset of SPARQL. A bgp is a set of triple patterns, where each triple pattern has a subject, predicate and object, some of which can be variables. We are especially interested in rooted bgp queries, as they resemble the star-shaped pattern typical of md analysis.

Definition 2

(Rooted query) Let q be a bgp query, \(G=\langle V, E, \lambda \rangle \) its graph and \(v \in V\) a node that is a variable in q. The query q is rooted in v iff G is a connected graph and any other node \(v' \in V\) is reachable from v following the directed edges in E.

Example 1

(Rooted query) The query q below is a rooted bgp query with \(x_1\) as its root node.

$$\begin{aligned} q(x_1, x_2, x_3, x_4) \text{ :- } \;&x_1 \text{ Annual\_Carbonemissions\_kg } x_3, \\&x_1 \text{ Country } x_2, \\&x_1 \text{ Fuel\_type } x_4 \end{aligned}$$

The query’s graph representation below shows that every node is reachable from the root \(x_1\).

(Figure: graph of q, rooted in \(x_1\), with outgoing edges labeled Annual_Carbonemissions_kg, Country and Fuel_type leading to \(x_3\), \(x_2\) and \(x_4\).)

Even though rooted queries express data patterns by means of predicate chains, these patterns are still vague, as the variable nodes can match any element in \(U \cup B \cup L\). To narrow down their scope, we define the notion of typified rooted queries as follows:

Definition 3

(Typified rooted query) A typified rooted query \(q'\) is a rooted query with graph \(G=\langle V, E, \lambda \rangle \) where each variable node \(v_x \in V\) has an associated class or datatype. That is, each variable \(v_x\) has an outgoing edge \((v_x, v_y)\) such that \(\lambda ((v_x, v_y)) =\) rdf:type and \(\lambda (v_y) \in U\) and \(v_y\) has an outgoing edge \((v_y, v_z)\) such that \(\lambda ((v_y, v_z)) =\) rdf:type and \(\lambda (v_z) \in \) {rdfs:Class, rdfs:Datatype}.

Example 2

(Typified rooted query) The previous query q can be typified as follows:

$$\begin{aligned} q(x_1,&x_2, x_3, x_4) \text{ :- } x_1 \text{ rdf:type Powerplant}, \; \text{Powerplant rdf:type rdfs:Class}, \\&x_1 \text{ Country } x_2, \; x_2 \text{ rdf:type Country}, \; \text{Country rdf:type rdfs:Class}, \\&x_1 \text{ Annual\_Carbonemissions\_kg } x_3, \\&x_3 \text{ rdf:type xsd:float}, \; \text{xsd:float rdf:type rdfs:Datatype}, \\&x_1 \text{ Fuel\_type } x_4, \; x_4 \text{ rdf:type Fuel}, \; \text{Fuel rdf:type rdfs:Class} \end{aligned}$$

From now on, we omit the type edges and represent typified rooted queries with the (data)type’s name in the variable node.

(Figure: compact representation of the typified rooted query, with the type names Powerplant, Country, xsd:float and Fuel shown in the variable nodes.)
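In SPARQL terms, the typified rooted query of Example 2 corresponds to a basic graph pattern along the following lines (a sketch: the ex: prefix and the exact property URIs are assumed for illustration; the datatype of \(x_3\) is checked with a FILTER):

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
  PREFIX ex:  <http://example.org/enipedia/>

  SELECT ?x1 ?x2 ?x3 ?x4
  WHERE {
    ?x1 rdf:type ex:Powerplant .
    ?x1 ex:Country ?x2 .
    ?x2 rdf:type ex:Country .
    ?x1 ex:Annual_Carbonemissions_kg ?x3 .
    FILTER(datatype(?x3) = xsd:float)    # x3 is typified by a datatype
    ?x1 ex:Fuel_type ?x4 .
    ?x4 rdf:type ex:Fuel .
  }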

It is immediate to see that a typified rooted query is composed of a set of typified paths that go from the root to a sink node (a node with no outgoing edges). The root node represents a class, and each sink node represents either a class or a datatype. We formalize this notion next:

Definition 4

(Typified path) Given a typified rooted query q with root node \(x_1\) and graph \(G=\langle V, E, \lambda \rangle \), a typified path is a sequence \(p = c_1 - r_1 - c_2 - r_2 - ... - r_{n-1} - c_f\) where \(\lambda (x_1) = c_1\) is the root class, \(c_f\) is a sink class or datatype, each \(r_i\) is a property and each intermediate \(c_i\) is a class.

Example 3

(Typified path) In the previous query q we can identify the following typified paths:

(Powerplant, Country, Country)

(Powerplant, Annual_Carbonemissions_kg, float)

(Powerplant, Fuel_type, Fuel)

In md modeling, the many-to-one relation between facts and dimensions is important to ensure aggregation power. That is, each fact must be associated with one dimension value, whereas a dimension value can and should be associated with multiple facts. In our ld scenario, we define the aggregation power of a typified path as follows:

Definition 5

(Aggregation power) Given a typified path \(p = c_1 - r_1 - c_2 - r_2 - ... - r_{n-1} - c_f\), the aggregation power is calculated as the ratio between the number of different individuals (or literals) of the sink class (or datatype) \(c_f\) that satisfy the path and the number of individuals of the root class \(c_1\) that satisfy the path.

Example 4

(Aggregation power) Given the following paths, we calculate the aggregation power as follows:

(Powerplant\(_{(74561)}\), Country, Country\(_{(200)}\)) \(\rightarrow \) \(\dfrac{200}{74561}=0.0027\)

(Powerplant\(_{(19796)}\), Fuel_type, Fuel\(_{(30)}\)) \(\rightarrow \) \(\dfrac{30}{19796}=0.0015\)

Notice that an aggregation power close to 0 indicates a high aggregation capacity, meaning that the sink individuals act as categories for the root individuals of the path, whereas an aggregation power close to 1 indicates a low aggregation capacity.
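For illustration, the aggregation power of a typified path can be estimated directly over the endpoint with a single aggregate query. The sketch below computes it for the path (Powerplant, Country, Country), assuming a hypothetical ex: prefix for the dataset vocabulary:

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX ex:  <http://example.org/enipedia/>

  # Ratio of distinct sink individuals to distinct root individuals
  SELECT ((COUNT(DISTINCT ?sink) / COUNT(DISTINCT ?root)) AS ?aggPower)
  WHERE {
    ?root rdf:type ex:Powerplant .
    ?root ex:Country ?sink .
    ?sink rdf:type ex:Country .
  }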

We are now ready to introduce md analytical stars. To this end, we make use of traditional data warehousing terminology. We use the notion of classifier to denote the level of data aggregation; that is, the classifier defines the dimensions according to which the facts will be analyzed. The measure provides the values to be aggregated by means of aggregation functions.

Definition 6

(md analytical star) Given an RDF graph \(G=\langle V, E, \lambda \rangle \), a md analytical star rooted in the node \(x \in V\) is a triple: \(S=\langle c(x, d_1, ..., d_n), m(x, v), \bigoplus \rangle \) where:

  • \(c(x, d_1, ..., d_n)\) is a typified query rooted in the node \(r_c\) of its graph \(G_c\), with \(\lambda (r_c)=x\) and each path \(x - ... - d_i\) is a typified path. This is the classifier of x w.r.t. the n dimensions \(d_1, ..., d_n\). The node x is the subject of analysis.

  • m(x, v) is a typified query rooted in the node \(r_m\) of its graph \(G_m\), with \(\lambda (r_m)=x\). This query is composed of a single typified path \(x - ... - v\). This is called the measure of x.

  • \(\bigoplus \) is an aggregation function over a set of values, that is, the aggregator for the measure of x w.r.t. its classifier.

  • Each of the typified paths of the classifier has an aggregation power below a threshold \(\delta \).

Notice that typified rooted queries (and therefore typified paths) are the building blocks used to suggest md analytical stars.

Example 5

(md analytical star) The md analytical star below asks for the average annual carbon emissions of powerplants, classified by country and fuel type.

\(\langle c(x, x_1, x_2), m(x, x_4), average\rangle \)

where the classifier and measure queries are:

$$\begin{aligned} c(x, x_1, x_2) \; \text{:-} \;&x \text{ rdf:type Powerplant}, \\&x \text{ Country } x_1, \; x_1 \text{ rdf:type Country}, \\&x \text{ Fuel\_type } x_2, \; x_2 \text{ rdf:type Fuel} \\ m(x, x_4) \; \text{:-} \;&x \text{ rdf:type Powerplant}, \\&x \text{ Annual\_Carbonemissions\_kg } x_4, \; x_4 \text{ rdf:type xsd:float} \end{aligned}$$

The answer to an md analytical star is a set of tuples of dimension values found in the answer of the classifier query, together with the aggregated result of the measure query. Therefore, it can be represented as a cube of n dimensions, where each cell contains the aggregated measure.
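For illustration, a possible SPARQL translation of this star, in the spirit of the queries generated by the prototype presented in Sect. 6, is sketched below; the ex: prefix and property URIs are assumed for the example:

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX ex:  <http://example.org/enipedia/>

  # Classifier variables become GROUP BY keys; the measure is aggregated
  SELECT ?country ?fuel (AVG(?emissions) AS ?avgEmissions)
  WHERE {
    ?x rdf:type ex:Powerplant .
    ?x ex:Country ?country .
    ?country rdf:type ex:Country .
    ?x ex:Fuel_type ?fuel .
    ?fuel rdf:type ex:Fuel .
    ?x ex:Annual_Carbonemissions_kg ?emissions .
  }
  GROUP BY ?country ?fuel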

5 Calculating Dimensions and Measures

md analytical stars are composed of the classifier and measure typified queries, rooted in a potential class acting as the subject of analysis. These typified queries are in turn composed of typified paths with a certain aggregation power. Due to space restrictions, we omit the process of calculating the set of typified paths from an ld dataset. The method is based on probabilistic graphical models and makes use of statistics about the instance data to generate the paths. We refer the reader to Sect. 5 of [15] for the details and to Sect. 6 of that work for the experimental evaluation. As a result, a set of typified paths with an aggregation power below a threshold \(\delta \) is obtained.

Paths are classified into dimensions and measures based on the aggregation power of the path and the type of its sink node. Paths ending in numeric datatypes (e.g., xsd:integer, xsd:float, xsd:double) and with low aggregation power are considered measures, whereas the remaining paths (i.e., paths ending in classes, or in datatypes with high aggregation power) are considered dimensions.
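For illustration, whether a candidate path yields numeric values, a necessary condition to treat it as a measure, can be checked with an ASK query over the instance data; a sketch under the same hypothetical ex: prefix as before:

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
  PREFIX ex:  <http://example.org/enipedia/>

  # True if the path (Powerplant, Annual_Carbonemissions_kg, ?) reaches
  # at least one literal with a numeric datatype
  ASK {
    ?x rdf:type ex:Powerplant .
    ?x ex:Annual_Carbonemissions_kg ?v .
    FILTER(datatype(?v) IN (xsd:integer, xsd:decimal, xsd:float, xsd:double))
  }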

The set of calculated dimensions and measures from a ld dataset is used by the prototype tool to help the user build md analytical stars. In the next section, we show the implemented prototype along with a running example.

6 Prototype Tool

The current prototype for building and executing md analytical stars over ld sets has been implemented as a client/server web application, whose architecture is shown in Fig. 1. We distinguish the client, the server, an external ld endpoint, and the catalogue of calculated dimensions and measures. The client is built with ajax and html5/css3. The server is implemented as a restful api in php and has three main tasks: (1) handling user requests about the dimensions and measures of the catalogue, (2) acting as a proxy that sends queries to the ld endpoint, and (3) processing the results returned by the endpoint. The external ld endpoint is a SPARQL endpoint that we access to execute the md queries. The catalogue holds the application data, that is, the dimensions and measures calculated for the ld set; we have implemented it as an independent restful web service to make it portable.

This prototype helps the user build md analytical stars by suggesting possible dimensions and measures for a specific subject of analysis (steps 1–4). Moreover, it is fully functional: the queries graphically built by the user are automatically translated into SPARQL queries over the dataset endpoint, and the results can be either displayed or exported in different formats (steps 5–7).

Fig. 1. Prototype architecture.

For demonstration purposes, we have selected the Enipedia ld dataset, an initiative that provides a collaborative environment, based on wikis and Semantic Web technologies, for energy industry issues. The dataset offers energy-related data from different open data sources, structured and linked in RDF, and contains around 5M triples.

Next, we walk through the functionality of the prototype. The home page is shown in Fig. 2. The left navigation menu shows the different steps to perform a md query: (1) show the most interesting subjects of analysis, (2) select one, (3) select measures and dimensions and (4) show the results. We will go through all of them by means of a running example.

On the home page, the user can either ask for the n most interesting subjects of analysis, if (s)he has no knowledge of the dataset, or directly type the name of a class, aided by autocompletion. Interestingness is measured by the number of individuals of the class. In Fig. 2 we select powerplant as the subject of analysis, as we are interested in analyzing the annual carbon emission rates of powerplants from different perspectives.
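Since interestingness is measured by the number of individuals, such a ranking can be computed at the endpoint with a simple aggregate query; a minimal sketch:

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

  # Rank candidate subjects of analysis by number of individuals
  SELECT ?class (COUNT(?ind) AS ?numIndividuals)
  WHERE { ?ind rdf:type ?class . }
  GROUP BY ?class
  ORDER BY DESC(?numIndividuals)
  LIMIT 10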

Fig. 2. Home page. The user selects to display the n most interesting subjects of analysis or types one.

The next step consists in selecting measures and dimensions for the subject of analysis. These are suggested by the prototype, and the user only has to select the required ones. Figure 3 shows the panel where all available measures for the subject powerplant are displayed; the user can select one or more of them by ticking the corresponding box and choosing an appropriate aggregation function. In this case, we select Annual_Carbonemissions2000_kg with average as the aggregation function.

Fig. 3. Measures selection with aggregation function.

The process of selecting the dimensions is shown in Figs. 4 and 5. The left panel of Fig. 4 shows the possible dimensions. In this case, the user selects two dimensions, Country and Fuel, meaning that the intended analysis displays the average carbon emission levels of powerplants by country and by the type of fuel of each powerplant. The right panel serves the purpose of disambiguating a selected dimension. Recall that a dimension is defined by a typified path from the subject of analysis to a sink class. Selecting only the sink class as dimension can be ambiguous, since a sink class may be reached through different paths, and the properties along each path may give different semantics to the dimension. Therefore, the right panel displays all the paths that lead to the selected dimension so that the user can disambiguate. In the case of Country there is no ambiguity; for Fuel, the user selects the path Fuel_type.

Fig. 4. Dimensions selection and optional filtering.

The prototype also offers dimension filtering. This feature, shown in Fig. 5, can be enabled with the button next to each dimension. When this option is activated, a new panel appears (the right panel) displaying the individuals that belong to the selected dimension. In this case, the user wants to filter the values of the dimension Country and selects only specific countries for the analysis.
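In the generated SPARQL, such a value filter can be expressed, for instance, with a VALUES clause that restricts the dimension variable; a sketch with hypothetical country URIs:

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX ex:  <http://example.org/enipedia/>

  SELECT ?country ?fuel (AVG(?emissions) AS ?avgEmissions)
  WHERE {
    # Restrict the Country dimension to the user-selected individuals
    VALUES ?country { ex:Spain ex:Germany ex:Netherlands }
    ?x rdf:type ex:Powerplant .
    ?x ex:Country ?country .
    ?x ex:Fuel_type ?fuel .
    ?x ex:Annual_Carbonemissions_kg ?emissions .
  }
  GROUP BY ?country ?fuel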

Fig. 5. Filter dimensions by value.

Fig. 6. md query results.

Finally, Fig. 6 shows the results of the md query. These can be displayed in different formats and using different visualizations or exported to a file.

7 Conclusions and Future Work

In this paper we have presented the foundations and a prototype tool for ld analysis and visualization following a md approach. We have proposed md analytical stars as the foundation that enables the user to easily query ld sets. The md patterns (i.e., dimensions and measures) suggested by the developed web-based tool to create the stars are based on the semantics of the data (i.e., they provide a conceptual summary of the data), follow the md model (i.e., information is modeled in terms of analysis dimensions and measures) and are extracted following a statistical approach.

As future work, it would be interesting to group semantically similar dimensions into categories to offer a cleaner view to the user. Another interesting feature would be dimension hierarchies, which would enable the classical roll-up/drill-down OLAP operations over the results. In the near future we will also address the current limitation that dimensions and measures are calculated in an off-line batch process: we aim to apply query sampling methods directly over the ld endpoint to calculate dimensions and measures dynamically, and this process will be seamlessly integrated in the web-based tool.