
2.1 Introduction

Ever since researchers started talking about “Big Data,” they have mentioned troubles related to it, alongside its benefits [1]. These troubles often arise from the sheer volume of data but can also involve difficulties in processing, integrating, and analyzing the data. Indeed, an insightful definition of “Big Data” states that a data set can be called Big Data if it is formidable to capture, curate, analyze, and visualize it using current technologies [2].

However, these troubles sound familiar to computer scientists who have been exposed to statistical databases. Despite progress in data integration [3], the use of an integrated business intelligence platform often requires a nontrivial process of identifying data sources, deciphering the meaning of their data, harmonizing the data, and then loading it into a system that can be used to analyze it. This process is known as ETL (Extract-Transform-Load) [4], and in business intelligence the resulting data often ends up in data warehouse servers and, when needed by an analyst, in online analytical processing (OLAP) cubes [5].

But in many organizations, ETL skills are in short supply, and incorporating a new ETL process into a production environment is lengthy and arduous. So what happens to the data sources that we find potentially interesting, yet cannot include in our current ETL processes? In the mid-2000s, the concept of dataspaces was introduced to describe a situation where there is “some identifiable scope and control across the data and underlying systems, and hence one can identify a space of data, which, if managed in a principled way, will offer significant benefits to the organization” [6]. Moreover, a dataspace support platform (DSSP) provides services for managing such collections of data. Specifically, a DSSP should provide services that help to identify sources in a dataspace and interrelate them, offering basic query mechanisms over them, including the ability to introspect about their contents [3].

In this paper, we present a DSSP specifically designed for data integration using Microsoft Excel data models [7]. The design of the DSSP is workflow-oriented: the user uploads the data sets into the DSSP and describes the semantics of the data by filling in a form provided by the DSSP. This metadata can be utilized when the data is imported into the user’s data model. We specifically address the problem known as summarizability [8, 9] by including in the DSSP metadata the information needed to determine summarizability in OLAP. Moreover, we describe a process for utilizing the metadata when designing Excel data models using the DAX (Data Analysis Expressions) language [10] and Excel’s graphical data model designer.

Our aim is to build and support a dataspace where the data is used within an organization (such as a university). This approach corresponds to Halevy et al.’s use case of a dataspace for scientific data management [3]. In other related research, Dittrich [11] presents iMeMex, a comprehensive personal data search and management system that is unfortunately no longer functional. Mirza et al. discuss the practicality of a dataspace system from a user’s point of view, mentioning specifically the challenge of combining online and local data sources [12]. Moilanen et al. [13] demonstrate a harmonization platform for XML, but their system is not well suited to large data sets. Niinimaki and Niemi [14] demonstrate an ETL process with RDF/XML data but do not discuss how the RDF ontologies and instances are built and populated. In contrast to these approaches, we emphasize the dataspace as a repository of large data sets, support the design of data models, and verify the data models against the dataspace. The design of our platform supports the workflow described above.

In our use case, we use real data comprising tens of millions of observations, demonstrate real analysis cases using the data model, and evaluate the performance of our system.

The rest of the paper is organized as follows. In Sect. 2.2, we describe the data that we plan to analyze. This serves as a generalizable basis for Sect. 2.3, which describes the design of our DSSP, and in Sect. 2.4 we study the system’s performance. Finally, Sect. 2.5 contains a summary, research questions for future work, and conclusions.

2.2 Data and Summarizability

Our sample analysis concerns the trade of electronic goods among various countries, and the data for the analysis comes from many sources. The most important data sets in the analysis are import–export data from UN COMTRADE (the international trade statistics database),Footnote 1 HS (Harmonized System) trade group descriptions from the same source, and per-country GDP per capita figures from the CIA World Factbook.Footnote 2 Moreover, we have written a program that translates country names used in the CIA World Factbook and Wikipedia into ISO3 country codes, and a program to download tables from Wikipedia in the CSV (comma-separated values) format.
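Neither helper program is listed in the paper; the following minimal Python sketch only illustrates what they could look like. The alias table, the pandas.read_html call, and the example values are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch (not the authors' actual programs): normalize country names
# to ISO3 codes and dump a Wikipedia table to CSV. The alias table and the
# example usage below are illustrative placeholders.
from typing import Optional
import pandas as pd

# A tiny alias table; a real mapping would cover all spellings used by the
# CIA World Factbook and Wikipedia.
NAME_TO_ISO3 = {
    "United States": "USA",
    "Korea, South": "KOR",
    "South Korea": "KOR",
    "Viet Nam": "VNM",
    "Vietnam": "VNM",
}

def to_iso3(name: str) -> Optional[str]:
    """Return the ISO3 code for a country name, or None if the name is unknown."""
    return NAME_TO_ISO3.get(name.strip())

def wikipedia_table_to_csv(url: str, table_index: int, out_path: str) -> None:
    """Download one HTML table from a Wikipedia page and save it as CSV."""
    tables = pd.read_html(url)                 # returns a list of DataFrames
    tables[table_index].to_csv(out_path, index=False)

if __name__ == "__main__":
    print(to_iso3("South Korea"))              # -> KOR
```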

All the data sets deal explicitly with countries on a yearly basis; thus a potential integration of the data is not especially hard. Yet we need to ensure that roles such as exporter and importer are expressed sufficiently well. The main data sets, as extracted, are shown in Fig. 2.1. The export-of-commodities set covers the years 2000–2015 and has about 94 million lines. As a demonstration of data analysis after data integration, we have selected the following scenario: a company interested in producing electronic goods is seeking to expand outside high-income countries, and an analyst is tasked to find out which low- and middle-income countries had the largest increases in electronics exports in recent years (2000–2015).

Fig. 2.1 Data set descriptions

In a multidimensional data model, there is a set of numeric measures that are the objects of analysis. Each of the numeric measures depends on a set of dimensions, which provide the context for the measure [5]. An implementation of this model is called a dimensional database or a cube. Figure 2.2 shows the data and relations prepared for a cube. Dimensions in a cube can be hierarchical, such as product–product group. The entire hierarchy is called a dimension, and product and product group are called its levels.
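To make this vocabulary concrete, the following small pandas sketch uses hypothetical rows and column names (not the real data): tradevalue acts as the measure, while year, exporter, and the product–product group hierarchy act as dimensions and levels.

```python
import pandas as pd

# Four illustrative fact rows: tradevalue is the measure; year, exporter and
# product give the context; product -> productgroup is a two-level hierarchy.
rows = pd.DataFrame({
    "year":         [2014, 2014, 2015, 2015],
    "exporter":     ["VNM", "VNM", "MEX", "MEX"],
    "product":      ["850890", "851712", "850890", "851712"],
    "productgroup": ["85", "85", "85", "85"],
    "tradevalue":   [120.0, 300.0, 150.0, 420.0],
})

# Aggregation at the finer product level ...
by_product = rows.groupby(["year", "exporter", "product"])["tradevalue"].sum()
# ... and rolled up to the coarser productgroup level of the same dimension.
by_group = rows.groupby(["year", "exporter", "productgroup"])["tradevalue"].sum()
print(by_group)
```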

Fig. 2.2 Relations in the model

Within Excel, columns of data tables can be connected with a relation if they are compatible. Here we use simple technical criteria for the compatibility of two columns: at least one of the columns must have unique values, and both columns must share some values. The connected columns in Fig. 2.2 are reporter and partner in export-of-commodities-2000s to countrycode’s ISO3 code and to cia_factbook’s country, and productgroupcode to prodgroup’s prodgroupcode.
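The compatibility criterion itself is straightforward; a minimal Python sketch of it (assuming the two columns are available as pandas Series, outside Excel) could look as follows. The example values are illustrative.

```python
import pandas as pd

def columns_compatible(a: pd.Series, b: pd.Series) -> bool:
    """At least one column has unique values, and the columns share some values."""
    one_side_unique = a.is_unique or b.is_unique
    share_values = not set(a.dropna()).isdisjoint(set(b.dropna()))
    return one_side_unique and share_values

# Example: reporter codes in the fact table against the country dimension's key.
reporter = pd.Series(["FIN", "FIN", "SWE", "VNM"])
iso3_key = pd.Series(["FIN", "SWE", "VNM", "MEX"])   # unique key column
print(columns_compatible(reporter, iso3_key))        # -> True
```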

Summarizability can be informally defined as “correctness of aggregate data with respect to individual observations.” In practice, this means that when a user queries a cube, the query should be inspected in such a way that the query’s result cannot violate the conditions of summarizability. Niemi et al. [9] formulate the conditions as follows: (1) the aggregation operation (usually sum, mean, or count) is appropriate for the measure, and (2) the measure is appropriate for the aggregation levels in the cube’s dimensions. We are specifically interested in:

  • The statistical scales of measure variables (nominal, ordinal, interval, and ratio; see [15]), since they affect which aggregation operations can be applied to the data.

  • The “eventness” type of the measure variable. Following [9], we classify measures as (1) tally measures, which are intrinsic information about a specific event, like the quantity sold in sales data; (2) reckoning measures, like inventory levels; and (3) snapshot measures, which are indirect measurements based on the data at hand, like currency exchange rates.

With our dataspace system, we help the analyst detect problems with summarizability by collecting characteristics of the data, such as the statistical scales and eventness of measures. Our ultimate goal is to construct a system that automatically detects summarizability problems in pivot tables created by data analysts.Footnote 3 The design of the system at hand is less ambitious: the dataspace system lets us identify potential measures and dimensions, and which dimensions are compatible. The compatibility of the data within Excel is then checked by a macro that gets the information from the dataspace system. Details of the design of the dataspace system and its integration with Excel are given in the next section.
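As a simplified illustration of condition (1), the following sketch maps statistical scales to the aggregation operations usually considered appropriate for them. The rule table is a conventional measurement-theory simplification, not the exact rule set of our verifier.

```python
# Which aggregation operations are appropriate for which statistical scale
# (a simplified, illustrative rule table).
ALLOWED_AGGREGATIONS = {
    "nominal":  {"count"},
    "ordinal":  {"count", "min", "max"},
    "interval": {"count", "min", "max", "mean"},
    "ratio":    {"count", "min", "max", "mean", "sum"},
}

def aggregation_ok(scale: str, operation: str) -> bool:
    """Check condition (1): the operation is appropriate for the measure's scale."""
    return operation in ALLOWED_AGGREGATIONS.get(scale, set())

print(aggregation_ok("ratio", "sum"))     # True: trade values may be summed
print(aggregation_ok("interval", "sum"))  # False: e.g. temperatures should not be summed
```

A complete check would also use the eventness type; for example, summing a reckoning measure such as an inventory level over the time dimension is typically not meaningful.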

2.3 System Design

The goal of the design is to provide a catalogue of data sets combined with their metadata. Users upload data sets of interest into the catalogue. After each upload, the catalogue software lets the user specify a description of the data set. The data set is expected to have recognizable fields that represent potential measures or (levels of) dimensions. For each measure, the user records its statistical scale, “eventness,” and unit of measurement. For each dimension, the user records a description and a set of fields in other data sets with which the dimension is compatible. The compatibility of fields is stored in XML format in a file (compatibility.xml) accessible over HTTP. An interface for entering the metadata (and field compatibility data) after a file upload is shown in Fig. 2.3.
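The paper does not fix a schema for compatibility.xml; the sketch below assumes a simple layout of pair elements only to illustrate how a client can fetch and index the compatible-field pairs over HTTP. The URL is a placeholder.

```python
from urllib.request import urlopen
import xml.etree.ElementTree as ET

def load_compatibilities(url):
    """Read field-compatibility pairs from compatibility.xml over HTTP.

    Assumed layout (for illustration only):
      <compatibilities>
        <pair a="export-of-commodities-2000s/reporter" b="countrycode/ISO3 code"/>
        ...
      </compatibilities>
    """
    root = ET.parse(urlopen(url)).getroot()
    pairs = set()
    for pair in root.findall("pair"):
        a, b = pair.get("a"), pair.get("b")
        pairs.add((a, b))
        pairs.add((b, a))      # compatibility is symmetric
    return pairs

# compat = load_compatibilities("http://dataspace.example.org/compatibility.xml")
```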

Fig. 2.3 Metadata editor

The uploaded data sets with their metadata are presented using the web interface shown in Fig. 2.4.

Fig. 2.4 Some uploaded data sets with metadata

The dataspace management application is a Python program that utilizes the Flask web programming framework [16].
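The actual application is available at the URL given in Sect. 2.5; the following is only a minimal Flask sketch of the upload-then-describe workflow, with hypothetical route names, file paths, and metadata format.

```python
import json, os
from flask import Flask, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
UPLOAD_DIR = "uploads"
os.makedirs(UPLOAD_DIR, exist_ok=True)

@app.route("/upload", methods=["POST"])
def upload():
    """Store an uploaded data set file."""
    f = request.files["dataset"]
    name = secure_filename(f.filename)
    f.save(os.path.join(UPLOAD_DIR, name))
    return {"status": "stored", "name": name}

@app.route("/describe/<name>", methods=["POST"])
def describe(name):
    """Attach user-supplied metadata (scale, eventness, unit, compatible fields)."""
    meta_path = os.path.join(UPLOAD_DIR, secure_filename(name) + ".meta.json")
    with open(meta_path, "w") as out:
        json.dump(request.get_json(), out)
    return {"status": "described", "name": name}

if __name__ == "__main__":
    app.run(debug=True)
```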

In our case study, a company is interested in expanding its business as described in the introduction: among low- and middle-income countries, which ones have grown their electronics exports most during 2000–2015? The following data sets are thus selected using the interface shown in Fig. 2.4: export-of-commodities-2000s, prodgroup, cia_factbook_countries_by_gdppp, and countrynames_iso3. Assisted by the user interface (Fig. 2.4) or by using the DAX language, the data analyst imports the data into Microsoft Excel’s data model [17] and creates the relations within the data, as shown in Fig. 2.2.

We can now verify some basic aspects of the data models using our verifier macro. We assume that the user has imported the data sets from the dataspace system as Excel tables, either as tables within Excel sheets or in the data model. The relations within the data can be expressed either using DAX “related” expressions or using a graphical tool.Footnote 4 The verifier macro has two data sources:

  • Compatibility of dimension fields as recorded in the compatibility.xml file (see above)

  • The relations of tables accessible by the Excel VBA API as “primary table name/primary column name” and “foreign table name/foreign column name”

By comparing the data set and column names in each of the sources, the macro reports whether the fields are compatible.Footnote 5 If the macro discovers incompatibilities, it displays a message as in Fig. 2.5.

Fig. 2.5 Excel macro compatibility warning
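The verifier itself is an Excel VBA macro; the following Python sketch is only an analogue of its comparison step, with hypothetical table/column identifiers, showing how relations read from the workbook can be checked against the pairs recorded in compatibility.xml.

```python
def check_relations(relations, compat_pairs):
    """Return relations that have no matching entry among the compatible-field pairs."""
    problems = []
    for primary, foreign in relations:
        if (primary, foreign) not in compat_pairs and (foreign, primary) not in compat_pairs:
            problems.append((primary, foreign))
    return problems

# Relations as read from the workbook: "table/column" on the primary and foreign side.
relations = [
    ("countrycode/ISO3 code", "export-of-commodities-2000s/reporter"),
    ("prodgroup/prodgroupcode", "export-of-commodities-2000s/productgroupcode"),
]
# Pairs recorded in compatibility.xml (here only the first relation is covered).
compat_pairs = {("countrycode/ISO3 code", "export-of-commodities-2000s/reporter")}
print(check_relations(relations, compat_pairs))   # the second relation is flagged
```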

In our analysis case, a few adjustments to the data need to be made before the findings can be presented. First, a dummy field “is_low_or_mid_income” is introduced into the cia_factbook_countries_by_gdppp data, since it will be used as a filter in the pivot table. Second, a product group code is added to the export-of-commodities-2000s data. The product group is simply the first 2 digits of the trade product’s HS code (85 for electrical, electronic equipment), whereas the “prod” field in the export-of-commodities-2000s data is 6-digit, for instance 850890 for “parts, hand tools with self-contained electric motor.”Footnote 6 Figure 2.6 shows the design of the pivot table, in which we apply the filters “exporter only low- and middle-income countries” and “product group only electrical, electronic equipment.” Moreover, the figure demonstrates the result of the analysis.

Fig. 2.6 Pivot table (left) and analysis based on the data model (right)
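For illustration, the same two adjustments could be expressed as follows with pandas on the CSV extracts. The file names, the gdp_per_capita column, and the income cutoff are assumptions; in the published workflow these steps are done inside Excel’s data model.

```python
import pandas as pd

GDP_THRESHOLD = 25_000   # illustrative cutoff for "low or middle income"

countries = pd.read_csv("cia_factbook_countries_by_gdppp.csv")  # assumed file name
exports = pd.read_csv("export-of-commodities-2000s.csv")        # assumed file name

# 1) Flag low- and middle-income countries for use as a pivot-table filter.
countries["is_low_or_mid_income"] = countries["gdp_per_capita"] < GDP_THRESHOLD

# 2) Derive the 2-digit HS product group (85 = electrical, electronic equipment)
#    from the 6-digit code in the "prod" field.
exports["productgroupcode"] = exports["prod"].astype(str).str.zfill(6).str[:2]
```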

2.4 Performance Measurements

In this section we discuss the time consumed in loading the data and the overhead imposed by the dataspace management system. The tests were performed on a computer with Intel Core i5 dual-core 2.4 GHz CPU, 4 GB RAM, and a normal 500 GB 7200 rpm disk. The operating system was 64-bit Windows 7 Enterprise with Microsoft Excel 2016 MSO 64-bit.

2.4.1 Performance of the Dataspace Application

We have tested the dataspace application with multiple randomly generated files in addition to the real files discussed in Sect. 2.3. To eliminate network delays, the tests were performed on the same computer where the application was running. The upload speed is ca. 5 MB/s and the download speed ca. 38 MB/s. The size of the largest file, export-of-commodities-2000s, was 2.3 GB. Its upload time was 7 min 37 s and its download time 1 min 3 s.

2.4.2 Data Loading Performance

Since the export of commodities 2000–2015 data set is very large, it had a dominant role when loading data into Excel’s data model. We have used both this data set and artificially constructed sets to determine the performance. Overall, with our hardware configuration, Excel reads data from CSV files into its data model at ca. 1 million lines per minute, and the performance remains linear. Loading the export-of-commodities-2000s data set took 8 min 30 s, and generating the pivot table of Fig. 2.6 from the data took about 1 min.

2.5 Summary, Conclusions, and Future Work

In this paper we have presented a web-based dataspace management system for large data sets. Our principal aim is to help data analysts discover problems with data integration. This is done by storing the data sets together with their metadata. The metadata includes scale, “eventness,” and unit for measures and a list of compatible fields for dimensions.

The web software was built using the Python Flask framework, which is available for Microsoft Windows and other operating systems. The software is freely available at https://sourceforge.net/projects/simple-dataspace-management.

Our future research will have three focus areas: summarizability, platform, and cloud integration.

We aim to provide fully integrated support for detecting summarizability problems when the analyst creates an OLAP cube based on the data model. For this purpose, we shall improve the “verifier” Excel macro so that it will be able to inspect all DAX language formulas from a pivot table. Since aggregation can be done using the DAX language, the macro will be able to compare the DAX aggregation formula with the metadata from the data management system and find out whether the aggregation is correct in terms of summarizability.

The second focus is the dataspace platform itself. We shall expand the dataspace software to make it more user-friendly and capable of storing and inspecting data in various formats. Moreover, we are working on a feature that allows the user to test designs for OLAP cubes using the platform. Figure 2.7 demonstrates the feature: the user has selected the data sets and their fields, and the software analyzes whether the combination will produce a good joined data set.

Fig. 2.7 Tools for analyzing the joining of data

The third focus is obvious: when dealing with Big Data, cloud systems are its natural habitat (see [18]). We have started porting the dataspace manager to the Google Cloud [19] platform, and the results have been promising.