SimbaQL: A Query Language for Multi-source Heterogeneous Data

Li, Yuepeng; Shen, Zhihong; Li, Jianhui

doi:10.1007/978-3-030-28061-1_27

Yuepeng Li^13,14,
Zhihong Shen¹³ &
Jianhui Li¹³

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11473))

Included in the following conference series:

International Conference on Big Scientific Data Management

814 Accesses
1 Citations

Abstract

In a data-driven era, scientific discovery by querying integrated heterogeneous data is becoming a popular approach. However, most current search engines retrieve data using SQL, which only addresses part of the common data processing use cases in data discovery for scientific research. Besides, the expressive power of SQL and other popular languages can’t describe some simple but useful second order logic queries. Therefore, we first proposed an abstract data processing model which contains four components: data set, data source, data model, and analysis tool. Then, we introduced a unified data model and SimbaQL query language to describe the steps of data processing. At last, we studied two cases by describing the data processing using SimbaQL, and it turns out that SimbaQL can describe these tasks properly.

Access provided by Autonomous University of Puebla. Download conference paper PDF

Squerall: Virtual Ontology-Based Access to Heterogeneous and Large Data Sources

Multi-model query languages: taming the variety of big data

Article Open access 31 May 2023

SciData: a data model and ontology for semantic representation of scientific data

Article Open access 14 October 2016

Keywords

1 Introduction

Since the concept of “Big Data” was proposed in the 2011 Mckinsey annual report, the definition of “Big Data” has been in dispute on academia and industry [1]. Most of the dispute came from the characteristics of big data in different domains. And in scientific research, data sharing is an obvious different characteristic with other domains [2]. For example, most of nowadays significant scientific programs, such as Human Genome Project (HGP), Large Hardon Collider (LHC), etc. share data around the world. And scientists make discoveries from the open data, including the discoveries of gravitational wave and god particle which win the Nobel prize. Generally, several open source data will be involved in a scientific research. For example, the World Data Center for Microorgannisms (WDCM) [3] published a knowledge graph that is built from 36 open source data including Taxonomy, Genbank, Gen, etc. These data has various formats, such as file, relational database etc. BigDAWG [4] is a polystore database, which manages the medical record, sequence data, image, radio etc. for medical research. The applications mentioned above have a similar characteristic that all of them need search heterogeneous data in data processing. However, the query language of current tools for heterogeneous data cannot describe the whole life cycle of these data processing. Therefore, we proposed a query language for multi-source heterogeneous data, which can import/export data to database, transform data from different data models and integrate multi-source data.

2 Related Work

Multiple types of databases have been used to manage data in nowadays system, such as relational database, graph database, document database, key-value database and so on. These databases build data models to describe the data formally, and retrieval data using the operations or query language of data model. For example, relational databases query the relational data model by SQL; graph databases describe data by property graph model and RDF, and query the graph by Cypher or SparQL; Xquery language search object model for XML file [5].

When it comes to heterogeneous data, there are three methods to access data: data integration, multi-model database, polystore [4]. Data integration [6] is defined as a three tuples (G, L, M), where G is global data model, L is local data model, and M is the mapping from local data models to global data model. Data integration queries local data models using global data model query language. For example, popular-used SQL query engines, such as Impala, SparkSQL, Presto, query multiple data sources using SQL language. Besides, data warehouse, federated database [7] are also methods of data integration. Multi-model database [8] stores different types of data using one unified data model, and query data using unified language, such as OrientDB [8], Ag- groDB, etc. Polystore database stores data in different databases. When query by language A, polyestore database execute the query in A database. If the target data is not stored in A database, then polyestore will CAST the target data to A database [9].

However, above methods for heterogeneous data have several defects. First, data integration mainly use SQL as global query language, but NoSQL databases are de- signed to against SQL, therefore SQL is obviously not suitable for querying different data model. Second, multi-model databases store data in one backend, which is not suitable for multi-source data. Besides, although polystore databases can query data in different ways, but it require users to learn many database languages, which means polystore just transform data between different data models and cannot achieve the goal of unified search. And current polyestore query language only supports one time transform, and cannot query the intermediate results many times.

As regard as the query language for heterogeneous data, current query language have another two defects. On one side, current query language is not designed for data integration specially. For example, in the processing of build knowledge graph, WDCM wants to create the relation between species’s information stored in relational database and the gene sequence information stored in Genbank file. However, SQL failed to treat species and genbank file as a whole to extract and integrate data. On the other side, current languages are good at dealing with different type of data, and most of them can only solve first logic problems [10,11,12]. For example, relational data model describe relationships between objects using foreign key, which means we can only get the relationship between objects by computing. As for Cypher [13], it can only query related data stored in database. The following query is illegal

To solve the problems mentioned above, we first proposed an intermediate data model which can transform between common data models. Then, we defined the operations and query language on the data model. At last, we studied the query language by two scientific cases. The main contributions of this paper are as followed. First, we abstract the scientific data processing by data set, data source, data model, analyzing tools. Second, we designed data model, Linked-Document-Model, and query language, SimbaQL, for data integration specialty. Third, we enhance the expressive power by introducing the operation of describing the property of LDM.

3 Methodology

In big data era, scientific researches are mainly driven by data. If one scientist wants to find which factor is the most important to a country’s happiness, he need to get the published GDP data, agriculture data, political data and so on. These data may be located in files, web service, database and so on. Then, he needs to import theses data into databases, and extract interested part from database. Finally he will integrate all the data to the format that analyze tools can recognize.

Above data processing is not alone. As Fig. 1 shows, the last step in scientific discoveries is analyzing data. However, most of scientist’s work are preparing data from heterogeneous sources. In the process of preparing data, three types of components are involve: data set, data source and analyzing tools. Data set, such as file, can only store data. Data source, such as database, retrieval engine, can not only store data, but also manage data. Analyzing tools, such as Spark, can only process data. We can divide the data process into the following steps:

1.
Find open source data set or build his own data set.
2.
Import data set into proper data source.
3.
Search interested data and integrate them.
4.
Reprocess the integrated data to the format that analyzing tools can recognize.
5.
Analyze the data by tools.

In this paper, we want to design a query language that can accomplish the tasks in 1- 4 steps. For the first step, we define different data set types, such as file, http, web- service, to get the target data. For the second step, we define the import operation of data set. For 3–4 steps, we defined intermediate data model LDM and its mapping rules and operations. Besides, we define the description operation to enhance the expressive power of SimbaQL, and export operation to push the LDM data to analyzing tools. Our abstract model of data processing in scientific research is shown as Fig. 2.

As Fig. 3 shows, heterogeneous data query system that implement SimbaQL contains three components: SimbaQL parser, execute engine and data source. When user input a query, the parser will translate it into data set, data source, and operations of LDM. For example, if we want to query a csv file, the system may import the file data set to relational database, and then perform the query. The system will invoke the query en- gine of this data source when a query only involve a single data source. And If one or many data sources cannot solve the query problem, then execute engine will make a compensate computing for this query.

4 Simba Query Language

In this section, we proposed a unified data model LDM, which contains four types of operations. After that we defined the mapping rules between four often-used data mod- els and LDM. At last, we introduced the informal SimbaQL language grammar.

Linked-Document-Model. The goal of data model is to describe the objective things and relationship between things formally. There are two methods to describe a thing: schema and schemaless. Schema is more suitable for computing, but schemaless way is more suitable for altering thing dynamically. Similarly, there are two methods to describe the relationships between things: storage and computing. Relational data model use computing to get relationship, but the relationship has less semantic. Graph model use storage to get relationship, but it costs storage space.

To satisfy unified query for heterogeneous data, we want LDM has the following three characteristics.

1.
Support both schema and schemaless ways to describe things.
2.
Support both storage and computing ways to describe relationship.
3.
Support describing the data processing in scientific research.

Link and Document. Document is a semi-structured document which has multiple attributes. Each document must has unique key to identify this document. We can add attributes to a document dynamically, and we can also require a document must have some attributes. Generally, we are talking about document set which must have a label name, and one document can belong to multiple document sets. Link is a special document which must have two attributes (from, to). Similar with document set, link set must have a label name. We can define a link set by two ways. For example, we can define the “knows” link by storage way,

$$\begin{aligned} knows=\left[ (from:1, to:2), (from:2, to:3) \right] \end{aligned}$$

and we can also define the link by computing way

Mapping Rule. LDM is a tuple defined as followed. The first item of this tuple is

$$\begin{aligned} ldm=\big (\{ document \; sets \}, \{ link \; sets \} \big ) \end{aligned}$$

a set of document set, and the second item is a set of link set. A LDM can stand for a data source where the document set and link set in LDM is mapped to the components of data source. For example, table in relational database is mapped to a document set, and foreign key is mapped to a link set. As Table 1 shows, we define the mapping rules between relational data model, graph data model, key-value data model, document data model and LDM.

Table 1. LDM mapping rules.

Full size table

Operations. There are four types of operations on LDM: set operation, link set operation, document set operation and transform operation. The detail of these operations are shown as Table 2. Besides, we can access the data in LDM similar to url, the url is defined as followed.

$$\begin{aligned}<datasource>.<document>.<links>.<index>.<attributes> \end{aligned}$$

For example, wdcm.species.gen stands for the gene of the species.

Table 2. LDM mapping rules.

Full size table

Description. We use second order logic to describe what characteristic a LDM should have. And the way we describe characteristic is by rules. For example, we first define a social network, which has person document set and knows link set.

$$\begin{aligned} soc=(\{ person \} , \{ knows\}) \end{aligned}$$

We can require the person set satisfy a transitive closure characteristic: suppose that p1,p2,p3 are documents in person, if p1 knows p2, and p2 knows p3, then p1 must knows p3. This closure can be described as followed,

The last line stands for which part of the set that soc.person should contain. Another case is the person set stratify the condition that it contains all the person that “simba” indirectly knows. The characteristic is defined as followed

The keyword ‘define’ defines a new link named ‘indirknows’.

4.1 SimbaQL Grammar

As described in Sect. 3, SimbaQL need to describe the whole data processing in scientific research. We divide the grammars into four parts: data set grammar, data source grammar, LDM operation grammar, and LDM description grammar. We use keyword ‘create’ to define a data set or data source, and a data set can call the ‘importto’ function to import a data set to data source. Data source can call ‘query’ function to search the local database. At the same time, data source is also a LDM, thus can perform the operations of LDM. For example,

The most used three operations in LDM are selection, projection, union. The grammars for theses operations are as follows,

For the LDM description, we use keyword ‘st’ to represent this LDM should satisfy the following characteristic. For example

5 Case Study

In this section, we apply SimbaQL in two actual scientific research cases. The first case is the data processing when building the WDCM knowledge graph. The knowledge graph contains the information of microorganism species, protein, gene and the relationship between them. In this process, WDCM creates a relational database to store the species’s information, and they want to build the relationship between species and their gene sequence which can get from the file of genbank. This process can be implemented by SimbaQL as follows:

The second case is that National Nature Science Foundation of China has foundation information stored in relational database, and they want to know the mapping between paper and foundation. However, the papers published by a foundation are listed in the summary report. Therefore, they need to extract the papers from the reports, and integrated the data. The paper can be extracted by regular expression, and this process can be described by SimbaQL as follows:

6 Conclusion and Future Work

We introduced SimbaQL for multi-source heterogeneous data query which is a common data processing in scientific research. First, we propose an abstract model of data processing, which consists of data set, data source, data model, and analyzing tools. Then, we defined an intimidate data model LDM)to map the data between heterogeneous data, and operations on LDM to query and integrate data. We study two actual cases using SimbaQL, and it turns out that SimbaQL could describe the common data processing in scientific research. In the future, we will give the formal definition of SimbaSQL, prove the expressive power of language, and implement the query engine for heterogeneous data.

References

Ward, J.S., Barker, A.: Undefined by data: a survey of big data definitions, arXiv preprint arXiv:1309.5821 (2013)
Lu, J., Holubová, I., et al.: Multi-model data management: what’s new and what’s next? (2017)
Google Scholar
World data centre for microorgannisms. http://www.wdcm.org/
Duggan, J., et al.: The BigDAWG polystore system. ACM Sigmod Rec. 44(2), 11–16 (2015)
Article Google Scholar
Chamberlin, D.: XQuery: an xml query language. IBM Syst. J. 41(4), 597–615 (2002)
Article Google Scholar
Lenzerini, M.: Data integration: a theoretical perspective. In: Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 233–246. ACM (2002)
Google Scholar
Sheth, A.P., Larson, J.A.: Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Comput. Surv. (CSUR) 22(3), 183–236 (1990)
Article Google Scholar
Baranyi, P.: Multi-model database orientdb. https://orientdb.com/
Gadepally, V., et al.: Version 0.1 of the BigDAWG polystore system, arXiv preprint arXiv:1707.00721 (2017)
Libkin, L.: Expressive power of SQL. Theor. Comput. Sci. 296(3), 379–404 (2003)
Article MathSciNet Google Scholar
Angles, R., Gutierrez, C.: The expressive power of SPARQL. In: Sheth, A., et al. (eds.) ISWC 2008. LNCS, vol. 5318, pp. 114–129. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88564-1_8
Chapter Google Scholar
Bagan, G., Bonifati, A., Ciucanu, R., Fletcher, G.H., Lemay, A., Advokaat, N.: Generating flexible workloads for graph databases. Proc. VLDB Endow. 9(13), 1457–1460 (2016)
Article Google Scholar
Cypher-the neo4j query language. http://www.neo4j.org/learn/cypher

Download references

Author information

Authors and Affiliations

Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100190, China
Yuepeng Li, Zhihong Shen & Jianhui Li
University of Chinese Academy of Sciences, Beijing, 100049, China
Yuepeng Li

Authors

Yuepeng Li
View author publications
You can also search for this author in PubMed Google Scholar
Zhihong Shen
View author publications
You can also search for this author in PubMed Google Scholar
Jianhui Li
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jianhui Li .

Editor information

Editors and Affiliations

CNIC, CAS, Beijing, China
Jianhui Li
Renmin University of China, Beijing, China
Xiaofeng Meng
Centrum Wiskunde and Informatica, Amsterdam, The Netherlands
Ying Zhang
CNIC, CAS, Beijing, China
Wenjuan Cui
Tsinghua University, Beijing, China
Zhihui Du

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Li, Y., Shen, Z., Li, J. (2019). SimbaQL: A Query Language for Multi-source Heterogeneous Data. In: Li, J., Meng, X., Zhang, Y., Cui, W., Du, Z. (eds) Big Scientific Data Management. BigSDM 2018. Lecture Notes in Computer Science(), vol 11473. Springer, Cham. https://doi.org/10.1007/978-3-030-28061-1_27

Download citation

DOI: https://doi.org/10.1007/978-3-030-28061-1_27
Published: 07 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-28060-4
Online ISBN: 978-3-030-28061-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics