1 Introduction

In collaborative environments, e.g., scientific databases, large volumes of data are shared among scientists who collaborate to curate data coming from experiments and analytical processes. A typical scenario is when scientists produce some dataset, share it among collaborators who are not necessarily co-located, collectively update it, share the updates among themselves for commenting, and keep updating the data until they converge and agree on the final content. Once the dataset stabilizes, they eventually publish it. For example, this scenario takes place when deciding on the function of a particular gene or allele, or a protein binding site. While being discussed and commented on, the data, along with all of its updates, should be visible to all collaborators. Ultimately, the data custodian, e.g., the principal investigator (PI, for short), makes the final decision of approving some updates and rejecting others.

Current database technologies fall short in supporting the above collaborative environment. SQL supports the GRANT/REVOKE access model, which assigns data access rights based solely on the identity of the user [1, 2]. In this case, when an update takes place, it is not reflected in the database until the update-issuer commits it, e.g., towards the end of the update transaction. Once committed, the update is visible to the collaborators. However, if this update needs to be approved by the PI, the updated value cannot be committed until it is approved. Hence, the updated value cannot be shared with the other collaborators for commenting. Note that we will be using the terms user and collaborator interchangeably throughout the paper.

In previous papers [3, 4], we presented an Update Authorization mechanism (UPA) that enables collaborators to view updates made by other collaborators as soon as they are inserted into the database. For that, we added a status field to each new version of a data item, which takes one of three values: approved, rejected, or pending. Once an update is made, the new version can be viewed by collaborators as pending approval by the PI. In this case, collaborators can use this new version in their experiments or calculations and provide feedback that helps the PI decide whether to approve or reject the pending version. In [4], we extended the initial UPA system to support various types of dependencies between the different versions of the same data item, and defined dependency rules for the Insert, Approve, and Reject operations.

A typical example where UPA is needed occurs when a group of collaborators works together on a wiki page. An initial page is created by its owner, and collaborators update it by adding new information or deleting/updating existing data. Each time a collaborator commits an update, a new version of the page is created, marked as pending, and made available to others. Usually, the owner of the page (the PI in UPA) has full control over the wiki page and decides which pending versions are accepted or rejected. The decision to approve or reject a certain version could be based on the comments that were inserted on that version. In such a scenario, it is often necessary to search the history of updates to find specific versions that were accepted, rejected, or are still pending. Unfortunately, current database systems do not allow the user to access a version directly based on its status. For example, the PI may want to know the user IDs associated with all versions that were previously rejected, or the timestamp of the last rejected version. Such conditions cannot be explicitly expressed in the queries of current database systems. Rather, the user has to write a complex query to search for a specific version or set of versions.

To better illustrate the advantages of having a query language for multiversion data that includes keywords for history tracking, consider the example in Fig. 1, in which two users, Alice and Bob, collaborate with the PI in updating a wiki page. At time \(I_1\), user Alice updates object A inside the page and changes its value from 1 to 2. At time \(I_2\), the PI and Bob are notified of the update, which is marked as pending approval until the PI approves/rejects it. In the meantime, Bob can view and comment on A. Any further update made by Bob before the approval/rejection of the value 2 is also marked as pending. The PI can view the feedback from all other collaborators, e.g., from Bob, before approving or rejecting the update. At time \(I_3\), the PI examines the update and approves/rejects it. Next, at time \(I_4\), if the PI approves the update, it is marked as approved and becomes A’s current approved value. On the other hand, if the PI rejects Alice’s update, the value 2 of data object A is marked as rejected, and the value 1 remains A’s current approved value. Data modification actions, i.e., update, approve, or reject, are recorded in UPA so that each collaborator can track the history of an object and the status of each update. The complete details of the UPA system can be found in [3, 4].

To illustrate the importance of creating a new Query Interface Language for UPA, we note that in current database systems, if a collaborator wishes to examine a certain version or set of versions of A, he/she can either retrieve all versions and search manually for those required, or write a complex query that searches the metadata of each version for its status and timestamp. The collaborator is restricted to these methods because existing query languages do not provide a way to search for a certain version of a data item based on its status and/or its location within the whole set of versions of the data item. In this paper, we present a Query Interface Language for UPA that allows users to easily retrieve a certain version or set of versions using UPA keywords and options that are added to the traditional SQL query format.

Fig. 1: An example of the Update Pending Approval (UPA) model

The UPA model is implemented on top of Apache HBase [5]. HBase is a distributed and scalable cluster-based database that is suitable for storing big data on the cloud [5]. While HBase supports versioning and history tracking of updates, it does not support the notion of update approval or rejection, which are essential features in data collaboration environments. Hence, the UPA model extends HBase with the following functionalities. First, UPA maintains the history of all the updates for a given data item along with the associated metadata. Each update is marked as Approved, Rejected, or Pending. UPA extends HBase to allow for querying (1) the history of all updates, (2) the approved data only, or (3) the most recent values only (approved or pending). Second, UPA introduces three modes of operation for a data cell depending on the most frequently queried values, i.e., (1) the last inserted values, (2) the last approved values, or (3) both the last inserted and the last approved values. In addition, UPA can dynamically switch between modes to adapt to the current query workload. For more details on UPA, please refer to [3, 4].

In this paper, we present an SQL query interface model that is used by collaborators and PIs to interact with the UPA system; mainly, to easily and efficiently retrieve specific versions or specific parts of the update history of a data item. We define a set of UPA keywords that identify versions based on two specifications: the position of the version in the timeline of the update history (for example, first, last, ...), and the status of the version (pending, approved, or rejected). We also define a set of three UPA options (true values, false positives, and false negatives) that determine whether a value satisfies a condition based on future decisions by the PIs. We explain how the UPA keywords and options can be integrated within an SQL query and interpreted by the system. We test the performance of multiple SQL queries that contain different UPA keywords and options with different complexities. We name the system proposed in this paper COACT: A COllAboration meChanism for querying daTabases.

The rest of this paper proceeds as follows. Section 2 discusses the related work. Section 3 presents an overview of UPA. Section 4 introduces the SQL query interface for UPA and the implementation strategy. We present our experiments in Sect. 5 and conclude in Sect. 6.

2 Related work

In this section, we present several query languages that have been proposed to support the interactions with different types of databases. Mainly, we focus on query languages for temporal databases, transaction-time databases, multiversion databases, web databases, and graph databases. We also illustrate query languages that search in the history of data similar to the language proposed in this paper.

2.1 Temporal databases

In temporal databases, timestamps are associated with object versions, tuple versions, or attribute versions. Timestamps may be either single or multidimensional, based on validation time, transaction time, or custom user-defined time. Query languages for temporal databases are founded on the semantics of time [6,7,8]. Their main feature is the exploitation of the total ordering of versions imposed by each temporal dimension. However, time is only one of multiple semantics that may be associated with versions. For instance, in UPA, we have the notion of version status, which can be pending, approved, or rejected. Hence, in COACT we combine the time dimension with the status dimension to create the UPA keywords and options.

2.2 Transaction time database systems

Transaction time database systems [9] provide access to both current and historical information. Immortal DB [10] is a transaction time database system that uses versions to support access to both current and historical data. In Immortal DB, a record version is linked to its immediate predecessor version via a version chain. The Immortal DB prototype provides transaction time functionality via a collection of changes to SQL Server. First, a transaction time table is specified via an IMMORTAL attribute in the table create statement. Second, an AS OF clause added to the transaction statement specifies queries on historical data. Finally, each relational tuple has 14 bytes appended at the end that contain versioning information, including a timestamp, a sequence number, and a version chain pointer. In [11], the authors integrate a temporal indexing technique, the Time Split B-Tree (TSB-tree), into Immortal DB to serve as an integrated index for accessing both current and historical versions of data. To achieve this, the TSB-tree performs both key splits, like a B+tree, and time splits. Time splits are required to index versions both by key and by time. Versions whose end times are before the split time are moved to the historical page, while versions with start times later than the split time are put in the current page. Accessing current data is done via existing B+tree code, while accessing historical data is done via the TSB-tree index. In the TSB-tree, data is partitioned by key and time into key-time rectangles. Each TSB-tree index term that references a data page includes a description of this key-time rectangle and a pointer to the page containing the data in this rectangle.

2.3 Multiversion databases

An approach based on first-order calculus was proposed by the authors of [12] for querying data stored in multiversion databases. The query language in [12] (VQL) provides users with the ability to navigate through object versions. The approach is to define a global set of Database Version (DBV) identifiers V, which contains all versions of an object over time. Queries are first expressed in a first-order representation and then translated to the Object Query Language (OQL). This approach is efficient when the number of versions is relatively small. However, in large databases, when the number of versions of certain objects grows tremendously, the approach becomes impractical, as the possible combinations between object identifiers and DBV identifiers grow indefinitely.

2.4 Web data evolution

Tracking the evolution of data in web archives was addressed in [14]. The approach aims at preserving Linked Open Data (LOD) by keeping it constantly accessible and integrated into a framework for evolving datasets. The framework offers functionality for versioning, provenance tracking, change detection, and quality control, while at the same time providing ways to query the data both statically and across time. A key feature of the system is the evolving entity object, which captures structural and semantic constructs of a dataset and acts as a common placeholder for provenance, temporal, change, quality, and trust metadata. The query language presented in [14] provides four main types of queries: (1) data queries, (2) longitudinal queries, which limit the timeline to specific versions, time periods, or successive versions, (3) queries on change, which retrieve changes between two versions of an entity, and (4) mixed queries on changes and data. Similar to COACT, the query language in [14] appends custom keywords, such as FROM DATASET, FROM CHANGES, AT VERSION, AFTER VERSION, BEFORE VERSION, and BETWEEN VERSIONS.

2.5 Graph databases

Graph databases (GDBs) have gained importance in recent years, emerging as the prime choice for storing and managing data in graph-like structures. Graph query languages need not only SQL-like functionality for querying graph data, but also intrinsic support for typical graph-style applications, such as reachability analysis, path finding, and graph construction. In general, database languages for GDBs can be divided into two main types: (1) graph pattern-matching query languages and (2) general-purpose imperative graph languages. The first type includes: (i) SPARQL [15], the standard query language for the Resource Description Framework (RDF) model. A SPARQL query usually consists of triple patterns, conjunctions, disjunctions, and optional patterns. SPARQL provides specific graph traversal syntax for graph data. (ii) Cypher [16], the query language of the graph database Neo4j, was the first pattern-matching query language to target the Property Graph (PG) data model. In addition to the standard graph elements of nodes and edges, Cypher adds labels and properties as concepts. (iii) GraphQL [17] is a pattern-matching query language based on a PostgreSQL-like model. The second type includes: (i) Green-Marl [18], a language that is specifically designed to express graph analysis algorithms for the PG data model. (ii) GRAPHiQL [19] is a graph language for relational databases based on Pregel-style vertex-centric programming. (iii) Gremlin [20] is a graph traversal language that uses JVM-based languages such as Java and Groovy as host languages.

2.6 Data citation

Accessing the history of evolving data for citation purposes was addressed in [13]. The proposed system assigns persistent identifiers (PIDs) to database queries, which define the subset of data used in the query. Similar to our system, the approach relies mainly on recording timestamps for all operations that modify the data and maintaining their history along with the original values. The system in [13] stores the timestamp of a SELECT query Q and the resulting output subset of data that needs to be persistent and citable. It also assigns a PID to the SELECT query, which serves as the identifier of the citation record. Whenever the subset that was delivered by Q is referenced at a later point in time, the SQL query statement obtaining the previously stored data set, denoted Q’, is executed. As the data in the database evolves, updates and deletes occur to the original data. To obtain the original result set at a later point in time (for correct citation), the query Q’ has to be rewritten to include only data that was valid at the time of the execution of Q. Contrary to COACT, the approach in [13] does not contain a notion of data status, and does not provide a formal method for all cases required in query rewriting.

A key contribution that distinguishes our work is the ability to query historical data based on the status of the data record and the decisions made in the future by the PIs. These contributions are explained in the next sections using UPA keywords and UPA options. To the best of our knowledge, COACT is the first system to introduce the status dimension in the query processing of historical data, and the adaptation of such queries to future PI decisions.

3 Overview of UPA

In this section, we provide a brief overview of the HBase architecture with its data access and manipulation operations. Then, we illustrate how we extend HBase to support the UPA model. This section contains a summary of the UPA model. For complete details, the reader can refer to [3, 4].

3.1 HBase basics

HBase is a column-oriented data store running on top of HDFS [21, 22]. Instead of complete rows, data is stored as fragments of columns in the form of \(\langle \)key, value\(\rangle \) pairs. HBase contains two main components: (1) an HMaster, and (2) a set of Region Servers. The HMaster is responsible for administrative operations, e.g., creating and dropping tables, assigning regions, and load balancing. Region Servers manage the actual data stores. A Region Server is responsible for region splitting and merging, and for data manipulation operations [5]. An HBase table may have one or more regions, where each region contains HBase Store Files that are the basic data storage units. In addition, each region contains a temporary Memory Store and a Write-Ahead-Log file for fault tolerance.

HBase provides four basic operations: Put, Delete, Get, and Scan. Put inserts the value of a data item, or cell, into an HBase table. Each cell is identified by four fields that constitute its key, namely, Row, Column Family, Qualifier, and Timestamp. Row is the main identifier in an HBase table, similar to a primary key or a tuple identifier in a relational database; Column Family groups related qualifiers together; and Timestamp is usually the time at which the value was inserted. Put can either insert values into a newly created qualifier or update the value of an existing qualifier. In the latter case, the previous value is not removed, but is saved as an older version of the data. Hence, HBase uses the concept of versioning, in which a set of previous versions of a data cell is retained as specified by the table specifications. Delete places a marker, termed a tombstone, on the data values in the memory store. When data is flushed to disk, e.g., during an HBase major compaction operation, the data is actually deleted. Finally, Get retrieves data within a single Row. In contrast, Scan retrieves data from an entire table.
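The versioning behavior described above can be illustrated with a minimal in-memory sketch. This is a hypothetical model for illustration only, not the real HBase client API: a cell is keyed by (row, family, qualifier), each Put appends a timestamped version, and only the newest n versions survive.

```python
# Hypothetical in-memory sketch of HBase-style cell versioning; not the
# real HBase client API. A cell is keyed by (row, family, qualifier) and
# each Put adds a new timestamped version instead of overwriting.
class VersionedStore:
    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.cells = {}  # (row, family, qualifier) -> [(timestamp, value), ...]

    def put(self, row, family, qualifier, timestamp, value):
        versions = self.cells.setdefault((row, family, qualifier), [])
        versions.append((timestamp, value))
        versions.sort()                       # keep versions ordered by timestamp
        del versions[:-self.max_versions]     # retain only the newest n versions

    def get(self, row, family, qualifier):
        """Return the most recent version of a single cell, like HBase Get."""
        versions = self.cells.get((row, family, qualifier), [])
        return versions[-1] if versions else None

    def scan(self):
        """Iterate over the latest version of every cell, like HBase Scan."""
        for key, versions in self.cells.items():
            yield key, versions[-1]

store = VersionedStore(max_versions=2)
store.put("ROW1", "F1", "Q1", 1, "a")
store.put("ROW1", "F1", "Q1", 2, "b")
store.put("ROW1", "F1", "Q1", 3, "c")    # oldest version "a" is evicted
print(store.get("ROW1", "F1", "Q1"))     # (3, 'c')
```

The eviction in `put` mirrors HBase's per-column-family maximum-versions setting: once the limit is exceeded, the oldest versions are silently dropped.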

3.2 History tracking

Tracking the History of data refers to the process of monitoring (i) how the data has changed since the time it has been first created, (ii) who changed the data, at what times, and possibly (iii) the reasons, if any, behind the changes. Data History Tracking is an important operation in collaborative environments as well as in data auditing in database systems. The validity of future results depends on the correctness of existing data and how it has been derived.

History tracking helps answer a variety of questions, including: Which users mostly insert correct data? Which inserted data is found to be faulty and hence should be rejected? How many values were rejected during the lifetime of a cell, i.e., what is the rejection percentage for the cell? What is the most common reason for rejecting values of a given cell? Has a given value been assigned before to a given cell, and if so, why was it refuted or changed? And finally, what is the evolution history of the values of a given cell?

Existing DBMSs do not provide means for answering these questions related to history tracking. For instance, HBase saves the new value of an existing cell as a new version of the cell. The user can query the value of the cell and the time at which it is inserted. Other important provenance information, e.g., the user who inserted the data, and the parameters of the environment in which the data was generated (which is useful for scientific experimentation), are not saved. Hence, the process for data history tracking misses important features and needs to be augmented to satisfy the requirements of researchers and collaborators.

UPA enables history tracking features by saving, with each operation on a data cell, the metadata related to the operation, e.g., the timestamp, the user ID, the type and reason for the operation, the status of the data value, and any other metadata related to the environment in which the data is generated, e.g., the machine ID. The proposed history tracking mechanism supports operations on saved data, e.g., reverting a data cell to a previous stable state, and building a timeline of the lifecycle of data cells.
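The metadata attached to each operation can be sketched as a small record type. The field names below are illustrative, not taken from the UPA implementation; they follow the list of metadata items mentioned above (timestamp, user ID, operation type, status, and free-form comments such as the machine ID).

```python
# Hypothetical shape of a UPA history record; field names are illustrative,
# not taken from the actual UPA implementation.
from dataclasses import dataclass, field

@dataclass
class HistoryRecord:
    value: object
    timestamp: int
    user_id: str
    operation: str             # "insert", "update", "delete", or "approve"
    status: str = "pending"    # "pending", "approved", or "rejected"
    comments: dict = field(default_factory=dict)  # e.g. machine ID, parameters

def timeline(records):
    """Build a human-readable timeline of a cell's lifecycle."""
    for r in sorted(records, key=lambda r: r.timestamp):
        yield f"t={r.timestamp} {r.user_id} {r.operation} -> {r.value} [{r.status}]"

history = [
    HistoryRecord(2, 1, "User1", "insert", "approved"),
    HistoryRecord(5, 3, "User2", "update", comments={"machine": "lab-07"}),
]
for line in timeline(history):
    print(line)
```

Sorting by timestamp yields the kind of lifecycle timeline the text describes; reverting to a previous stable state amounts to locating the last record with status `"approved"` in this list.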

The process of approving or rejecting new updates is performed by privileged users, e.g., principal investigators, lab leaders, or data administrators, who monitor and approve or reject the data values inserted or updated by various collaborators. For simplicity, we refer to a user with approval and rejection authorities as the PI. We differentiate between three types of inserted data values: (i) pending: a data value that is yet to be approved or rejected by a PI, (ii) approved: a data value that has been approved by a PI, or (iii) rejected: a data value that has been rejected by a PI. A PI can approve or reject data items either individually or as groups of data items through bulk approve and reject operations (complete details in [3]).

Storing the data and the history of the updates over time along with the other related metadata would adversely affect the overall performance of the system. Users who are not interested in querying or accessing historical data should not be penalized by the potential overhead. To address this issue, we store the history of a data cell in a separate storage entity, termed the History table. Only a specific version of the cell will be stored in its original table, as we will explain in the next section.

3.3 Supporting UPA

In this section, we present the proposed format of the History table and how it complies with HBase standards. In HBase, each cell is identified by its \(\langle \)Table, Row, Family, Qualifier\(\rangle \) tuple, while each version of the cell is defined by its timestamp. To illustrate the format of a history record, we present an example in Fig. 2. Suppose that user User1 inserts the cell \(\langle \)Table:ROW1:F1:Q1\(\rangle \) at time T1. The value of 2 is inserted into the original table and the first history record of the cell is inserted into the History table. The status of the history record is marked as “pending”. Suppose that the PI approves the history record at time T2, setting the status of the history record to “approved”. Suppose that at time T3, User2 inserts the value 5 into the cell. The cell value in the Original table is set to 5, a new history record with value 5 is added to the History table, and its status is set to “pending”. Now, suppose that at time T4, User3 deletes the cell in the Original table. In this case, the cell in the Original table is marked as deleted, and a new history record, with operation equal to “delete” and status equal to “pending”, is added to the History table. The last two history records are both conditional, pending approval or rejection by the PI.
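The scenario just described can be replayed as a self-contained sketch. The record layout below is a simplification of the actual History-table format, used only to make the Original-table/History-table interplay concrete.

```python
# Self-contained walk-through of the example above; the record layout is a
# simplification of the actual History-table format described in the paper.
original = {}   # (table, row, family, qualifier) -> current cell value
history = []    # append-only history records

cell = ("Table", "ROW1", "F1", "Q1")

def record(ts, user, op, value, status="pending"):
    history.append({"cell": cell, "ts": ts, "user": user,
                    "op": op, "value": value, "status": status})

# T1: User1 inserts 2 -> value goes to the original table; record is pending.
original[cell] = 2
record(1, "User1", "insert", 2)

# T2: the PI approves the T1 record.
history[0]["status"] = "approved"

# T3: User2 inserts 5 -> original table is overwritten; new pending record.
original[cell] = 5
record(3, "User2", "insert", 5)

# T4: User3 deletes the cell -> cell marked deleted; pending delete record.
original[cell] = None
record(4, "User3", "delete", None)

pending = [r for r in history if r["status"] == "pending"]
print(len(pending))   # 2: the insert of 5 and the delete both await the PI
```

Note that the two most recent records remain conditional: until the PI rules on them, both the value 5 and the deletion are provisional, exactly as in the narrative above.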

Fig. 2: Formats of records in the original and history tables of UPA

Hence, different values that are inserted into the same cell are stored as different versions of the cell in the History table. The creator of an HBase table defines the maximum number of versions per cell, say n, that can be kept; if n is exceeded, only the newest n versions of the cell are kept and the older versions are automatically deleted. In UPA, versions are managed through the History table. Thus, all data tables have a default value of n equal to 1. In such tables, which we refer to as the original tables, we save either the last inserted value or the last approved value. In UPA Mode 3 (explained in the next section), we need to save both the last inserted and the last approved values of the cell. In that case, we save these values as two different cells, each pointing to the original cell. For example, if we want to save these two values for a cell \(\langle \)Table, row1, family1, qualifier1\(\rangle \), then the last inserted value of the cell is saved in a qualifier termed ’qualifier1:current’ and the last approved value of the cell is saved in a qualifier termed ’qualifier1:approved’. Using this approach, we keep the original tables as compact as possible, leading to high efficiency when executing queries that do not target the history of the cell.
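The Mode 3 naming convention can be sketched as follows. The `:current`/`:approved` suffixes follow the paper's own example; the helper functions are hypothetical, for illustration only.

```python
# Sketch of the Mode 3 naming convention: the last inserted and last approved
# values live in two derived qualifiers of the original table. The
# ":current"/":approved" suffixes follow the paper's example; the helper
# functions themselves are hypothetical.
def mode3_qualifiers(qualifier):
    return qualifier + ":current", qualifier + ":approved"

original = {}
cur_q, app_q = mode3_qualifiers("qualifier1")

def upa_put(row, family, value):
    original[(row, family, cur_q)] = value   # always track the last inserted

def upa_approve(row, family, value):
    original[(row, family, app_q)] = value   # track the last approved

upa_put("row1", "family1", 5)
upa_approve("row1", "family1", 2)
print(original[("row1", "family1", "qualifier1:current")])   # 5
print(original[("row1", "family1", "qualifier1:approved")])  # 2
```

Because both derived qualifiers live in the original table, a Mode 3 query for either value never touches the History table.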

On the other hand, the maximum number of versions in the History table is set to the maximum expected number of versions per cell. Each newly inserted value of the cell creates a new record in the History table, and all history records must be kept as different versions of the cell as long as the cell is not deleted. Hence, the value of n for the History table is set to the maximum expected number of versions per cell across all cells of the original table; n is specified by the creator of the original table at creation time. For example, if the maximum expected number of versions of any cell in the original table is less than 10,000, then the table creator can set n to 10,000.

3.4 The approval process

A history record includes, in addition to the cell value and the timestamp at which the value was inserted, the operation (i.e., insert, update, delete, or approve), the user ID, the record status (i.e., pending, approved, or rejected), and comments, e.g., the ID of the machine used to generate the value and/or experimental parameters. History records of the same cell are automatically sorted according to their timestamps (i.e., the insertion time). Hence, it is efficient to jump to the next or to the previous records.

A PI can approve or reject pending records (i.e., versions) using two methods. In the first method, the PI specifies, in addition to the ID of the cell (i.e., Table, Row, Family, and Qualifier), a list of timestamps of records to be approved (or rejected). The UPA system sequentially approves (or rejects) these records. In the second method, the PI specifies a range using two timestamps, the UPA system searches the History table for all records of the cell whose timestamps fall between the specified timestamps, and then approves (or rejects) these records.
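The two approval methods can be sketched in a few lines. The record layout and the `decide` function below are illustrative, not the actual UPA API; the sketch shows a single entry point handling both an explicit timestamp list (method one) and an inclusive timestamp range (method two).

```python
# Hedged sketch of the two PI approval methods described above: by an
# explicit timestamp list, or by a [start, end] timestamp range.
# Record layout and function name are illustrative, not the real UPA API.
def decide(history, cell, decision, timestamps=None, ts_range=None):
    """Set the status of matching pending records to 'approved' or 'rejected'."""
    for rec in history:
        if rec["cell"] != cell or rec["status"] != "pending":
            continue
        in_list = timestamps is not None and rec["ts"] in timestamps
        in_range = ts_range is not None and ts_range[0] <= rec["ts"] <= ts_range[1]
        if in_list or in_range:
            rec["status"] = decision

cell = ("T", "ROW1", "F1", "Q1")
history = [{"cell": cell, "ts": t, "status": "pending"} for t in (1, 2, 3, 4)]

decide(history, cell, "approved", timestamps=[1, 3])   # method 1: explicit list
decide(history, cell, "rejected", ts_range=(4, 4))     # method 2: range
print([r["status"] for r in history])
# ['approved', 'pending', 'approved', 'rejected']
```

Guarding on `status != "pending"` reflects that only pending records are subject to a PI decision; already-approved or already-rejected versions are left untouched.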

3.5 The UPA modes

For the original data table in which the cell is created, we distinguish in UPA two important values of each data cell: the last inserted value and the last approved value. Since these two values are the ones most frequently queried by users, we proposed in UPA three design alternatives, as follows:

  • If users are mostly querying the most recently inserted value of the cell, the cell will operate in UPA Mode 1, in which only the most recently inserted value of the cell is saved in the original table. Each newly inserted value will overwrite the existing value in the original table. All other older values, whether pending, approved, or rejected, are stored in the History table.

  • If users are mostly querying the most recently approved value, then this value is saved in the original table. Each newly approved value will overwrite the previous approved value. At the same time, all the more recent updates to the cell, which are still pending, and all previous values of the cell, whether approved or rejected, are stored in the History table. The cell will be operating in UPA Mode 2.

  • If users heavily query both the most recently inserted and the most recently approved values, then both values will be saved separately in the original table, while all other versions of the cell are stored in the History table. The cell will be operating in UPA Mode 3.

Using these alternative UPA Modes enables users that query the most recent value of a cell (or the most recent approved value) to obtain the result directly from the original table in an efficient way without having to access the history table. Users interested in History Tracking should query the data in the History table that resides in a separate storage entity. Regardless of the UPA Mode of the cell, each time an operation is conducted on the cell, a history record that contains all necessary metadata is written to the History table.

The UPA system dynamically changes the operation of a cell from one UPA Mode to another as follows. For each cell, the system maintains three parameters: a mode variable, the cost rate for searching for the most recently inserted value, and the cost rate for searching for the most recently approved value. A cost rate is calculated by dividing the total cost over a period of time by the length of that period in seconds. To calculate the total cost within a time period, the system periodically estimates, from the user queries, the average delay of a query that searches for a value in the original table (\(S_O\)) and the average delay of a query that searches for a value in the History table (\(S_H\)). Depending on the current UPA Mode of the cell, the system calculates the total cost by multiplying \(S_O\) or \(S_H\) by the number of corresponding search queries. For example, suppose the cell is currently operating in UPA Mode 1. The system then calculates the total cost for searching for the most recently inserted value by multiplying the number of such queries within the observed time period (\(t_i\)) by \(S_O\), and the total cost for searching for the most recently approved value by multiplying the number of such queries within \(t_i\) by \(S_H\).

When the cost rate for searching for the most recently inserted value is much higher than that for the most recently approved value, the system sets mode to 1 so that the cell operates in UPA Mode 1. If the cost rate for the most recently approved value is much higher than that for the most recently inserted value, mode is set to 2. If the two cost rates are close, mode is set to 3. All UPA operations (update, delete, approve, ...) read the mode variable before execution to determine the current UPA Mode of the cell and perform the operation accordingly. For example, when the UPA Get operation searches for the last inserted value, it examines mode: if mode is 1 or 3, it searches in the original table of the cell; if mode is 2, it searches in the History table.
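The mode-selection rule can be sketched as a pure function of the observed workload. The threshold below (treating "much higher" as a 2x ratio) is an assumption; the paper does not fix a specific value.

```python
# Sketch of the UPA mode-selection rule. Treating "much higher" as a 2x
# ratio is an assumption; the paper does not fix the threshold.
def choose_mode(n_inserted_queries, n_approved_queries,
                s_original, s_history, period_seconds,
                current_mode, ratio=2.0):
    # The cost of each query type depends on where its target value lives
    # under the *current* mode of the cell.
    if current_mode == 1:          # last inserted value is in the original table
        cost_ins = n_inserted_queries * s_original
        cost_app = n_approved_queries * s_history
    elif current_mode == 2:        # last approved value is in the original table
        cost_ins = n_inserted_queries * s_history
        cost_app = n_approved_queries * s_original
    else:                          # mode 3: both values are in the original table
        cost_ins = n_inserted_queries * s_original
        cost_app = n_approved_queries * s_original
    rate_ins = cost_ins / period_seconds   # cost rate = total cost / period length
    rate_app = cost_app / period_seconds
    if rate_ins > ratio * rate_app:
        return 1
    if rate_app > ratio * rate_ins:
        return 2
    return 3

# 900 queries for the last inserted value vs. 30 for the last approved one,
# while the cell currently runs in Mode 2:
print(choose_mode(900, 30, s_original=0.01, s_history=0.05,
                  period_seconds=60, current_mode=2))  # -> 1
```

With a heavily skewed workload, the rule moves the hot value into the original table; when the two rates are comparable, Mode 3 pays the extra storage to serve both values cheaply.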

4 UPA query interface language

For each new version of a data cell, a status field is assigned to the version in the History table as part of the record of that version. As explained before, when the version is inserted, its status is set to pending, meaning that the approval of this version is pending the PI’s decision. Once the PI decides whether this version is valid or not, he/she will either approve or reject the version. If the PI approves the version, its status is changed to approved. Else, if the PI rejects the version, its status is set to rejected. Hence, we can distinguish between three disjoint sets of versions of each cell: the pending set, the approved set, and the rejected set.

4.1 UPA keywords and options

4.1.1 Definitions of UPA keywords

In order to enable the user to easily query a specific version or a specific set of versions of a cell, we define a list of keywords that we will be using in our Query Interface Language (QIL). We build our QIL on top of SQL, which is currently used by several systems, such as Hive [23], Impala [24], and Phoenix [25], to query HBase. We divide our list of keywords into two distinct parts: the first part defines the location of a version within the timeline of its set, while the second part defines the status of the version. Using this definition, we define four keywords in the first part: first, last, any, and all. For the second part, we have three keywords: pending, approved, and rejected. The user can combine any keyword from the first part with any keyword from the second part to specify a certain version. For example, first approved defines the first version that was approved, or the first version in the set of approved versions, any rejected defines a random version in the set of rejected versions, and so on. In addition, each of the seven keywords can be used alone. For example, first alone means the first version of the complete set of versions of the cell, all means the complete set of versions of the cell, pending means the set of pending versions of the cell, and so on. Table 1 illustrates the complete list of possible keywords in UPA, while Fig. 3 illustrates these keywords in a timeline example.
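The keyword space is small enough to enumerate; a quick sketch confirming the count behind Table 1 (twelve combinations plus the seven keywords used alone):

```python
# Sketch: enumerating the UPA keyword space described above.
# 4 location keywords x 3 status keywords = 12 combinations,
# plus each of the 7 base keywords used alone, for 19 keywords in total.

LOCATION = ["first", "last", "any", "all"]
STATUS = ["pending", "approved", "rejected"]

combined = [f"{loc} {st}" for loc in LOCATION for st in STATUS]
alone = LOCATION + STATUS
upa_keywords = combined + alone

print(len(upa_keywords))  # 19
```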

Table 1 Definitions of the UPA keywords

As stated in Sect. 1, the UPA keywords are used within an ordinary SQL query to access a certain version or set of versions of a data item. Among all these versions, the two most frequently searched for by users are the last and the last approved versions. COACT allows accessing these two versions directly from the original table, without accessing the history of the cell in the History table (Sect. 3). For example, a user who wants to update a wiki page will retrieve the last approved version of the page to acquire knowledge about the page data that was accepted by all collaborators, and will also retrieve the last version or the set of pending versions to check what other collaborators have recently worked on. As we will explain in Sect. 4.3, COACT retrieves a certain version from the original or History table based on the UPA keywords in the query. Other versions of the cell are also requested in various cases, as illustrated by the examples in Sect. 1. For example, if a collaborator wants to calculate statistics about the reject rate of a certain wiki page, he can do so in COACT by counting the number of versions in the rejected set and dividing that by the number of versions in the all set; hence, he will be using the UPA keywords rejected and all in his SQL query. In general, all UPA keywords are needed in various cases. In this section and in Sect. 5, we present several examples that illustrate the usage of the different UPA keywords.

Fig. 3  Distribution of UPA keywords among the different versions of a cell

4.1.2 General definitions of UPA options

An important aspect of querying the UPA system is to take into consideration that if pending values exist in the cells that are queried, the results of a query might change after these versions are approved or rejected. For example, consider the following query:

figure a

This query retrieves the value of the last approved version of each birth_date cell for each Row where the value of the last inserted version of the salary is greater than 300. Suppose that for one of the employees (John), the last approved salary is 250, but John's salary was updated to 350 and this new value is still pending the PI's approval. If we execute this query, John's birth date will be included in the result, since the last inserted value of John's salary satisfies the condition of the query. Now suppose that the PI rejects the salary update. In this case John's salary reverts to 250, and executing the query will no longer return his birth date. To enable the user to query the set of versions of a cell while taking into consideration that some cells may be false positives or false negatives (because their values will change after the PI approves/rejects pending versions), we define the following three UPA options that can be added to any condition within a UPA query:

  • True values A cell satisfies the true values condition if the value(s) of the queried version(s) of the cell satisfy the required condition, and no possible approval or rejection of pending versions can make the cell stop satisfying the condition.

  • False positives A cell is considered a false positive if the value(s) of the queried version(s) of the cell satisfy the required condition, but there are existing pending versions that could possibly make the cell not satisfy the condition after they are approved or rejected.

  • False negatives A cell is considered a false negative if the value(s) of the queried version(s) of the cell do not satisfy the required condition, but there are existing pending versions that could possibly make the cell satisfy the condition after they are approved or rejected.
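The salary example above can be replayed in a few lines (illustrative data and helper functions, not the system's storage format):

```python
# Sketch: how a query result can change once a pending version is rejected.
# A cell is modelled as a list of (value, status) pairs in insertion order.

def last_inserted(versions):
    """Value of the most recently inserted version, regardless of status."""
    return versions[-1][0] if versions else None

def last_approved(versions):
    """Value of the most recently approved version, if any."""
    approved = [v for v, status in versions if status == "approved"]
    return approved[-1] if approved else None

# John's salary cell: approved 250, then a pending update to 350.
salary = [(250, "approved"), (350, "pending")]

# Condition "last salary > 300": John is currently in the result set...
assert last_inserted(salary) > 300

# ...but if the PI rejects the pending update, the cell reverts to 250
# and the condition fails, so John was a false positive for this query.
salary_after_reject = [(250, "approved")]
assert not (last_inserted(salary_after_reject) > 300)
```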

Fig. 4  Illustrative examples of a cell that is (a) a true value, (b) a false positive, and (c) a false negative when using the last keyword

4.1.3 Definitions of UPA options for specific UPA keywords

Note that even though the general definition of the three UPA options is the same, their implementations differ from one case to another, depending on the UPA keyword that is used with the attribute of the condition. For example, consider the condition: “last age < 18”. Examples of true value, false positive, and false negative cells for this condition are shown in Fig. 4. The first example of true values (Fig. 4a) is a cell in which the last version is approved and satisfies the condition. The second example of true values is a cell in which the values of the last approved version and all pending versions satisfy the condition. In this last case the cell will continue to satisfy the condition, regardless of whether Version 2 (value = 14) and Version 3 (value = 15) are approved or rejected. With respect to false positives, an example is a cell whose last pending value (Version 3 = 15) satisfies the condition while the last approved version (Version 1 = 19) does not. In this case, if Version 3 is later rejected by the PI, the cell will no longer satisfy the condition “last age < 18”. Another example of a false positive is a cell in which one of the pending versions does not satisfy the condition, such as Example 2 in Fig. 4b. In this case, the last version (Version 3 = 15) satisfies the condition, but the previous version (Version 2 = 22) does not. So if the PI approves Version 2 and then rejects Version 3, the cell will no longer satisfy the condition.

Finally, consider the false negative examples for the condition “last age < 18”. The first example is a cell where the value of the last pending version does not satisfy the condition (Version 3 = 22), while the value of the last approved version (Version 1 = 12) does. Even though the last version of the cell does not satisfy the condition, the cell is a false negative: if the PI rejects Version 3, the last version of the cell becomes Version 1, which satisfies the condition. Another example is a cell where the last pending version does not satisfy the condition but one of the previous pending versions does. So, in Example 2 in Fig. 4c, if the PI approves Version 2 (value = 15) and rejects Version 3 (value = 22), the cell will satisfy the condition. Note that in the examples in Fig. 4, the value that would be returned by the condition “last age < 18” is highlighted in bold text.

Fig. 5  Illustrative examples of a cell that is (a) a true value, (b) a false positive, and (c) a false negative when using the last approved keyword

The previous examples are correct when the UPA keyword used with the attribute in the condition is last. If we change the UPA keyword to last approved, some of the examples in Fig. 4 become invalid. Figure 5 shows valid examples of true values, false positives, and false negatives for the condition “last approved age < 18”. For true values, the examples in Fig. 4a remain valid for this condition. However, Example 1 in Fig. 4b cannot be considered a false positive for the condition “last approved age < 18”; rather, it is a false negative for this condition, while Example 2 in Fig. 4b remains a valid false positive. Similarly for false negatives, Example 1 in Fig. 4c is not a false negative for the condition “last approved age < 18”, while Example 2 is, as illustrated by similar examples in Fig. 5.

After observing that the general definitions of true values, false positives, and false negatives are not enough to distinguish these cases when using different UPA keywords, we follow another approach. First, after examining the list of UPA keywords, we found that these definitions are valid only for some keywords and do not apply to many of the others. For example, suppose that the UPA keyword used with the attribute is first approved, which returns the first (oldest) version that was approved. With this keyword, the result will not change if the existing pending versions are approved or rejected, because the first approved version remains the same regardless of any pending versions in the cell. Hence, the definitions of true values, false positives, and false negatives do not apply when using the first approved keyword.

Therefore, we studied, for each UPA keyword, whether the definitions of true values, false positives, and false negatives apply. The keywords to which these definitions apply are: last, last approved, last rejected, and first pending. For each keyword in this list, we identify its specific definition of true values, false positives, and false negatives. For example, for the last keyword, we give the following definitions:

  • True values A cell is a true value if the last approved value satisfies the required condition, and all pending values (if any) also satisfy the required condition.

  • False positives A cell is considered a false positive if the last pending value satisfies the required condition, but one of the previous pending values, and/or the last approved value doesn’t satisfy the condition.

  • False negatives A cell is considered a false negative if the last pending value does not satisfy the required condition, but one of the previous pending values, and/or the last approved value satisfies the condition.

In general, suppose that the last approved version of the cell is \(l_a\), the last pending version is \(l_p\), and the set of pending versions of the cell is \(S_p\). Also, suppose that the condition that precedes the UPA option is C (for example, “salary > 300”). Then the UPA options for the last keyword can be formally expressed as:

  • True values \(\{\, C(l_a) = \text{true} \;\wedge\; \forall v \in S_p: C(v) = \text{true} \,\}\)

  • False positives \(\{\, C(l_p) = \text{true} \;\wedge\; \big( (\exists v \in S_p: C(v) = \text{false}) \vee C(l_a) = \text{false} \big) \,\}\)

  • False negatives \(\{\, C(l_p) = \text{false} \;\wedge\; \big( (\exists v \in S_p: C(v) = \text{true}) \vee C(l_a) = \text{true} \big) \,\}\)
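These set expressions translate directly into code. A minimal Python sketch, where the condition C is passed as a predicate and \(S_p\) must be non-empty for the false positive/negative checks (the definitions presuppose pending versions exist):

```python
# Sketch: the three UPA options for the `last` keyword, translated from the
# set expressions above. l_a is the last approved value; S_p is the list of
# pending values in insertion order, so the last pending value l_p is S_p[-1].

def true_value(C, l_a, S_p):
    return C(l_a) and all(C(v) for v in S_p)

def false_positive(C, l_a, S_p):
    if not S_p:          # no pending versions: the cell cannot flip later
        return False
    l_p = S_p[-1]
    return C(l_p) and (any(not C(v) for v in S_p) or not C(l_a))

def false_negative(C, l_a, S_p):
    if not S_p:
        return False
    l_p = S_p[-1]
    return (not C(l_p)) and (any(C(v) for v in S_p) or C(l_a))

# The Fig. 4 condition "last age < 18":
C = lambda v: v < 18
assert true_value(C, 12, [14, 15])     # all pending values satisfy C
assert false_positive(C, 19, [15])     # l_p satisfies C but l_a does not
assert false_negative(C, 12, [22])     # l_p fails C but l_a satisfies it
```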

Similar to these definitions, we can specify the definitions of true values, false positives, and false negatives for the other three UPA keywords, which are last approved, last rejected, and first pending.

4.2 Using UPA keywords and options within SQL queries

In this section, we describe how the UPA keywords (Table 1) and the three UPA options (true values, false positives, and false negatives) can be used in the UPA Query Interface Language. First, a UPA keyword can be used before any attribute name in the query to specify a specific version or set of versions of this attribute. For example, consider the query: “select last salary from Employee”. In this case, the system gets, for each Row in the Employee table, the last version of the salary cell (if it exists). Another example is using a UPA keyword before an attribute while joining two tables. For example, consider the query:

figure b

In this query the two tables (Employee, Address) are joined based on the condition that the last approved value of the examined userID cell from the Employee table is equal to the last approved value of the examined userID cell in the Address table. For example, suppose that in the Employee table, within the Row \(R_{1}\), the last approved value of the userID cell is \(v_{1}\). Also, in the Address table, within Row \(R_{2}\), the last approved value of the userID cell is \(v_{1}\). Hence, \(R_{1}\) and \(R_{2}\) are joined, and the query returns the last inserted value of the Salary cell within \(R_{1}\) (if any).

A third example of using UPA keywords is with the attribute in a condition within a predicate. For example, the query “select last salary from Employee where pending age > 10” searches for age cells in which all pending values are greater than 10. For each such cell, the query retrieves the last inserted value of the salary cell within the same Row (if any).
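This all-pending semantics can be checked with a one-line sketch (illustrative, not the system's actual evaluator):

```python
# Sketch: the condition "pending age > 10" holds for a cell only when every
# pending version satisfies it, per the description above.

def pending_condition(pending_values, cond):
    """True iff all pending versions of the cell satisfy `cond`."""
    return all(cond(v) for v in pending_values)

assert pending_condition([12, 15], lambda v: v > 10)      # all pending > 10
assert not pending_condition([12, 8], lambda v: v > 10)   # one pending fails
```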

With respect to the three UPA options, they can be used as an optional addition at the end of a condition, if the UPA keyword that is used with the attribute in this condition belongs to the set {last, last approved, last rejected, first pending}. Note that when using a UPA option, the user can combine two or more UPA options together using logical AND, OR, or NOT SQL keywords. For example, the UPA option “true values OR false positives” returns the union of results that will be returned when applying true values alone and false positives alone. Another example is [(NOT true values) AND (NOT false positives) AND (NOT false negatives)]. In this case the system will calculate the results that will not be returned when applying true values, the results that will not be returned when applying false positives, and the results that will not be returned when applying false negatives, and will take the intersection of the three sets of results as the result of the above UPA option.
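The set-algebra reading of these combinations can be sketched as follows (cell names and option result sets are illustrative):

```python
# Sketch: combining UPA options with NOT/AND/OR, evaluated per cell. Each
# option is modelled as the set of cells it returns on its own; a combination
# is then ordinary set algebra over those sets.

all_cells = {"c1", "c2", "c3", "c4"}
tv = {"c1"}   # cells returned by `true values`
fp = {"c2"}   # cells returned by `false positives`
fn = {"c3"}   # cells returned by `false negatives`

# "true values OR false positives" = union of the two result sets:
assert tv | fp == {"c1", "c2"}

# "(NOT true values) AND (NOT false positives) AND (NOT false negatives)"
# = intersection of the three complements:
neither = (all_cells - tv) & (all_cells - fp) & (all_cells - fn)
assert neither == {"c4"}
```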

Note that in some cases, the user might query the UPA system without using UPA keywords and/or UPA options in the query. In this case the system uses the default UPA keyword, last, whenever an attribute is not preceded by a UPA keyword. Similarly, the system uses the true values option as the default UPA option whenever the user applies one of the keywords last, last approved, last rejected, or first pending to the attribute in a condition without specifying a UPA option. The UPA system also allows the user to change the default UPA keyword and the default UPA option by means of the keywords UPAkeyword and UPAoption, which were added to the UPA QIL. For example, consider the following query, where the user wants to use the last approved keyword as the default UPA keyword for all attributes and the {true values OR false positives} option as the default UPA option for all conditions:

figure c

In this query, the user specifies that the UPA Keyword for {salary, age, name} is last approved, and the UPA option for both conditions (age > 10) and (name = ‘John’) is {true values OR false positives}.

Fig. 6  General syntax of a UPA query, which includes SQL keywords, UPA keywords, and UPA options

In a nutshell, Fig. 6 illustrates the general syntax of a UPA query, which consists of six main keywords: Select, From, Where, Order By, UPAkeyword, and UPAoption. The general structure of a UPA query is similar to that of an SQL query, with the exception that a UPA_keyword may be added before an expression and a UPA_option at the end of each predicate. In Fig. 6, an expression can be a column name (i.e., an HBase qualifier), a constant, a function, a scalar subquery, or any combination of column names, constants, and functions connected by one or more operators, or a subquery. If the expression is a column name, a function, or a combination of column names and functions (possibly with constants and/or scalar subqueries), then a UPA_keyword may be inserted before the expression. However, if the expression consists only of constants and scalar subqueries, a UPA_keyword cannot be added before it. Note that a UPA_keyword is one of the nineteen keywords stated in Table 1, while a UPA_option is either true values, false positives, false negatives, or any combination of these three using NOT, AND, or OR. Also note that the current UPA query language does not support advanced SQL features such as Group By and Having clauses. It also does not support using UPA keywords and options within nested subqueries, i.e., with the IN, EXISTS, ALL, and SOME keywords. Expanding the UPA QIL to include these missing features is intended future work.

4.3 Implementing UPA keywords and options in Phoenix

The UPA Query Interface Language (QIL) was implemented as a separate layer on top of Apache Phoenix, a massively parallel relational database engine supporting online transaction processing (OLTP) for Hadoop, with Apache HBase as its backing store. Phoenix enables users to access large datasets in real time and abstracts away the underlying HBase data store by letting the user query the data with standard SQL via a JDBC driver. It also provides features such as secondary indexes that help speed up queries without relying on specific HBase row key designs.

The UPA layer on top of Phoenix, which we refer to as the UPA parser, performs a scan over the UPA query to identify the UPA keywords and options within the query. When the UPA parser detects a UPA keyword or a combination of keyword/option in the query, it calls a reWrite() function that takes as an input the UPA query, the UPA keyword or keyword/option, and a flag that specifies the location of the keyword or keyword/option within the query. The reWrite() function returns a modified query in which the keyword or keyword/option is replaced by one or more nested SQL queries, and/or by a condition (i.e., SQL predicate) that are added to the original query. Each time the reWrite() function returns a modified query Q’, the UPA parser searches Q’ for any UPA keyword or keyword/option. If a keyword or keyword/option is found within Q’, the latter is recursively passed to reWrite(), along with the first keyword or keyword/option that was found in Q’. This recursive operation continues until the returned Q’ does not contain any UPA keywords or keyword/option (stop condition). Hence, the final Q’ is passed to Phoenix for ordinary SQL execution process. The recursive reWrite() function, which constitutes the bulk of the UPA parser, ensures that each UPA keyword and keyword/option pair within the query are rewritten in ordinary SQL format before passing the rewritten query to Phoenix. Examples of UPA queries and the resulting rewritten SQL queries will be described in the next section.
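The driver loop can be sketched as follows; `find_upa` and the toy rewrite function below are illustrative stand-ins for the parser's internals, not the actual COACT API:

```python
# Sketch of the UPA parser's driver loop: repeatedly detect a UPA keyword
# (or keyword/option pair), call the rewrite step, and stop when the query
# contains no more UPA tokens.

def parse_upa(query, find_upa, re_write):
    """Rewrite `query` until it contains no UPA keywords/options."""
    hit = find_upa(query)                       # (keyword, location_flag) or None
    while hit is not None:
        keyword, flag = hit
        query = re_write(query, keyword, flag)  # replace keyword with plain SQL
        hit = find_upa(query)                   # stop condition: no UPA tokens left
    return query                                # final Q', ready for Phoenix

# Toy stand-ins to exercise the loop:
def toy_find(q):
    return ("last approved", "select") if "last approved" in q else None

def toy_rewrite(q, keyword, flag):
    return q.replace("last approved", "/*rewritten*/", 1)

print(parse_upa("select last approved salary from Employee", toy_find, toy_rewrite))
```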

To illustrate the operations of the reWrite() function, consider the UPA query: Q = “select last approved salary from Employee”. When the UPA parser scans Q for UPA keywords or combinations of keyword/option, it detects the last approved UPA keyword. Hence, it passes Q to the reWrite() function, along with the UPA keyword, and a flag that specifies that the UPA keyword is in the select statement. Note that when a UPA option is used after a condition, the UPA keyword that precedes the attribute in the condition and the UPA option are processed together. The first step in the reWrite() function is to check whether the required values should be obtained from the Original table or from the History table. This will depend on the UPA Mode of each of the required cells. In this example, we want the last approved version of each salary cell. If this cell is operating in UPA Mode 2 or 3, the last approved value will be retrieved from the Original table. Hence, the from clause will be rewritten as “from Employee as E”. On the other hand, if the cell is in UPA Mode 1, the last approved value will be retrieved from the History table. Hence, the from clause will be rewritten as “from Employee.History as H” (assuming that the History table of Employee is named in HBase “Employee.History”).

In our example, suppose that the cell is operating in UPA Mode 1; hence, the first rewritten query Q’ will be “select last approved salary from Employee.History as H”. In general, when we need to retrieve a version from the History table, we replace the UPA keyword in the select clause with a nested query and add a condition to the where clause. On the other hand, if the version is to be retrieved from the original table, the UPA keyword is replaced with a condition in the where clause only (i.e., no nested query is needed). The reason is that the History table contains many versions of the cell, which requires a nested query to select the needed version or set of versions according to the UPA keyword, whereas the original table contains only one value, which is accessed directly. The condition added in both cases specifies the required cell among all cells in the table. In our example, we need to access the last approved version of each salary cell. Hence, we need the condition “where H.‘rowId’ like ‘Employee:%:salary:%’ ” to specify the salary cell in the History table, and we need the nested query

figure d

to access the last approved version of each salary cell we are examining in the outer query. The H2.“status” = ‘approved’ condition selects only the approved values of the cell, while the MAX(H2.“timestamp”) expression in the select clause retrieves the version with the maximum timestamp among the approved values, i.e., the last approved value. After rewriting the query by replacing the last approved UPA keyword with this nested query and condition, the rewritten query becomes:

figure e

which does not contain any UPA keywords or keyword/option. Hence, it is passed to Phoenix for ordinary SQL execution.
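The Mode-dependent rewrite rule can be sketched as follows. This is an illustrative reconstruction, not the system's actual output: the History-table SQL follows the running example, while the Mode 2/3 branch simplifies the original-table layout of Sect. 3 to a direct column read.

```python
# Sketch of the rewrite rule for `last approved`: in UPA Mode 1 the value
# comes from the History table (nested MAX-timestamp query plus a rowId
# predicate); in Modes 2 and 3 it sits in the original table, so a direct
# read suffices. The emitted SQL strings are illustrative.

def rewrite_last_approved(table, column, mode):
    if mode == 1:
        history = f"{table}.History"
        return (
            f'select H."value" from {history} as H '
            f"where H.\"rowId\" like '{table}:%:{column}:%' "
            f'and H."timestamp" = '
            f'(select MAX(H2."timestamp") from {history} as H2 '
            f'where H2."rowId" = H."rowId" and H2."status" = \'approved\')'
        )
    # Modes 2 and 3: the last approved value is kept in the original table
    return f'select "{column}" from {table}'

print(rewrite_last_approved("Employee", "salary", 1))
```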

In addition to the UPA parser on top of Phoenix, the UPA QIL uses a set of HBase filters to assist Phoenix in executing the rewritten query. An HBase filter is a piece of code that runs on the server side during an HBase Get or Scan to filter the obtained results. The advantage of HBase filters is that they run as part of the code that computes the query result on the region where the data resides; hence, they provide an efficient way of filtering results while they are being computed. In the UPA system, we implemented a set of filters within HBase that filter the searched versions based on the status field of each version. In our example, when the Phoenix parser reads the (H2.“status” = ‘approved’) predicate within the above query, it adds the Filter_Approved filter to the HBase scan used by the inner query. When Phoenix executes the query, it uses the Filter_Approved filter while scanning the salary versions in the History table to discard any version whose status is not equal to approved. Using HBase filters reduces the overall query execution time compared to checking the status of each version individually.
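Conceptually, the status filter behaves as below. In HBase the real Filter_Approved runs server-side inside the region scan; this Python stand-in only mimics its effect on an already-fetched list of versions:

```python
# Sketch: the effect of a status filter such as Filter_Approved. Each scanned
# version carries a status field; the filter keeps only approved versions.

def filter_approved(scan_results):
    """Keep only versions whose status field equals 'approved'."""
    return [r for r in scan_results if r["status"] == "approved"]

rows = [
    {"timestamp": 1, "value": 250, "status": "approved"},
    {"timestamp": 2, "value": 350, "status": "pending"},
    {"timestamp": 3, "value": 400, "status": "rejected"},
]
assert filter_approved(rows) == [rows[0]]
```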

5 Experimental evaluation

5.1 Testing environment

In order to test the performance of COACT, we created a virtual HBase cluster of twenty-five virtual machines (VMs) divided into two sub-clusters: the first sub-cluster contains fifteen VMs that store the HBase database on which the UPA operations are executed, while the second sub-cluster consists of ten VMs, each hosting a client that executes the UPA Insert and Search queries on the database. In the experiments, we tested two different workloads. In Workload 1, the insertion frequency is high while the search frequency is low: each client issues a UPA Insert query every 10 ms and a UPA Search query (one of the three queries explained below) every 100 ms. Workload 1 represents the case of a new experiment, when data is rapidly generated and stored while searches by users are less frequent because the data is still unstable. Workload 2 simulates the case when the data has stabilized: few updates are made, while searches by users who need the data are more frequent. Hence, in Workload 2 each client issues an Insert query every 100 ms and a Search query every 10 ms. The simulation time for both workloads was set to 10,000 s, and the query delays were recorded during the last 1000 s (after sufficient data had been inserted into the database). The UPA Insert and Search queries were uniformly distributed among the ten clients.

The data used to populate the HBase database was extracted from dumps provided by Wikipedia [26], which give the history of updates performed by Wikipedia users on Wikipedia webpages. We consider each webpage as a data cell and each update on the page as a new version of the data cell that is written as a new record to the History table. The resulting HBase database contains two original tables: Wiki1 and Wiki2. Each table contains three attributes (i.e., data cells): the webpage ID (WebpageID), the webpage size in bytes (size), and the webpage md5 (md5). The Wiki1 table stores updates on webpages that contain images, while Wiki2 stores updates on webpages that contain videos. As explained in Sect. 3, each of Wiki1 and Wiki2 has its own History table that stores the complete update history of each webpage in the original table, while the original table stores the last-approved update, the last-inserted update, or both, based on the webpage's UPA Mode. The data from [26] contain only approved records. For testing purposes, we insert each new record into the database with a pending status, then execute a PI algorithm that randomly approves or rejects pending records. The PI algorithm takes as input the pending and rejection percentages (\(P_p\) and \(P_r\)) and the execution interval \(T_I\), and is executed by the system every \(T_I\) seconds. After the algorithm finishes execution, the records in the database are divided into three sets: a pending set that contains \(P_p\) percent of all records, a rejected set that contains \(P_r\) percent, and an approved set that contains \((100 - P_p - P_r)\) percent of all records in the database. The choice of records to approve or reject is made randomly by the PI algorithm. In the simulations, we set the values of \(P_p\), \(P_r\), and \(T_I\) to 30, 30, and 100, respectively.
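The PI algorithm can be sketched as follows (a minimal reconstruction of the random partitioning described above; function and parameter names are ours):

```python
# Sketch: every T_I seconds, randomly re-partition all records so that
# P_p percent stay pending, P_r percent are rejected, and the remaining
# (100 - P_p - P_r) percent are approved.

import random

def pi_algorithm(record_ids, p_pending, p_rejected, rng=random):
    ids = list(record_ids)
    rng.shuffle(ids)                     # random choice of records per set
    n = len(ids)
    n_p = n * p_pending // 100
    n_r = n * p_rejected // 100
    return {
        "pending": set(ids[:n_p]),
        "rejected": set(ids[n_p:n_p + n_r]),
        "approved": set(ids[n_p + n_r:]),
    }

# With the simulation settings P_p = 30 and P_r = 30:
sets = pi_algorithm(range(100), p_pending=30, p_rejected=30)
assert (len(sets["pending"]), len(sets["rejected"]), len(sets["approved"])) == (30, 30, 40)
```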

The Wikipedia data from [26] is formatted in XML files, so we used the SAX XML parser to extract, for each webpage version, the fields to be inserted into the HBase database, such as the webpage ID, size, md5, and other metadata (user ID, timestamp, comment, etc.), and save these fields in a CSV file. The data in the CSV files is read by the ten clients and inserted into the HBase database. In our experiments, we populated the Wiki1 and Wiki2 tables with 10 million records when testing Workload 1 and 1 million records when testing Workload 2. Hence, considering that a record in Wiki1 or Wiki2 has a total size of 152 bytes, and that each record in Wiki1 and Wiki2 creates 5 records in the corresponding History table, the total size of the database at the end of the simulations is 8.5 GB for Workload 1 and 0.85 GB for Workload 2. Note that the record size was calculated based on the method presented in [27]. Once a record is read from the input file, it is inserted into Wiki1, Wiki2, or both tables according to whether the webpage of the record contains images, videos, or both.

5.2 Tested queries

In this section we present the results of three UPA queries that were executed on the HBase database. The three queries have different complexity levels and were executed with different UPA options. The first query (Query 1) can be described as follows: suppose that the database administrator wants to select all webpages whose last approved update contains videos, and whose last approved size is greater than a certain threshold S, in order to move these webpages into a separate table in the database. The administrator wants to get the webpage ID, size, and md5 of the last update of each such page. The resulting UPA query would be:

figure f

where S is the size threshold. The second tested UPA query (Query 2) is described as follows: suppose that the database administrator wants to select all webpages whose last approved update contains both images and videos, and whose last approved size is greater than a certain threshold S, in order to move these webpages into a separate table in the database. The administrator wants to get the webpage ID, size, and md5 of the last update of each such page. The resulting UPA query would be:

figure g

Finally, the third UPA query (Query 3) can be described as follows: we want to find the webpages that contain both images and videos in any of their approved versions, such that any of the rejected versions of such a webpage has a size greater than S, and the first pending version contains the substring ‘abc’ in its md5. For each such webpage, we want to retrieve all pending values of its WebpageID, size, and md5. The resulting UPA query will be:

figure h
Fig. 7  Illustrative examples of rewritten queries for (a) Query 1 and (b) Query 2 when using S = 1000

For each of the two workloads, we tried six values of the size threshold S: 10, 100, 1000, 10,000, 100,000, and 500,000. Also, each of the three queries was tested with the three UPA options: true values, false positives, and false negatives (by replacing UPA-Option in Queries 1, 2, and 3 with one of these options). Whenever a client wants to issue a UPA Search query, it randomly chooses one of the three queries (Query 1, Query 2, or Query 3). This setup ensures a similar distribution of the three queries among the ten clients. Overall, we executed eighteen scenarios for each workload: for each of the three UPA options, we tested a scenario with S set to each of the six values stated above. Each scenario was repeated ten times, and the averages over the ten runs were calculated. Note that each of the three UPA queries is rewritten by the UPA parser as explained in the previous section. The rewriting stage depends on the UPA Mode of the examined cells. In Fig. 7, we illustrate an example of rewriting Query 1 and Query 2 using the true values option, assuming that the cells in Wiki1 and Wiki2 operate in UPA Mode 1. Also in Fig. 7, the value of S in the two queries is 1000, and History1 and History2 are the History tables for Wiki1 and Wiki2, respectively.

Fig. 8  Average query delay for (a) Query 1, (b) Query 2, and (c) Query 3 while running Workload 1 and varying S between 10 and 500,000

Fig. 9  Average query delay for (a) Query 1, (b) Query 2, and (c) Query 3 while running Workload 2 and varying S between 10 and 500,000

5.3 Results

Figure 8a–c illustrates the total query execution delays for Query 1, Query 2, and Query 3, respectively, while running Workload 1 and varying S between 10 and 500,000. The figures also show the query delays for the three UPA options. Note that the results in Figs. 8 and 9 are calculated as the average delay of each query during the last 1000 s of the simulation scenario. From Fig. 8, we notice that the total execution delay of Query 1 ranges between 1 and 3.7 s, that of Query 2 ranges between 1 and 4.4 s, and that of Query 3 ranges between 4.7 and 7 s. This increase in execution delay is expected, since Query 2 is more complex than Query 1, and Query 3 is more complex than Query 2. While Query 1 contains only a simple condition, Query 2 contains a join between the two History tables, History1 and History2. Query 3, in turn, contains a join that is based on comparing two sets of versions, rather than comparing a single version as in Query 2. Also, Query 3 contains two conditions with UPA keywords and options in its where clause, while Query 2 contains a single such condition. Note that the two History tables are approximately equal in size, since the records from the Wikipedia files were equally distributed between the original tables (Wiki1 and Wiki2). However, the increase in delay from Query 1 to Query 2 is not linear; i.e., the delay does not double. Also, the execution delays of the three queries converge as S increases, which reflects the stability of the system. For Query 3, the execution delay is much higher than that of Query 1 and Query 2. However, the delays of the three UPA options of Query 3 converge to similar values as S increases. These delays start to converge after S = 10,000, due to the decrease in the number of records in the result, as can be seen in Table 4.
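The averaging step described above (per-query mean delay over the last 1000 s of a run) can be sketched with a small helper. The function name and sample layout are our assumptions, used here only to make the computation concrete:

```python
def avg_last_window(samples, sim_end, window=1000.0):
    """Average per-query delay over the last `window` seconds of a run.

    samples: list of (timestamp_seconds, query_name, delay_seconds) tuples.
    sim_end: timestamp at which the simulation scenario ended.
    """
    per_query = {}
    for t, query, delay in samples:
        if t >= sim_end - window:          # keep only the final window
            per_query.setdefault(query, []).append(delay)
    return {q: sum(d) / len(d) for q, d in per_query.items()}

# Samples before the final 1000 s window are excluded from the average.
print(avg_last_window(
    [(100, "Query 1", 2.0), (2500, "Query 1", 3.0), (2900, "Query 2", 4.0)],
    sim_end=3000))
# → {'Query 1': 3.0, 'Query 2': 4.0}
```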

An important point to notice in Fig. 8a–c is that the delay behavior of true values and false positives is similar for the three queries, while that of false negatives fluctuates as S varies. For example, the delay of false negatives decreases while executing Query 1 and Query 2, but increases while executing Query 3. This is mainly because the UPA option is involved in the condition that contains S in Query 1 and Query 2. In Query 3, on the other hand, the condition that contains S does not involve the UPA option (since the UPA keyword of this condition is any rejected, which does not take a UPA option); the UPA option in Query 3 applies only to the condition “md5 like ‘%abc%’”. This explains the change in the delay behavior of false negatives between {Query 1, Query 2} and Query 3. Also, the delay of true values is generally higher than that of false positives and false negatives for all three queries, which is expected given the much larger result size of true values, as can be seen in Tables 2, 3, 4, 5, 6, and 7. Finally, we notice that the delay behaviors of false positives and false negatives are opposite for the first two queries: the delay of false positives increases as S increases, while that of false negatives decreases. This is expected since their meanings (i.e., false positives and false negatives) are opposite, which intuitively implies that their execution plans will differ.

Table 2 Number of records in the output result of Query 1 while running Workload 1
Table 3 Number of records in the output result of Query 2 while running Workload 1
Table 4 Number of records in the output result of Query 3 while running Workload 1
Table 5 Number of records in the output result of Query 1 while running Workload 2
Table 6 Number of records in the output result of Query 2 while running Workload 2
Table 7 Number of records in the output result of Query 3 while running Workload 2

The results of Workload 2 are illustrated in Fig. 9, with the same parameters that were tested for Workload 1. We notice that the general behavior of the three queries is very similar to that in Workload 1, with the main difference being the obtained values, which are much higher for Workload 2. On average, the delay of Query 1 increases from 2.3 to 8.5 s, that of Query 2 from 3.3 to 14.7 s, and that of Query 3 from 5.8 to 20 s. It is expected that the delay of a query increases when the search frequency increases, due to the higher load on the system that results from the larger number of queries competing to execute.

Finally, we examine the output result sizes of the three queries for the two workloads. The output size in Tables 2, 3, 4, 5, 6, and 7 is the number of records (i.e., rows) returned by the query. We notice that the true values results are generally much larger than those of false positives and false negatives, which reflects the fact that in our database there is a much higher chance of finding webpages that satisfy the true value conditions of the three queries. Also, the output result size of true values decreases as S increases, while the output result sizes of false positives and false negatives are small when S is small, increase with S until they reach a maximum, and then decrease again for high values of S. This behavior is expected: when S is small, most results are true values, and hence the remaining results, which are false positives and false negatives, are few. As S increases, the number of webpages whose size is greater than S decreases, and hence the number of results for all three UPA options decreases for high values of S. Also, we notice that the output sizes of Workload 1 are much larger than those of Workload 2, which is logical since the insertion frequency is higher in Workload 1, resulting in a larger number of records in the database.

In general, we can deduce that the execution delays of the three UPA queries are acceptable, considering that the system is searching within ten million records in Workload 1 and one million records in Workload 2, and that the UPA system creates additional overhead by saving the complete history and metadata of each webpage in the History table. Moreover, the execution delays do not increase linearly; rather, they remain within an acceptable limit.

6 Conclusion

In previous work, we presented a collaborative database system that enables collaboration between users updating the same data items. Our UPA framework handles the approval and rejection of data and enables users to view its status and track its history. In this paper, we presented a high-level SQL-based query interface language for administrators and collaborators to interact with the UPA framework. We presented the basic operations of the system and discussed the definitions and implementation of various UPA keywords and options. We tested the performance of COACT by executing three UPA queries of different complexities on a sample HBase database. The results show an acceptable query execution delay for all queries, and the stability of the system under various values of the tested parameters.