1 Introduction

Descriptions of big data commonly mention its unstructured nature, which has opened the possibility of skipping the conceptual modeling phase. The benefits of database conceptual models have been acknowledged for decades; however, conceptual design for NoSQL repositories is still at a research stage, which can lead to poorly designed systems. Modeling does not aim to enforce structure on the data; rather, it helps to understand how the data is organized for analysis [1].

Document data stores are the second most popular data model [2]. They are similar to the key-value model, with the difference that the value is self-describing, hierarchical, and examinable. The usual practice with document data stores is to skip the conceptual design and go directly to implementation aspects. Even though this practice can give positive results for small systems, it becomes harder to sustain for more complex ones.

In this paper, we propose an extension of UML class diagrams for representing document stores, together with mapping rules, and show their implementation in three document stores [2]. Our approach aims for simplicity and for bridging the gap between academics and practitioners.

This paper is organized as follows: Sect. 2 reviews related work; Sect. 3 introduces our proposal for conceptual modeling; Sect. 4 presents mapping rules for the implementation; Sect. 5 includes a case study using the proposed conceptual model and its implementation in document stores; lastly, Sect. 6 gives conclusions and future work.

2 Related Work

Currently, there are few systematic studies on data modeling for NoSQL databases, e.g., [1, 3]. Some works propose particular solutions that can be used for conceptual modeling of NoSQL databases [4, 5]; however, because these approaches are highly complex, they can be difficult to adopt in real-world applications. On the other hand, several studies refer explicitly to modeling documents in MongoDB using UML notation [6]; other approaches rely on the JSON format to represent the documents for different NoSQL databases [7]. Others develop automatic tools to map from JSON [8] or apply reverse engineering to already deployed systems [9].

In this work, we do not consider performance evaluation; however, this aspect is important and we plan to extend this research in that direction, since existing reports are contradictory. For example, [10] shows better performance for indexed referenced documents than for embedded ones, whereas the results of [11] lead to the opposite conclusion.

3 Conceptual Representation of Document Data Stores

Using a conceptual model in a document data store provides the advantage of representing data in a way that helps to understand, access, and analyze it from the beginning of the implementation process. Without a model, implementers must keep the details of the data “structure” in mind at the implementation level, which can become complex when several document collections are present.

Conceptual modeling is a product-independent design activity that allows its creators to focus on user requirements and then implement the system, provided that adequate logical/physical mapping rules are established [12]. The proposed conceptual model uses the UML class diagram in a way similar to how conceptual modeling is done for relational databases.

3.1 Document with Fields

A document is the main element and presents a set of data in an organized form, even though its structure can differ from that of other documents. Each document contains fields; one of them is reserved for the document id. Since documents can have different fields, we propose to choose as the representative document the one that includes all fields, marking some of them as optional. Figure 1b shows an example where two fields are included in all documents (movieId and movieTitle) and one is optional (language); an optional field is indicated by the symbol “~” before its name. Other fields group elements in an array, e.g., genres; this data type and its cardinalities are indicated in square brackets.

Fig. 1. Document representation: (a) collection, (b) document itself, and (c) an embedded one.
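
The idea of a representative document can be illustrated with a minimal Python sketch. It assumes documents are plain dictionaries; the helper name `representative_fields` and the sample values are hypothetical and only serve the illustration. Fields found in every document are kept as they are, and fields missing from at least one document receive the “~” prefix.

```python
from typing import Any


def representative_fields(documents: list[dict[str, Any]]) -> list[str]:
    """Return the field names of a representative document.

    Fields present in every document are listed as-is; fields missing from
    at least one document are prefixed with "~" to mark them as optional.
    """
    if not documents:
        return []
    # Keep first-seen order so the result is stable and readable.
    ordered: list[str] = []
    for doc in documents:
        for name in doc:
            if name not in ordered:
                ordered.append(name)
    return [
        name if all(name in doc for doc in documents) else f"~{name}"
        for name in ordered
    ]


# Hypothetical sample documents (values are illustrative only).
movies = [
    {"movieId": 1, "movieTitle": "Movie A", "genres": ["Drama"]},
    {"movieId": 2, "movieTitle": "Movie B", "language": "en",
     "genres": ["Comedy", "Crime"]},
]

fields = representative_fields(movies)
print(fields)
```

In this sample only language is reported as optional, which matches the notation described for Fig. 1b.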

3.2 Document Collection

A collection represents a grouping of similar documents. Compared to relational databases, a document corresponds to a row and a collection of documents to a table. Figure 1a shows a graphical representation of a collection using the UML package. We use the symbol of the containment relationship (⨁) to indicate that documents form part of a collection, i.e., their membership [13].

3.3 Embedded Documents

A field in a document can refer to another document, forming nested documents. We propose two different UML notations to represent this: a composition relationship and an aggregation relationship. Figure 1c shows a general form for representing the composition relationship with an example of movies and their ratings. The embedded document includes its name, its multiplicity (0..* in Fig. 1c), and the specification of its fields. This kind of relationship is required when the existence of a nested document depends on its container document, e.g., a movie rating is part of a specific movie. Our approach differs from [6], since the latter additionally requires class inheritance, which unnecessarily increases the complexity of the model.

On the other hand, when a nested document depends on its container but could, if needed for a later extension, be converted to a standalone document, we propose to use the aggregation relationship (♢); e.g., the movie storyline is closely related to the movie but not strictly dependent on it.

3.4 Referenced Documents

When a collection is related to two or more other collections, it is necessary to define a relationship between them to avoid data repetition. To represent this relationship, we propose to use a bi-directional association, as can be seen in Fig. 2 (the Directed by relationship). This representation includes multiplicity values in the (min, max) form: the min value indicates the minimum number of documents that must participate in the association, and the max value indicates the maximum number of documents from one collection that can be associated with a document from the other collection.

Fig. 2. An example of a referenced collection.

Figure 2 shows two document collections representing movies and directors. Since a movie can be directed by one or more directors and a director can direct several movies, the cardinality is many-to-many (indicated by the * symbol). Further, not every movie has a specified director (min value is 0), but every director is associated with at least one movie (min value is 1).
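
The (min, max) multiplicities of Fig. 2 can also be checked against actual data. The following is a minimal Python sketch, assuming the association is available as lists of identifiers; the helper `within` and all sample identifiers are hypothetical and not part of the proposed notation.

```python
from typing import Optional


def within(count: int, minimum: int, maximum: Optional[int]) -> bool:
    """Check a participation count against a (min, max) multiplicity.

    A maximum of None stands for the "*" (unbounded) symbol.
    """
    return count >= minimum and (maximum is None or count <= maximum)


# Hypothetical data: movie id -> ids of its directors.
directed_by = {"m1": ["d1", "d2"], "m2": [], "m3": ["d1"]}
directors = ["d1", "d2"]

# Each movie takes part in (0, *) "Directed by" links.
movies_ok = all(within(len(ds), 0, None) for ds in directed_by.values())

# Each director takes part in (1, *) links, i.e. directs at least one movie.
movies_per_director = {
    d: sum(d in ds for ds in directed_by.values()) for d in directors
}
directors_ok = all(within(n, 1, None) for n in movies_per_director.values())

print(movies_ok, directors_ok, movies_per_director)
```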

4 Mapping Rules

After outlining the conceptual proposal, we define the following mapping rules using the JSON format. This representation serves as an intermediate stage in the design of the data store, similar to the logical representation in relational databases, which allows one to specify relations before their deployment in a particular DBMS. The translation from JSON to the physical level of a specific system is a straightforward task and may consider other aspects, such as indexing, sharding, and replication (if available), among other features. Furthermore, it is possible to automate the mapping process from UML to JSON based on already existing tools, such as crowd [8].

4.1 Document with Fields and Document Collection

To represent a document, we use the JSON specification, as shown in Fig. 3 for the conceptual representation of a document in Fig. 1. Each field is represented in a key-value fashion, with the key (field name) between quotes, followed by the associated value. To represent an array (genres in the figure), values separated by commas are enclosed in square brackets. Additionally, a JSON file can include many documents arranged in an array, forming a collection.

Fig. 3. JSON file representing a document from Fig. 1.
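
As an illustration of this mapping, the sketch below builds two documents in Python and serializes them as a JSON array that plays the role of a collection. The field names follow the example of Fig. 1; all values are hypothetical, and the output is not meant to reproduce Fig. 3.

```python
import json

# Hypothetical values; field names follow the example of Fig. 1.
movie = {
    "movieId": 1,
    "movieTitle": "Movie A",
    "language": "en",               # optional field, may be absent
    "genres": ["Drama", "Crime"],   # array field
}
other_movie = {"movieId": 2, "movieTitle": "Movie B", "genres": ["Comedy"]}

# A collection is mapped to a JSON array of documents.
collection = [movie, other_movie]

serialized = json.dumps(collection, indent=2)
print(serialized)

# Reading the JSON text back yields the same documents.
restored = json.loads(serialized)
```

The second document omits language, which is acceptable because that field is optional in the conceptual schema.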

4.2 Embedded Documents

The mapping of embedded documents is based on the commonly known principles used for object-relational databases [14]. Even though document data stores do not belong to this group, general practice demonstrates the use of this mapping [13]. The following rules are applied according to the multiplicity shown on the conceptual level:

  • (0..1) or (1..1): indicates the existence of none or only one embedded document; this document can be represented as such or its fields can be merged with the fields of the main document.

  • (0..*) or (1..*): indicates the existence of none, one, or many embedded documents; these nested documents are organized as an array stored in one field of the main document. Each related document is an element of the array.

Notice that we do not consider a many-to-many relationship between main and embedded documents, since it would indicate that some “external” documents reference an embedded document. If an embedded document must be accessed by other “external” documents, it should instead be modeled as a collection of documents with the corresponding association relationship.
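
The two multiplicity rules above can be sketched in Python as follows. This is only one possible reading of the rules, assuming documents are dictionaries; the function `embed`, its parameters, and the rating and storyline fields are hypothetical. Minimum multiplicities (e.g., 1..*) are not enforced in this sketch.

```python
from typing import Any


def embed(main: dict[str, Any], field: str, nested: list[dict[str, Any]],
          many: bool, merge: bool = False) -> dict[str, Any]:
    """Return a copy of `main` with `nested` documents embedded under `field`.

    many=True  -> (0..*) or (1..*): nested documents are stored as an array.
    many=False -> (0..1) or (1..1): at most one nested document, stored as a
                  sub-document or, if merge=True, merged into the main fields.
    """
    result = dict(main)
    if many:
        result[field] = list(nested)
        return result
    if len(nested) > 1:
        raise ValueError(f"{field}: at most one embedded document is allowed")
    if not nested:
        return result  # the optional embedded document is absent
    if merge:
        clash = set(result) & set(nested[0])
        if clash:
            raise ValueError(f"cannot merge, duplicate fields: {sorted(clash)}")
        result.update(nested[0])
    else:
        result[field] = nested[0]
    return result


# Hypothetical field names and values, following the movie example.
movie = {"movieId": 1, "movieTitle": "Movie A"}
movie = embed(movie, "ratings", [{"userId": 7, "rating": 4.5},
                                 {"userId": 9, "rating": 3.0}], many=True)
movie = embed(movie, "storyline", [{"summary": "A short plot."}], many=False)
print(movie)
```

Merging the fields of a single embedded document into the main document (merge=True) saves one nesting level, at the cost of losing the explicit grouping of those fields.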

4.3 Referenced Documents

Mapping of referenced documents, similar to the previous case, is based on known principles from object-relational databases [14] according to the following rules:

  • One-to-one cardinality: the document key is included as a field in another document.

  • One-to-many cardinality is mapped in two ways: each document on the n-side cardinality stores the key of the document from the one-side cardinality or each document on the one-side cardinality stores an array of keys from the n-side cardinality.

  • Many-to-many cardinality is also mapped in two ways: documents in one or both collections include an array of identifiers of the corresponding documents from the other collection.

Figure 4 illustrates the case of two documents with a many-to-many cardinality between collections of documents: Fig. 4a represents a movie with an array of corresponding director IDs, while Fig. 4b includes arrays with the associated movie IDs.

Fig. 4. Implementation in JSON for referenced many-to-many documents.
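
A many-to-many reference stored on both sides can be sketched in Python as shown next. The field names directorIds and movieIds, the identifiers, and the helpers `resolve` and `is_symmetric` are assumptions made for this illustration and are not taken from Fig. 4.

```python
# Hypothetical documents keyed by identifier (values are illustrative only).
movies = {
    "m1": {"movieId": "m1", "movieTitle": "Movie A", "directorIds": ["d1", "d2"]},
    "m2": {"movieId": "m2", "movieTitle": "Movie B", "directorIds": ["d1"]},
}
directors = {
    "d1": {"directorId": "d1", "name": "Director X", "movieIds": ["m1", "m2"]},
    "d2": {"directorId": "d2", "name": "Director Y", "movieIds": ["m1"]},
}


def resolve(doc: dict, field: str, target: dict) -> list:
    """Follow an array of identifiers into the referenced collection."""
    return [target[key] for key in doc.get(field, [])]


def is_symmetric(movies: dict, directors: dict) -> bool:
    """Check that both sides of the association list each other."""
    forward = {(m, d) for m, doc in movies.items()
               for d in doc.get("directorIds", [])}
    backward = {(m, d) for d, doc in directors.items()
                for m in doc.get("movieIds", [])}
    return forward == backward


names = [d["name"] for d in resolve(movies["m1"], "directorIds", directors)]
symmetric = is_symmetric(movies, directors)
print(names, symmetric)
```

Storing identifiers on both sides makes navigation possible in either direction, but the two arrays must then be kept consistent on every update, which is what `is_symmetric` checks.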

5 Case Study

After presenting the conceptual model and its mapping rules, we propose a case study to demonstrate the use of the notation and its deployment, following the guidelines in [15].

5.1 Case Delimitation

Objective.

We evaluate the application of the proposed notation in different open-source products for document storage. The main hypothesis is that the implementation is common among systems, without incurring particular additional requirements that could change the proposed conceptual schema.

Related Cases.

Many academic works include examples that use document stores as relational implementations (e.g., [3, 13]) or that address modeling specifically in MongoDB [1].

Methodology.

Using a qualitative approach, we design a conceptual model for a document data store and implement it in selected data stores. We review the raw data [16] to define the requirements for the data store design. Afterward, we develop the schema (Sect. 3) and map it (Sect. 4), showing the implementation differences.

Limitations.

This model does not include some possible system-specific optimizations, such as indexes and buckets, which could be added after the model is deployed.

5.2 Conceptual Representation

Considering Twitter messages, it is possible to identify data referring to users and messages, leading to the conceptual schema shown partially in Fig. 5.

Fig. 5. An extract of a conceptual representation of Twitter data.

As can be seen in Fig. 5, the Tweets and Users collections include their own fields, e.g., tweetId and text, or userId and name, among others. Furthermore, the tweet document refers to two optional embedded documents: Place, which may appear at most once, and Media, which may consist of several documents representing different media types. In addition, the Tweets collection is related to the Users collection through two relationships: Tweet and Retweet. According to the multiplicity shown, not all users publish tweets, but every published tweet must have exactly one associated user. The Retweet association shows that a message can be republished by many users and that these users can republish many messages.

5.3 Applied Mapping for Implementation

MongoDB Implementation.

Figure 6 shows an example of a tweet in MongoDB mapped according to the rules in Sect. 4. The fields of the embedded document Place are included in the main document (lines 5–8), with one optional field (coordinates, line 5). The Media composition is embedded as an array (lines 9–10) according to its multiplicity (only two elements are shown). In addition, userId (line 3) represents the one-to-many association Tweet between the Tweets and Users collections.

Fig. 6. Example in MongoDB of embedded (Place and Media) and referenced (Users) documents from Fig. 5.

CouchDB Implementation.

This store does not include the concept of a collection; therefore, it is necessary to insert the documents in the same database with a field identifying their type, e.g., Tweets or Users. This adaptation is minimal and does not affect the conceptual model. Figure 7 shows an example of a tweet document in CouchDB with the Retweet relationship. This many-to-many cardinality is represented as an array of users (the field retweetUserId, line 8). We also include the field type (line 3) to define the collection of the document.

Fig. 7. Example in CouchDB of documents with many-to-many cardinality.
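
The use of a type field to emulate collections inside a single database can be sketched in plain Python. The sketch does not call any CouchDB API; the helpers `with_type` and `select`, as well as all identifiers and values, are hypothetical. Only the field names type and retweetUserId come from the description of Fig. 7.

```python
from typing import Any


def with_type(doc: dict[str, Any], collection: str) -> dict[str, Any]:
    """Return a copy of the document tagged with its conceptual collection."""
    return {**doc, "type": collection}


def select(database: list[dict[str, Any]], collection: str) -> list[dict[str, Any]]:
    """Return the documents of one conceptual collection."""
    return [doc for doc in database if doc.get("type") == collection]


# One database holds documents of both conceptual collections.
database = [
    with_type({"_id": "t1", "text": "hello",
               "retweetUserId": ["u1", "u2"]}, "Tweets"),
    with_type({"_id": "u1", "name": "User One"}, "Users"),
    with_type({"_id": "u2", "name": "User Two"}, "Users"),
]

tweets = select(database, "Tweets")
users = select(database, "Users")
print(len(tweets), len(users))
```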

ArangoDB Implementation.

ArangoDB supports a user-defined unique _key to identify each document. In addition, it includes the _id field, which is the combination of the collection name and the document key (Fig. 8, lines 1 and 2). Notably, both CouchDB and ArangoDB include a revision value (Fig. 7, line 2 and Fig. 8, line 3) to support concurrency control; this value does not affect the conceptual-level design.

Fig. 8. Example in ArangoDB with keys combination and revision value support.
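
The relation between _key and _id can be illustrated with a short Python sketch. In ArangoDB the _id value is generated by the system as the collection name and the key joined by a slash; the functions `arango_id` and `split_id` below are hypothetical helpers that only mimic this composition, and the key value is illustrative.

```python
def arango_id(collection: str, key: str) -> str:
    """Compose an _id value from a collection name and a document _key."""
    return f"{collection}/{key}"


def split_id(doc_id: str) -> tuple[str, str]:
    """Split an _id value back into (collection name, document key)."""
    collection, _, key = doc_id.partition("/")
    return collection, key


# Hypothetical key; the revision value is assigned by the store and is
# therefore not reproduced here.
tweet = {"_key": "1001", "_id": arango_id("Tweets", "1001")}
print(tweet["_id"], split_id(tweet["_id"]))
```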

6 Conclusions and Future Work

The growing use of NoSQL databases and the increasing amount of data in these repositories make the understanding of their “structure” increasingly difficult. The claim that NoSQL databases, and document data stores in particular, manage semi-structured data opens the possibility of skipping the conceptual phase; however, this phase is important, since it helps in understanding the nature of the data and the relationships between different elements. As a consequence, it facilitates the expression of queries to analyze data. Although it is well known that documents can have different fields, there is a clear tendency to create document collections in order to group “similar” documents.

In this paper, we propose the use of UML class diagrams to represent document stores on a conceptual level. We also include mapping rules that facilitate the implementation of document data stores, showing examples for three data stores. Even though the proposed model and mapping rules can be extended, we expect that the simplicity of this conceptual proposal may appeal to a wide audience of document data store implementers.