
1 Introduction

Big Data involves the management of large datasets that, due to their size and structure, exceed the capabilities of traditional programming tools for collecting, storing, and processing data in a reasonable time. The main Big Data sources are users, applications, services, systems, sensors, and technological devices, among others [18]. All of them contribute to Big Data in the form of documents, images, videos, software, and files in a wide variety of formats. The huge volume and heterogeneity present in Big Data applications add complexity to any engineering process involved.

Currently, different kinds of Big Data applications can be identified, such as Recommendation, Feature Prediction and Pattern Recognition systems [72]. The real-life domains of Big Data applications include smart cities, smart cars, healthcare systems, finance, business intelligence, environmental control, and so on.

The importance and relevance that Big Data is acquiring these days, and the promising future that can be expected in this knowledge area, have been widely discussed. The lack of research on adequate test modelling and coverage analysis for Big Data application systems, together with practitioners' clear demand for a well-defined test coverage criterion, is an important issue [62]. In addition, how to effectively ensure the quality of Big Data applications is still a hot research issue [72].

To define a Big Data quality evaluation system, it is necessary to know which quality models have been investigated and proposed. The remainder of this study is organized as follows: Sect. 2 discusses related papers where similar topics were investigated; Sect. 3 presents the research methodology used to conduct this study; Sect. 4 presents the answers to our research questions; and finally, in Sect. 5, we discuss considerations based on our analysis and the threats to the validity of our mapping study.

2 Related Work

In [52], a review of key non-functional requirements in the domain of Big Data systems is presented, finding more than 40 different quality attributes related to these systems and concluding that non-functional requirements play a vital role in the software architecture of Big Data systems.

[71] presents another review that evaluates the state of the art of proposed Quality of Service (QoS) approaches on the Internet of Things (IoT), where one of the research questions addresses the quality factors that such approaches consider when measuring performance.

The research that comes closest to the present work is [53]. In this paper, a systematic mapping study (SMS) is presented involving concepts such as “quality models”, “quality dimensions” and “machine learning”. Ten papers are selected, in which several quality models are reviewed and a total of 16 quality attributes that have some effect on machine learning systems are presented. Finally, the review is evaluated by conducting a set of interviews with other experts.

3 Methodology

By applying an SMS, we attempt to identify the quality models that have been proposed to evaluate Big Data applications in the last decade, making a distinction between the different types of quality models applied in the context of Big Data applications.

3.1 Definition of the Research Questions

  • RQ1: What quality models related to Big Data have been proposed in the last 10 years?

  • RQ2: For which Big Data context have these quality models been proposed?

  • RQ3: What Big Data quality characteristics are proposed as part of these quality models?

  • RQ4: Have Big Data quality models been proposed to be applied to any type of Big Data application?

3.2 Inclusion/Exclusion Criteria

In Table 1, the pre-defined criteria for inclusion/exclusion of the literature are presented. The papers included were published between 1st January 2010 and 31st December 2020, and their main contribution is the presentation of new or adapted quality models for Big Data applications, or the discussion of existing ones. A small database containing all the documents was prepared, with a special column recording the expected level of correlation between each document and the topic under investigation.

The papers excluded were those duplicated across databases or published in both a journal and a conference on the same topic. Papers that could not answer any of the proposed research questions were equally excluded, as were those that could not be accessed, that required an additional payment to access the full content, or that were not written in English.

Table 1. Inclusion and exclusion criteria

3.3 Search and Selection Process

The search string “Big Data” AND “Quality Model” was defined to obtain papers that correlate these two concepts. After processing the results of these searches, further papers could be analyzed by applying “snowballing”.

Because of the amount of information to be collected, analyzed, and classified, in this paper we focus the search on scientific databases such as SCOPUS and ACM. Figure 1 summarizes the steps conducted in this SMS following the review protocol; finally, a filtered Excel sheet containing the selected primary studies was obtained. The SCOPUS database also indexes publications indexed in other databases such as IEEE and Springer; in such cases, deduplication was performed.

After searching the Scopus and ACM databases, additional papers were included using snowballing. The inclusion/exclusion criteria were applied to the total of 958 papers obtained, and finally 67 papers were selected as the primary studies.

Reviewing the number of citations of the primary studies, five papers stand out from the rest. The most cited, with 121 citations, is related to measuring the quality of Open Government Data using data quality dimensions [65]. With 76 citations, [42] proposes a quality-in-use model, the “3As model”, which involves Contextual Adequacy, Operational Adequacy and Temporal Adequacy. The third most cited paper, with 66 citations, explores Quality-of-Service (QoS) approaches in the context of the Internet of Things (IoT) [71]. [29], with 48 citations, reviews the quality of social media data in Big Data architectures. Finally, [39], with 40 citations, proposes a framework to evaluate the quality of open data portals at a national level.

Fig. 1.
figure 1

Facets inside the search and filtering process.

3.4 Quality Assessment

To assess the quality of the chosen literature, the following parameters were defined as Quality Assessment (QA) questions:

  • QA-1: Are the objectives and the scope clearly defined?

  • QA-2: Do they propose/discuss a quality model or related approaches? (If yes, is the quality model applied to a specific Big Data application?)

  • QA-3: Do they discuss and present quality dimensions/characteristics for a specific purpose?

  • QA-4: Do they provide assessment metrics?

  • QA-5: Were the results compared to other studies?

  • QA-6: Were the results evaluated?

  • QA-7: Do they present open themes for further research?

At this point, the next step was to assess the quality of the selected primary studies; the overall results are presented in Table 2. This process complements the inclusion/exclusion criteria by answering the quality assessment questions described above for each paper. The primary studies were scored to determine how well the seven quality items defined were satisfied. The scoring system was a predefined Y-P-N scale (Y: Yes, P: Partially, N: No), weighted as Y: 1 point, P: 0.5 points, N: 0 points.
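The weighted scoring scheme can be sketched as follows. This is a minimal illustration of the Y-P-N weighting described above; the example paper and its answers are hypothetical, not taken from the study.

```python
# Weighting of the Y-P-N scale described in the text.
SCALE = {"Y": 1.0, "P": 0.5, "N": 0.0}

def score(answers):
    """Sum the weighted answers to the seven QA questions (QA-1 .. QA-7)."""
    return sum(SCALE[a] for a in answers)

# Hypothetical answers for one paper, in QA-1 .. QA-7 order.
paper_answers = ["Y", "Y", "P", "N", "P", "Y", "Y"]
print(score(paper_answers))  # 5.0 out of a maximum of 7
```

A paper satisfying all seven items would thus score 7 points, and one satisfying none would score 0.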

Table 2. Quality assessment overall results

4 Results

The overall results of this SMS are presented in this section. The distribution per document type shows that the vast majority of the documents obtained (95.52%) are conference papers and journal articles.

Regarding the year of publication within the initial range of 2010–2020, a gradual increase in the number of studies published on the related topics can be seen starting from 2014. 67% of all selected studies were published in the last three years (2018–2020) and 88% in the last five years (2016–2020), indicating that the issue of quality models in the context of Big Data is receiving more attention among researchers; if this trend continues, the theme could become one of the hottest research topics.

Regarding the publisher, Fig. 2 shows that most papers were published by Springer (26.87%), IEEE (25.37%), ACM (11.94%), and Elsevier (8.96%), most of them indexed in SCOPUS.

Fig. 2.
figure 2

Paper distribution per publisher

The review conducted allows the research questions to be answered; the findings are presented below.

RQ1:

What quality models related to Big Data have been proposed in the last 10 years?

The study has revealed that 12 different quality model types have been published in the last 10 years; the most common are those related to measuring Data Quality, Service Quality, Big Data Quality and Quality-in-Use. The complete distribution of these quality models can be viewed in Fig. 3. It is not a surprise that the largest number of quality models proposed are those related to measuring the quality of the data, representing almost half of all the models found.

RQ2:

In what Big Data context have these quality models been proposed?

The majority of the quality models proposed can be applied to any Big Data project, without distinguishing between the different types of Big Data applications. Eight possible fields of application were identified, which can be seen in Table 3.

There is a differentiation between general Big Data projects and Open Data projects, mainly because dimensions such as free access, constant availability, data conciseness, data and source reputation, and objectivity, among others, are specifically required in Open Data projects. For Big Data Analytics, Decision Making and Machine Learning projects there is no such great differentiation from other Big Data projects, except where non-functional requirements specific to the intended purpose must be measured.

Fig. 3.
figure 3

Paper distribution per quality model type

Table 3. Quality model distribution per Big Data context

RQ3:

What Big Data quality characteristics are proposed as part of these quality models?

In this case, it is necessary to differentiate among the quality model types, because the authors have identified different quality dimensions depending on the focus of the quality model within the Big Data context.

Data Quality Models:

These are defined as a set of relevant attributes and the relationships between them, which provide a framework for specifying data quality requirements and evaluating data quality. They represent data quality dimensions and the association of such dimensions with data. Good examples of these models are presented in [21, 29, 31, 48, 61, 65]. Figure 4 shows the categories that can be used to group the different data quality dimensions presented in the quality models.

Quality dimensions are presented in 28 of the 33 papers related to data quality models. The most common dimensions for general data quality are:

  • Completeness: characterizes the degree to which data have values for all attributes and instances required in a specific context of use. Also, data completeness is independent of other attributes (data may be complete but inaccurate).

  • Accuracy: characterizes the degree to which data attributes represent the true value of the intended attributes in the real world, like a concept or event in a specific context of use.

  • Consistency: characterizes the degree to which data attributes are not contradicted and are consistent with other data in a context of use.

  • Timeliness: characterizes the degree to which a data attribute reflects the latest state of the real world and is within its period of use.
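To make these dimensions concrete, a metric such as completeness can be computed as the share of required cells that actually hold a value. The following sketch is purely illustrative and not taken from any of the surveyed models; the records and field names are hypothetical.

```python
# Toy dataset: three records with two missing values (None).
records = [
    {"id": 1, "city": "Quito", "year": 2020},
    {"id": 2, "city": None,    "year": 2019},
    {"id": 3, "city": "Lima",  "year": None},
]

def completeness(rows, fields):
    """Share of non-null cells over all required cells (0.0 .. 1.0)."""
    total = len(rows) * len(fields)
    filled = sum(1 for r in rows for f in fields if r.get(f) is not None)
    return filled / total

print(completeness(records, ["id", "city", "year"]))  # 7 of 9 cells, about 0.78
```

Note that this captures the independence remarked above: the `id 2` record is incomplete yet its remaining values may still be perfectly accurate.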

Fig. 4.
figure 4

Categories found in the SMS that group the data quality dimensions presented in the quality models.

In addition, for those quality models focused on measuring the quality of metadata, the quality dimensions identified in [29] are believability, corroboration, coverage, validity, popularity, relevance, and verifiability. Four further dimensions are included for Semantic Data [31]: objectivity, reputation, value added, and appropriate amount of data. For Signal Data [35], other dimensions were identified, such as availability, noise, relevance, traceability, variance, and uniqueness. Finally, two more dimensions were included for Remote Sensing Data [7]: resolution and readability.

It should be noted that the quality dimensions proposed in each of these quality models refer to quality aspects that need to be verified in the specific context of use.

Service Quality Models:

These models describe how to achieve the desired quality in services; they measure the extent to which the service delivered meets the customer’s expectations. Good examples of these models are presented in [8, 30, 32, 36, 41, 64]. The most common quality dimensions collected from those papers are: Reliability, Efficiency, Availability, Portability, Responsiveness, Real-time, Robustness, Scalability, and Throughput.
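Two of these dimensions, availability and throughput, lend themselves to simple ratio metrics, which can be sketched as follows. The monitoring figures below are hypothetical and the definitions are the conventional ones, not formulas taken from the surveyed models.

```python
def availability(up_seconds, total_seconds):
    """Fraction of an observation window during which the service was up."""
    return up_seconds / total_seconds

def throughput(requests_served, window_seconds):
    """Average requests served per second over an observation window."""
    return requests_served / window_seconds

# Hypothetical one-day monitoring sample: 5 minutes of downtime.
print(availability(86100, 86400))  # about 0.9965
print(throughput(43200, 3600))     # 12.0 requests per second over one hour
```

Other dimensions in the list, such as reliability or robustness, typically require richer models (e.g. failure rates under load) rather than a single ratio.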

Quality-In-Use Models:

These define the quality characteristics that datasets used for a specific purpose must present in order to fit that purpose. In this research, two papers were found that present this type of model, [11] and [42]; other papers discuss them. These models focus mainly on two dimensions, Consistency and Adequacy, represented in Fig. 5.

Fig. 5.
figure 5

Quality dimensions presented in Quality-In-Use Models

It should be noted that, depending on the quality characteristics to be evaluated and the context of use, a different model should be applied. In these models, the two dimensions analyzed are presented as:

  • Consistency: The capability of data and systems to keep the uniformity of specific characteristics when datasets are transferred across networks and shared.

  • Adequacy: The state or ability of being good enough or satisfactory for some requirement, purpose or need.

Big Data Systems Quality Models:

There is no general definition for this type of model because of the enormous number of different kinds. In this research, they are defined as quality models applied to the context of Big Data viewed at a high level. Good examples of these models are presented in [28, 37, 50, 56]. The quality dimensions presented in these quality models are specified in Table 4 and can be separated into three groups:

  • Dimensions for Big Data value chain

  • Dimensions for Non-Functional requirements in Big Data Systems

  • Dimensions for measuring Big Data characteristics.

Table 4. Quality dimensions for big data systems quality models

RQ4:

Have Big Data quality models been proposed to be applied to any type of Big Data application or by considering the quality characteristics required in specific types of Big Data applications?

A Big Data application (BDA) processes a large amount of data by integrating platforms, tools, and mechanisms for parallel and distributed processing. As presented in Table 3, the majority of the quality models proposed (74.63%) can be applied to any Big Data project, and only a few studies were developed specifically for Big Data Analytics [66], the Decision Making process [2], or Machine Learning [54, 55].

This could mean that researchers are not interested in developing quality models for specific Big Data applications; instead, general quality models are proposed, focusing on topics such as assuring the quality of the data, the quality-in-use, or the quality of the services involved.

5 Threats to Validity of Our Mapping Study

The main threats to validity of our mapping study are:

  • Selection of search terms and digital libraries. We searched two digital libraries; to complete our study, other libraries should be included, such as IEEE, Springer, and Google Scholar. In addition, because Big Data is an industrial issue, it is recommended to include a grey literature search [24] and conduct a Multivocal Literature Review (MLR).

  • Selection of studies. A better solution could be to apply additional exclusion criteria, such as the quality of the papers; for example, whether the results have been validated and compared with other studies.

  • Quality model categorization. As a result of the small sample of papers in which quality models are not related to data quality, it is an arduous task to obtain representative quality metrics and quality dimensions for those Big Data quality models. By extending the current study, more samples could be obtained to support this task.

  • Data categorization. We included all the categories identified in the primary papers. The extraction and categorization process was carried out by the first author, an MSc student with over five years of work experience in software and data engineering; the other two coauthors provided input to resolve ambiguities during the process. In this respect, the extraction and categorization process is only partially validated.

6 Conclusions and Future Work

An SMS has been conducted to analyze and visualize the different quality models that have been proposed in the context of Big Data and the quality dimensions presented in each type of quality model. Contrary to what might be expected, a considerable number of papers do not present, or only partially discuss, the quality metrics needed to evaluate the quality dimensions proposed in the model. Also, in the majority of studies the results were not analyzed and compared with other similar studies.

This research has revealed that proposing, discussing, and evaluating new quality models in the context of Big Data is a topic that is currently receiving increasing attention from researchers, and given the current trend we should expect an increase in papers related to quality models in the Big Data context in the coming years.

As a first topic for future work, we will consider an in-depth review of the analyzed papers, from which common metrics, quality dimensions and quality model evaluations could be obtained for further analysis of each Big Data quality model type.

In the context of Big Data, most of the proposed quality models are designed for any Big Data application and are not explicit in evaluating a specific type of Big Data application, such as Feature Prediction Systems or Recommenders. Considering their different specificities when assessing the expected quality of the final results of these Big Data applications, we consider this an open research topic.