
1 Introduction

Internet penetration is constantly increasing, as more and more people browse the Web, use email and social networking applications to communicate with each other, or access wireless multimedia services, such as mobile TV [27, 43]. Additionally, several demanding mobile network services are now available that require increased data rates for specific operations, such as synchronizing device storage to cloud computing servers or streaming high-resolution video [34–36]. Access to such a global information and communication infrastructure, along with advances in digital sensors and storage, has created very large amounts of data, such as Internet, sensor, streaming or mobile device data. Moreover, data analysis is the basis for investigation in many fields of knowledge, such as science, engineering and management. Unlike web-based big data, location data is an essential component of mobile big data and is harnessed to optimize and personalize mobile services. Hence, an era is emerging in which data storage and computing become utilities that are ubiquitously available.

Furthermore, algorithms have been developed to connect datasets and enable more sophisticated analysis. With innovations in data architecture on our doorstep, the ‘big data’ paradigm refers to data sets so large and complex (i.e., petabytes and exabytes of data) that traditional data processing systems are inadequate to capture, store and analyze them, while organizations seek to glean intelligence from data and translate it into competitive advantage. As a result, big data needs the additional computing power and storage provided by cloud computing platforms. In this context, cloud providers, such as IBM [23], Google [17], Amazon [2] and Microsoft [38], offer network-accessible storage priced by the gigabyte-month and computing cycles priced by the CPU-hour [8].

Although big data is still in its preliminary stages, comprehensive surveys exist in the literature [1, 9–11, 20, 37, 59]. This survey article aims at providing a holistic perspective on big data and big data-as-a-service (BDaaS) concepts to the research community active on big data-related themes, including a critical revision of the current state-of-the-art techniques, definitions and open research issues. Following this introductory section, Sect. 2 presents related work approaches in the literature, including the architecture and possible impact areas. Section 3 demonstrates the business value and long-term benefits of adopting big data-as-a-service business models and attempts to communicate the findings to non-technical stakeholders, while Sect. 4 points out opportunities, challenges and open research issues in the big data domain. Finally, Sect. 5 concludes this tutorial chapter.

2 Big Data: Background and Architecture

IBM data scientists argue that the key dimensions of big data are the “4Vs”: volume, velocity, variety and veracity [21]. As large and small enterprises constantly attempt to design new products to deal with big data, open source platforms, such as Hadoop [53], provide the opportunity to load, store and query data at massive scale and execute advanced big data analytics in parallel across a distributed cluster. Batch-processing models, such as MapReduce [14], enable the coordination, combination and processing of data from multiple sources. Many big data solutions in the market exploit external information from a range of sources (e.g., social networks) for modelling and sentiment analysis, such as the IBM Social Media Analytics Software as a Service solution [22]. Cloud providers have already begun to establish new data centers for hosting social networking, business, media content or scientific applications and services. In this direction, the selection of the data warehouse technology depends on several factors, such as the volume of data, the speed with which the data is needed, or the kind of analysis to be performed [25]. A conceptual big data warehouse architecture is presented in Fig. 1 [24].
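To make the batch-processing model concrete, the sketch below illustrates the map, shuffle and reduce phases of a MapReduce-style word count in plain Python. This is an in-memory toy, not Hadoop's distributed implementation; all function and variable names are illustrative assumptions.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each group of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data in the cloud", "cloud storage for big data"]
pairs = chain.from_iterable(map_phase(d) for d in documents)
counts = reduce_phase(shuffle(pairs))
# counts["big"] == 2 and counts["cloud"] == 2
```

In a real cluster, the map and reduce phases run in parallel on different nodes and the shuffle moves data across the network; the control flow, however, follows the same three stages.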

Fig. 1. A conceptual big data warehouse architecture

Another significant challenge is the delivery of big data capabilities through the cloud. The adoption of big data-as-a-service (BDaaS) business models enables the effective storage and management of very large data sets and data processing from an outside provider, as well as the exploitation of a full range of analytics capabilities (i.e., data and predictive analytics or business intelligence are provided as service-based applications in the cloud). In this context, Zheng et al. [59] critically review service-generated big data and big data-as-a-service (see Fig. 2), leading to the proposal of an infrastructure that provides functionality for managing and analyzing different types of service-generated big data. A big data-as-a-service framework has also been employed to provide big data services and data analytics results to users, enhance efficiency and reduce cost.

Fig. 2. Service-generated big data and big data-as-a-service as presented by Zheng et al. [59]

Fig. 3. A conceptual architecture of service-oriented decision support systems as presented by Demirkan and Delen [15]

The development of a cloud-supported big data mining platform, which provides statistical and data analytics functions, has also been explored [56]. In this research work, the platform’s architecture is composed of four layers (i.e., infrastructure, virtualization, data set processing and services) and implements the K-means algorithm. A big data analytics-related platform was proposed by Park et al. [40], which includes a CCTV metadata analytics service and aims to manage big data and develop analytics algorithms through collaboration between data owners, scientists and developers. Since modern enterprises request new solutions for enterprise data warehousing (EDW) and business intelligence (BI), a big data provisioning solution was elaborated by Vaquero et al. [55], combining hierarchical and peer-to-peer data distribution techniques to reduce the data loading time into the virtual machines (VMs). The proposed solution includes dynamic topology and software configuration management techniques for better quality of experience (QoE) and reduces the setup time of virtual clusters for data processing in the cloud. A cloud-based big data analytics service provisioning platform, named CLAaaS, has been presented in the literature along with a taxonomy that identifies significant features of workflow systems, such as multi-tenancy for a wide range of analytic tools and back-end data sources, user group customization and web collaboration [60]. An overview of the analytics workflow for big data is shown in Fig. 4 [3]. On the other hand, an admission control and resource scheduling algorithm is examined in another work [58], which satisfies the quality of service requirements of requests, adheres to Service Level Agreement (SLA) guarantees, and improves Analytics-as-a-Service (AaaS) providers’ competitiveness and profitability.
A framework for service-oriented decision support systems (DSS) in the cloud has also been investigated, focusing on the product-oriented decision support systems environment and exploring engineering-related issues [15]. A conceptual architecture of service-oriented decision support systems is shown in Fig. 3.
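The platform in [56] is described only at the architectural level, so the following minimal sketch illustrates the K-means algorithm it implements. The function names, parameters and sample data are assumptions for illustration, not details from the cited work.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's K-means: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster empties out
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids

# Two well-separated groups of points converge to two distinct centroids.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = kmeans(points, k=2)
```

In a cloud-supported platform such as the one surveyed, the assignment step is the natural candidate for parallelization across the cluster, since each point can be assigned independently.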

Fig. 4. Analytics workflow for big data as presented by Assunção et al. [3]

The growth of cloud computing, big data and analytics [52] compels businesses to turn to big data-as-a-service solutions in order to overcome common challenges, such as data storage or processing power. Although there is related work in the literature in the general area of cost-benefit analysis in cloud and mobile cloud computing environments, a research gap is observed regarding the evaluation and classification of big data-as-a-service business models. Several research efforts have been devoted to comparing the monetary cost-benefits of cloud computing with desktop grids [26], examining cost-benefit approaches of using cloud computing to extend the capacity of clusters [13], or calculating the cloud total cost of ownership and utilization cost [30] to evaluate the economic efficiency of the cloud. Finally, novel metrics for predicting and quantifying the technical debt on cloud-based software engineering and at the cloud-based service level were also proposed in the literature from the cost-benefit viewpoint [44, 45], and extended evaluation results are discussed by Skourletopoulos et al. [46].

3 Cloud-Supported Big Data: Towards a Cost-Benefit Analysis Model in Big Data-as-a-Service (BDaaS)

In previous research works, the cloud was considered as a marketplace [7], where the storage and computing capabilities of cloud-based system architectures can be leased out [47, 49, 50]. Likewise, the rise of large data centers has created new business models, in which businesses lease storage in a pay-as-you-go, service-oriented manner [32, 57]. In this direction, the big data-as-a-service (BDaaS) model was introduced in order to provide common big data services, boost efficiency and reduce cost [51]. Communicating the business value and long-term benefits of adopting big data-as-a-service business models, as opposed to conventional high-performance data warehouse appliances, to non-technical stakeholders is imperative. In this book chapter, a brief survey of a novel quantitative, cloud-inspired cost-benefit analysis metric for big data-as-a-service is presented, based on previous research studies in the literature [48]. Hence, the cost analysis (CA) modelling from the conventional data warehouse appliance (DWH) viewpoint takes the following form, with the variables described in Table 1:

$$CA_{i} = 12 \cdot \left( C_{s/m} \cdot S_{max} \right), \quad i \ge 1 \ \text{and} \ S_{curr} \le S_{max}$$
(1)

where,

$$C_{s/m} = C_{s/m(max)} = C_{\alpha/m(max)} + C_{\gamma/m(max)} + C_{\eta/m(max)} + C_{\theta/m(max)} + C_{\kappa/m(max)} + C_{\lambda/m(max)} + C_{\mu/m(max)} + C_{\sigma/m(max)}$$
Table 1 Notations and variable descriptions

As the benefits of cloud computing (i.e., scalability) do not apply to data warehouse appliances, the cost analysis approach adopted in this study does not consider the storage capacity currently used ($S_{curr}$). Therefore, the cost variations due to fluctuations in the demand for storage capacity do not apply as long as $S_{curr} \le S_{max}$, and the true benefits are always zero ($C_{D} = 0$) over the years. If the demand for storage capacity increases such that $S_{curr} > S_{max}$, incremental capacity must be added to the storage systems, with overhead and downtime. In contrast, the cost-benefit analysis modelling from the big data-as-a-service point of view takes the following form during the first year (i.e., Eqs. 2 and 4) and from the second year onwards (i.e., Eqs. 3 and 5):

$$CA_{1} = 12 \cdot \left( C_{s/m} \cdot S_{curr} \right)$$
(2)
$$CA_{i} = 12 \cdot \left( \Delta_{i-2} \cdot B_{i-2} \right), \quad i \ge 2$$
(3)
$$C_{D_{1}} = 12 \cdot \left[ C_{s/m} \cdot \left( S_{max} - S_{curr} \right) \right]$$
(4)
$$C_{D_{i}} = 12 \cdot \left[ \Delta_{i-2} \cdot \left( S_{max} - B_{i-2} \right) \right], \quad i \ge 2$$
(5)

where,

$$C_{s/m} = C_{s/m(curr)} = C_{\alpha/m(curr)} + C_{\gamma/m(curr)} + C_{\eta/m(curr)} + C_{\theta/m(curr)} + C_{\kappa/m(curr)} + C_{\lambda/m(curr)} + C_{\mu/m(curr)} + C_{\sigma/m(curr)}$$
$$\Delta_{0} = \left( 1 + \delta_{1}\,\% \right) \cdot C_{s/m}$$
$$\Delta_{i} = \left( 1 + \delta_{i+1}\,\% \right) \cdot \Delta_{i-1}, \quad i \ge 1$$
$$\delta_{i}\,\% = \alpha_{i}\,\% + \gamma_{i}\,\% + \eta_{i}\,\% + \theta_{i}\,\% + \kappa_{i}\,\% + \lambda_{i}\,\% + \mu_{i}\,\% + \sigma_{i}\,\%, \quad i \ge 1$$
$$B_{0} = \left( 1 + \beta_{1}\,\% \right) \cdot S_{curr}$$
$$B_{i} = \left( 1 + \beta_{i+1}\,\% \right) \cdot B_{i-1}, \quad i \ge 1$$
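To make the recurrences above concrete, the sketch below implements Eqs. (1)–(5) directly. The function names and the example prices and growth rates are hypothetical assumptions for illustration; they are not values from the chapter.

```python
def dwh_cost(c_sm_max, s_max):
    """Eq. (1): yearly cost of a conventional data warehouse appliance."""
    return 12 * c_sm_max * s_max

def bdaas_costs_and_benefits(c_sm_curr, s_curr, s_max, delta_pct, beta_pct, years):
    """Eqs. (2)-(5): yearly BDaaS costs CA_i and benefits C_D_i, i = 1..years.

    delta_pct and beta_pct hold the predicted yearly variations (in %) of the
    unit storage price and of the demanded capacity, respectively.
    """
    costs = [12 * c_sm_curr * s_curr]                  # CA_1, Eq. (2)
    benefits = [12 * c_sm_curr * (s_max - s_curr)]     # C_D1, Eq. (4)
    delta = (1 + delta_pct[0] / 100) * c_sm_curr       # Delta_0
    b = (1 + beta_pct[0] / 100) * s_curr               # B_0
    for i in range(2, years + 1):
        costs.append(12 * delta * b)                   # CA_i, Eq. (3)
        benefits.append(12 * delta * (s_max - b))      # C_Di, Eq. (5)
        if i < years:
            delta *= 1 + delta_pct[i - 1] / 100        # Delta_{i-1}
            b *= 1 + beta_pct[i - 1] / 100             # B_{i-1}
    return costs, benefits

# Example: a 5-year horizon with hypothetical prices and growth rates.
costs, benefits = bdaas_costs_and_benefits(
    c_sm_curr=0.10, s_curr=500, s_max=1000,
    delta_pct=[2, 2, 2, 2], beta_pct=[10, 10, 10, 10], years=5)
# Positive benefits throughout indicate underutilized (leasable) capacity.
```

Note the index shift: the cost in year $i \ge 2$ is driven by $\Delta_{i-2}$ and $B_{i-2}$, i.e., the price and demand accumulated up to the previous year.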

The amount of profit not earned due to underutilization of the storage capacity is measured, under the assumption that fluctuations in the demand for cloud storage occur. The possible upgrade of the storage and the risk of incurring new and accumulated costs in the future are also examined. The cloud storage capacity to be leased out is evaluated with respect to the following assumptions:

  • The cloud storage is subscription-based and the billing varies over the period of l years due to fluctuations in the demand for storage capacity (i.e., priced per gigabyte per month).

  • The total network cost consists of bandwidth usage, egress and data transfer costs between regional and multi-region locations. As cloud-based, always-on mobile services are usually sensitive to network bandwidth and latency [42], additional network cost is expected in order to satisfy outbound network traffic demands and avoid delays.

  • Since content retrieval from a bucket may need to be faster than the default, an additional on-demand I/O cost makes it possible to increase the throughput [4, 39].

  • The additional server cost stems from the additional CPU cores and the amount of memory required for processing.

Two possible types of benefits calculation results are encountered when leasing cloud storage:

  • Positive calculation results, which point out the underutilization of the storage capacity.

  • Negative calculation results, which reveal the immediate need for an upgrade. This need incurs additional costs; however, the total accumulated cost is not comparable to that of conventional data warehouse appliances, as the earnings from adopting a big data-as-a-service business model can be reinvested in the additional storage required, maximizing the return on investment.

Towards the evaluation of big data-as-a-service business models and an increase in the return on investment, the way the benefits overcome the costs is of significant importance [12, 28, 32, 41]. An illustrative example emphasizes the need to consolidate data from different sources. Cost analysis and benefits comparisons are performed over a 5-year period (l = 5) prior to the adoption of either a conventional data warehouse or a big data-as-a-service business model. The predicted variations in the demand for cloud storage with respect to two case scenarios are shown in Table 2.

Table 2 Variations in the demand for cloud storage regarding two case scenarios

In this framework, the first case scenario reveals that adopting a big data-as-a-service business model is more cost-effective than a conventional data warehouse, as the cost analysis results for the big data-as-a-service model have the lowest positive values throughout the 5-year period. The benefits calculation results are positive for big data-as-a-service business models, while the benefits are always zero for conventional data warehouse business models (Figs. 5 and 6).

Fig. 5. Cost analysis for the first case scenario

Fig. 6. Benefits analysis for the first case scenario

On the other hand, the second case scenario points out the cost-effectiveness and the benefits gained by adopting the big data-as-a-service model during the first four years. However, the benefits calculation results become negative during the fifth year, indicating the need for an immediate upgrade to meet the demand requirements. The necessity for an upgrade is also reflected in the increased costs compared to those of the traditional data warehouse approach. In this direction, the earnings gained throughout the period, due to the selection of the big data-as-a-service business model, will be reinvested in the additional storage required, maximizing the return on investment (ROI) (Figs. 7 and 8).

Fig. 7. Cost analysis for the second case scenario

Fig. 8. Benefits analysis for the second case scenario

4 Challenges and Open Research Issues

The rise and development of social networks, multimedia, electronic commerce (e-Commerce) and cloud computing have considerably increased the volume of data. Additionally, since the needs of enterprise analytics are constantly growing, conventional hub-and-spoke architectures cannot satisfy the demands and, therefore, new and enhanced architectures are necessary [15]. In this context, new challenges and open research issues are encountered, including the storage, capture, processing, filtering, analysis, curation, search, sharing, visualization, querying and privacy of very large volumes of data. These issues are categorized and elaborated as follows [11]:

  • Data storage and management: Since big data depends on extensive storage capacity and data volumes grow exponentially, current data management systems cannot satisfy the needs of big data due to limited storage capacity. In addition, existing algorithms are not able to store data effectively because of the heterogeneity of big data.

  • Data transmission and curation: Since network bandwidth capacity is the major bottleneck in the cloud, data transmission is a challenge to overcome, especially when the volume of data is very large. Data warehouses and data marts are well-suited to managing large-scale, structured datasets: data warehouses are relational database systems that enable data storage, analysis and reporting, while data marts are built on top of data warehouses and facilitate their analysis. In this context, NoSQL databases [19] were introduced as a potential technology for large-scale, distributed data management and database design. The major advantage of NoSQL databases is their schema-free orientation, which enables quick modification of the data structure and avoids rewriting tables.

  • Data processing and analysis: Query response time is a significant issue in big data, as considerable time is needed when traversing data in a database and performing real-time analytics. A flexible and reconfigurable grid, along with enhanced big data preprocessing and the consolidation of application- and data-parallelization schemes, can be a more effective approach for extracting meaningful knowledge from the given data sets.

  • Data privacy and security: Since the hosting of data or other critical operations can be performed by third-party services or infrastructures, security issues arise with respect to big data storage and processing. The current technologies used in data security are mainly oriented towards static data, although big data entails dynamic changes in current and additional data or variations in attributes. Privacy-preserving data mining that does not expose sensitive personal information is another challenging field to be investigated.
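The schema-free orientation of NoSQL databases mentioned above can be illustrated with plain Python dictionaries standing in for documents. This is a toy, in-memory sketch; no real NoSQL system or driver is used, and all field names are illustrative.

```python
# A list standing in for a document collection (e.g., as in a document store).
collection = []

# Documents in the same collection may carry different fields; adding a
# new attribute requires no schema migration or table rewrite.
collection.append({"_id": 1, "user": "alice", "email": "alice@example.com"})
collection.append({"_id": 2, "user": "bob", "location": {"lat": 35.3, "lon": 25.1}})

# A query simply skips documents that lack the requested field.
with_location = [doc for doc in collection if "location" in doc]
# with_location holds only the second document
```

In a relational schema, adding the nested `location` attribute would require altering a table (or adding a join table); in the document model, heterogeneous records coexist in the same collection.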

5 Summary and Conclusion

Since networking is ubiquitous and vast amounts of data are now available, big data is envisioned to be a tool for productivity growth, innovation and consumer surplus. Huge opportunities related to advanced big data analytics and business intelligence are at the forefront of research, focusing on the investigation of innovative business-centric methodologies that can transform various sectors and industries, such as e-commerce, market intelligence, e-government, healthcare and security [4, 29, 31, 54]. To this end, this tutorial paper discusses current big data research and points out the research challenges and opportunities in this field by exploiting cloud computing technologies and building new models [5, 6, 16, 18, 33–36]. A cost-benefit analysis is also performed towards measuring the long-term benefits of adopting big data-as-a-service business models in order to support data-driven decision making and communicate the findings to non-technical stakeholders.