
1 Introduction

In various fields, commercial and social, the amount of data produced and processed is ever increasing, and appropriate applications are therefore required to manage this vast amount of data, commonly known as Big Data. Scientific disciplines such as bioinformatics and astrophysics encountered this data deluge several years ago and have recognized data-intensive science as the fourth research paradigm, alongside the experimental, theoretical, and simulation-based paradigms [1, 2]. For this reason, parallel computing, sometimes based on specialized hardware and with scaling out preferred to scaling up, has been adopted to produce tools that improve the efficiency of scientific computations.

With the proliferation of smart devices, similar scenarios are emerging in other domains. Predictions on the IoT (Internet of Things) estimate 32 billion connected devices by the year 2020, producing in excess of 440 petabytes of data and accounting for approximately 10% of the world's data [3]. On top of such huge amounts of data, processing and analysis are required, often in real time, to extract meaningful information. This becomes crucial when data come from the operational infrastructure of large companies and analysis is required to compute key performance indicators (KPIs) and to produce reports that support strategic business decisions. Combined with the need for faster management decisions, real-time processing of data streams becomes fundamental [4].

In cloud platforms, these data-intensive applications (DIA) have found the ideal environment for executing tasks and processing significant amounts of data, with scalability and reliability guaranteed by the underlying infrastructure [5]. Furthermore, the adoption of cloud solutions has had a significant impact on costs, as storing and managing data in the cloud are typically inexpensive compared to on-premises solutions. The development of this type of application has taken advantage of several approaches proposed in recent years. Especially within the Apache community [6], several projects have been started to deal with batch processing, e.g., Hadoop, and real-time processing, e.g., Storm, Spark, or Kafka, just to name a few. With these tools, data-intensive application developers can rely on a specific programming model that simplifies access to heterogeneous data sources storing data in different formats and significantly reduces the effort required to make data processing scalable and reliable. To reduce the inherent latency of the cloud, specific architectures, e.g., the Lambda Architecture [7], have been proposed for real-time analytics.

Although these approaches are now widely adopted, there are situations in which relying on a cloud infrastructure for implementing data-intensive applications is not beneficial, especially when data are produced outside of the cloud by devices (such as smart objects, laptops, servers, or dedicated systems) residing on the premises of those who want to analyze the data thus produced. Firstly, when data are produced at the edge of the network but processed in the cloud using the solutions mentioned above, bandwidth could become a bottleneck, thus increasing latency. Secondly, security and privacy issues are still among the main reasons why cloud adoption remains limited, especially in application domains where data processing mainly involves sensitive information (e.g., e-health) that usually cannot be moved to the cloud as it is. On the other hand, applying the data processing solutions developed for cloud-based infrastructures directly to the edge could be very challenging: each infrastructure has its own characteristics, could be managed differently, and involves a variety of devices with different capabilities, which makes a one-size-fits-all approach really difficult.

In such a scenario, Fog Computing [8], often also referred to as Edge Computing [9], becomes an attractive model. Fog Computing is an emerging paradigm that aims to extend Cloud Computing capabilities to fully exploit the potential of the edge of the network, where traditional and commonly used devices (e.g., handheld devices, laptops, sensors) as well as new generations of smart devices (e.g., smart wearables and mobile devices) are interconnected within the IoT environment. This new paradigm has opened new frontiers especially for data-intensive applications, since the IoT is a source of enormous amounts of data that can be exploited to provide support in a multitude of different domains (e.g., predictive machinery maintenance, patient monitoring) [10]. In fact, with Fog Computing, we can take advantage of resources living both at the edge of the network and in the cloud environment to exploit their respective advantages. Data could be stored closer to where they are produced, using edge-located resources, if the data cannot leave the boundary of the organization that owns them; conversely, cloud solutions could be adopted only after data are transformed to preserve privacy. Orthogonally, data processing could occur on the edge when the available information, computational power, and response time requirements do not call for a scalable solution, while it could be moved to the cloud when network bandwidth limitations are not an issue.

The goal of this chapter is to discuss how data management in data-intensive applications can be improved through data and computation movement in Fog environments. In particular, the chapter focuses on the experience of the European DITAS project [11], which aims to improve, through a combined cloud and fog platform, the development of data-intensive applications by enabling information logistics in Fog environments: delivering information at the right time, in the right place, and with the right quality [12], using resources belonging to both the cloud and the edge. The resulting data and computation movement is enabled by virtual data containers (VDCs), which provide an abstraction layer that adopts Service-Oriented Computing principles [13] and hides the underlying complexity of an infrastructure made of heterogeneous devices. Applications developed using the DITAS toolkit are able to exploit the advantages of both cloud-based solutions, in terms of reliability and scalability, and edge-based solutions, with respect to latency and privacy.

The rest of this chapter is structured as follows. Section 12.2 introduces the characteristics of data-intensive applications when immersed in a Fog Computing-based environment. Section 12.3 discusses the approach adopted in the DITAS project to support the deployment and execution of data-intensive applications in Fog environments, while Sect. 12.4 specifically focuses on data and computation movement actions. Finally, Sect. 12.5 discusses related work, and Sect. 12.6 concludes the chapter by outlining the future work on which the DITAS project will focus.

2 Data-Intensive Applications in Fog Computing

Data-intensive applications are becoming increasingly crucial elements of IT systems due to the ever-increasing amount of data that needs to be managed. Several types of data-intensive applications can be developed to cover one or more phases of data management, i.e., data capture and storage, data transmission, data curation, data analysis, and data visualization [5].

In recent years, usually under the umbrella of Big Data, researchers and practitioners have focused on providing tools, methods, and techniques for efficiently managing extensive amounts of data in various formats and schemas. As a result, distributed file systems (e.g., HDFS), new generations of DBMSs that abandon the relational model (e.g., MongoDB [14], Cassandra [15]), new programming models (e.g., MapReduce), and new architectures (e.g., the Lambda Architecture) have been proposed. Regardless of the specific solution, most of them rely on resources available in the cloud, thus offering the possibility to easily scale applications in or out with the amount of data to be processed. However, in some cases, relying only on cloud infrastructures may not be feasible for two main reasons: (i) data cannot be moved from where they are collected due to privacy/security issues, and (ii) the time required to move data to the cloud could be prohibitive. In such a scenario, Fog Computing [7] aims to support the required synergy between the cloud, where the application usually runs, and the devices at the edge of the network, where the data are generated, especially in IoT environments. In fact, cloud and edge are usually seen as two distinct and independent environments that, based on specific needs, are connected to each other to move data, usually from the edge to the cloud. To create a synergy between these two paradigms, the term Fog Computing has been coined, initially in the telco sector, to identify a platform able "to provide compute, storage, and networking services between cloud data centers and devices at the edge of the network" [16]. Based on this, and also in the light of the definition proposed by the OpenFog Consortium [8], we consider Fog Computing as the sum of Cloud and Edge Computing, where these two paradigms seamlessly interoperate to provide a platform in which both computation and data can be exchanged in both downstream and upstream directions [17] (see Fig. 12.1).

Fig. 12.1 Fog Computing environment

Based on these definitions, Cloud Computing is mainly related to the core of the network, whereas Edge Computing supports the owners of resources with local, in-situ means for collecting and preprocessing data before sending it to cloud resources for further utilization, thus addressing typical constraints of sensor-to-cloud scenarios such as limited bandwidth and strict latency requirements. Cloud resources include physical and virtual machines capable of processing and storing data, while smart devices, wearables, and smartphones belong to the set of edge-located resources. In short, Cloud Computing is devoted to efficiently managing capabilities and data in the cloud, whereas Edge Computing provides the owner of the available resources with the means for collecting data from the environment, which will then be processed by cloud resources.

Exploiting the Fog Computing paradigm, these two environments seamlessly interoperate to provide a platform where computation and data can be exchanged in both downstream and upstream directions. For instance, when data cannot be moved from the edge to the cloud, e.g., due to privacy issues, the computation is moved to the edge. Similarly, when there are insufficient resources at the edge, data are moved to the cloud for further processing. The DITAS project aims to provide tools, specifically designed for developers of data-intensive applications, which are able to autonomously decide where to move data and computation based on information about the type of data, the characteristics of the applications, and the resources available at both cloud and edge locations, along with application constraints such as the EU GDPR privacy legislation [18].
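To make this trade-off concrete, the following Python sketch captures the kind of decision logic described above; the class, its fields, and the thresholds are illustrative assumptions and do not reflect the actual DITAS decision mechanism.

```python
from dataclasses import dataclass


@dataclass
class RequestContext:
    """Context of a single data-processing request (illustrative fields)."""
    data_can_leave_edge: bool   # e.g., False for privacy-sensitive data (GDPR)
    edge_cpu_available: float   # fraction of edge CPU currently free
    required_cpu: float         # estimated fraction of edge CPU needed
    bandwidth_mbps: float       # measured edge-to-cloud bandwidth
    data_size_mb: float         # size of the data to be processed


def choose_processing_site(ctx: RequestContext) -> str:
    """Decide whether to process on the edge or move the data to the cloud."""
    # Hard constraint: privacy-sensitive data must stay on the edge,
    # so the computation is moved there instead of the data.
    if not ctx.data_can_leave_edge:
        return "edge"
    # If the edge lacks resources, fall back to the cloud.
    if ctx.edge_cpu_available < ctx.required_cpu:
        return "cloud"
    # Otherwise prefer the edge when moving the data would take longer than
    # an (arbitrary) one-second transfer budget.
    transfer_seconds = ctx.data_size_mb * 8 / max(ctx.bandwidth_mbps, 0.1)
    return "edge" if transfer_seconds > 1.0 else "cloud"


if __name__ == "__main__":
    ctx = RequestContext(data_can_leave_edge=True, edge_cpu_available=0.2,
                         required_cpu=0.5, bandwidth_mbps=50, data_size_mb=200)
    print(choose_processing_site(ctx))  # -> "cloud"
```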

To achieve this goal, DITAS adopts Service-Oriented Computing principles [13], where the data used by data-intensive application developers are provided through the Data as a Service (DaaS) paradigm. As shown in Fig. 12.2, we assume the existence of several data providers which take care of optimizing data collection and provisioning. Data sources could be deployed on the edge (e.g., data coming from sensors) or in the cloud (e.g., data about business transactions). The goal of the data provider is to develop and deploy a DaaS which hides the complexity of managing the underlying data sources and exposes them through APIs. Such APIs are used by data consumers, which in our scenario are data-intensive application developers who process the obtained data in order to build added-value applications.

Fig. 12.2 DITAS approach

Starting from these two standpoints, the next paragraphs discuss how Service-Oriented Computing can be adopted in a Fog environment.

2.1 Data Provisioning

With respect to the typical data management life cycle, data providers are usually in charge of collecting, storing, and supporting access to data over which they have complete control. For instance, in IoT scenarios, data are generated at the edge of the network (e.g., by sensors and mobile devices), and the data provider has to set up an environment allowing data consumers to properly access them. This usually requires moving the data to the cloud, where a seemingly unlimited amount of resources is available for efficiently storing and processing the data as well as exposing them through APIs. Although cloud resources ensure high reliability and scalability, network capacity might negatively influence latency when data move between cloud and edge resources; thus, the advantage of fast processing in the cloud might be wasted, and the offered service quality might suffer.

As an example, a data provider could be a highway manager offering data about the status of the traffic, or time series about the number of vehicles, their type, accidents, and so on. Whether this information is obtained by reading sensor values in the field or comes from applications, these data are usually moved to the cloud so that they are easily and widely accessible.

While cloud platforms provide solutions where interoperability among different infrastructures is now easy to achieve, we cannot say the same for the edge side. Indeed, agreement on protocols, data formats, and interfaces of smart devices, sensors, and smartphones has not been reached yet. This results in difficulties when data providers have to deal with heterogeneous devices that need to communicate or, at least, need to send their data to the cloud for further processing.

Focusing on processing, a significant aspect to be taken into account concerns the ever-increasing computational and storage capabilities provided by resources on the edge. Regarding data storage, once the data are created, it is not necessary to immediately move them to capable storage in the cloud. Conversely, it is possible to leave the data where they are produced and, in addition, to exploit the local computational power to perform some preprocessing directly on the edge.

As the last step in the data management life cycle, data providers have to make data available to data consumers, who may have different needs and capabilities. In this light, the principles of Service-Oriented Computing can be adopted to define the DaaS interfaces that data providers offer to allow end users to properly consume the data. Such interfaces have to consider both functional aspects (i.e., how the data can be accessed) and non-functional aspects (i.e., the quality of the data and of the service).

Regarding functional aspects, the offered APIs can adopt the REST [19] architectural style, typical SQL-based interfaces, or other paradigms. Concerning non-functional aspects, data quality dimensions (e.g., timeliness, accuracy) and service quality dimensions (e.g., response time, latency, data consistency) need to be computed and balanced according to consumer expectations.
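As a minimal sketch of such a DaaS interface, the following Python snippet exposes both the data (functional aspect) and the declared quality (non-functional aspect) through REST-style endpoints; Flask, the endpoint names, the toy data set, and the quality figures are illustrative assumptions, not part of the DITAS specification.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Non-functional description the provider advertises alongside the API.
QUALITY = {
    "data":    {"timeliness_s": 60, "accuracy": 0.98},
    "service": {"response_time_ms": 250, "availability": 0.999},
}

TRAFFIC = [  # toy data set standing in for the highway-manager example
    {"segment": "A1-km12", "vehicles_per_hour": 1430, "accidents": 0},
    {"segment": "A1-km47", "vehicles_per_hour": 910,  "accidents": 1},
]


@app.route("/traffic", methods=["GET"])
def get_traffic():
    """Functional aspect: how the data can be accessed."""
    return jsonify(TRAFFIC)


@app.route("/quality", methods=["GET"])
def get_quality():
    """Non-functional aspect: quality of data and of service."""
    return jsonify(QUALITY)


if __name__ == "__main__":
    app.run(port=8080)
```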

2.2 Data Consumption

From the consumer standpoint, data are accessible by invoking the available DaaS. Assuming that there could be several data providers, each in charge of managing different data sources, data consumers have to deal with a plethora of DaaS, each providing specific data with a different quality of data and service.

Data-intensive applications are built on top of these DaaS with the goal of analyzing and processing the provided data to create added value for the customer. For instance, the data coming from the highway manager in the example above can be combined with weather information to analyze the correlation between accidents and severe weather conditions.

A particular aspect considered in DITAS is the combination of the Fog Computing paradigm with the Service-Oriented Computing approach. We assume that, while the resources required for service provisioning are known in advance and under the control of the service provider, there are additional resources living on the premises of the customers which become known to the provider only right before data consumption starts, along with the requirements on quality of data and service. Unlike in the typical approach, these additional resources are not confined to the client infrastructure but can be included in the set of resources belonging to the service provisioning infrastructure. In this way, the data provider can exploit them to improve the user experience of that specific customer. Among the different opportunities that this scenario opens, in this chapter we focus on the possibility of hosting part of the application logic that implements the data provisioning on these resources. Similarly, resources on the edge of the network can be used to host the data that are considered relevant for the consumer, thus reducing the latency when users request them.

3 DITAS Approach

From the perspective of a data provider, exposing data following the DaaS paradigm requires deciding where to store data, in which format, how to satisfy security constraints, and many other aspects. This situation becomes even more complex when dealing with heterogeneous systems where different devices are involved in the data management. For instance, over time, different versions of smart devices might be used to collect data from the sensors installed in manufacturing plants. This implies that developers have to manage this heterogeneity as well as properly distribute the data between edge and cloud to make applications as efficient as possible.

Moving to the data consumer perspective, the development of data-intensive applications requires selecting the proper set of DaaS, considering both functional and non-functional aspects, connecting to them and starting the interaction, ensuring that the agreed quality of data and service is respected, and so on. All of these aspects could distract the data-intensive application developer from the business logic, i.e., organizing the actual data processing.

For this reason, in order to improve the productivity of application developers, the DITAS platform aims to offer tools for smart data management that hide the complexity related to data retrieval, processing, and storage. To this end, data-intensive applications in DITAS are not directly connected to the data sources containing the necessary data; access to these data sources is mediated by a specific component called Virtual Data Container (VDC) (see Fig. 12.3), which is the concrete element providing a DaaS. In more detail, a VDC:

Fig. 12.3 Role of VDC in the DITAS data management

  • Provides uniform access to data sources regardless of where they run, i.e., on the edge or on the cloud.

  • Embeds a set of data processing techniques able to transform data (e.g., encryption, compression).

  • Allows composing these processing techniques into pipelines (inspired by the Node-RED programming model) and executing the resulting application (a minimal sketch of such a pipeline follows this list).

  • Can be easily deployed on resources which can live either on the edge or in the cloud.
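As referenced in the list above, the following Python sketch illustrates a pipeline of processing techniques in the spirit of a VDC; the composition style is loosely Node-RED-inspired, and the individual steps (pseudonymization, compression) and function names are illustrative assumptions rather than the actual VDC modules.

```python
import json
import zlib
from typing import Callable, Iterable


def pseudonymize(record: dict) -> dict:
    """Replace a direct identifier with a surrogate value (toy privacy step)."""
    out = dict(record)
    if "patient_id" in out:
        out["patient_id"] = f"anon-{hash(out['patient_id']) & 0xffff:04x}"
    return out


def compress(payload: bytes) -> bytes:
    """Lossless compression before moving data towards the cloud."""
    return zlib.compress(payload)


def pipeline(records: Iterable[dict],
             steps: Iterable[Callable[[dict], dict]]) -> bytes:
    """Apply record-level steps in order, then serialize and compress."""
    processed = [dict(r) for r in records]
    for step in steps:
        processed = [step(r) for r in processed]
    return compress(json.dumps(processed).encode("utf-8"))


if __name__ == "__main__":
    data = [{"patient_id": "P-001", "heart_rate": 72}]
    blob = pipeline(data, steps=[pseudonymize])
    print(len(blob), "bytes ready to be moved")
```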

To manage this indirect access between data sources and the data-intensive application, DITAS distinguishes between the data sources’ life cycle and the data-intensive application’s life cycle. The former defines the relationship between a data source and a VDC, whereas the latter defines the relationship between the data-intensive application and the VDCs which give access to the required data.

For this reason, in the terminology adopted in DITAS, a data administrator has complete knowledge of one or more data sets and is responsible for making them available, through a DaaS, to applications that might be managed by other developers. On the other hand, a DIA developer defines the functional and non-functional aspects of the data-intensive application and selects the best fitting DaaS needed to compute the analysis. Moreover, the DIA developer is in charge of developing the business logic of the data-intensive application. Exploiting the DITAS-SDK, application developers focus only on data management, leaving to the virtual data container (VDC) the burden of selecting the best location where to store the data and the most suitable format to satisfy both the functional and non-functional requirements specified by the application designer.

Based on the work done by the data administrator and the DIA developer, DITAS offers two environments able to support the data-intensive application life cycle. More specifically:

  • An SDK assisting both data administrators and DIA developers.

  • An execution environment (EE) where the deployed VDCs and DIAs operate.

The next paragraphs detail the steps that compose the data-intensive application life cycle, as shown in Fig. 12.4. In particular, we have identified two main phases: the design and development phase, which takes advantage of the DITAS-SDK, and the execution phase, which relies on the DITAS-EE.

Fig. 12.4 Data-intensive application life cycle in DITAS

3.1 Design and Development Phase

The first step of the application life cycle concerns the work performed by the data administrator (a.k.a. data provider), who, based on the managed data sources, creates a VDC Blueprint specifying the characteristics of a VDC in terms of the following (a simplified blueprint is sketched after the list):

  • The exposed data sources.

  • The exposed APIs.

  • How the data coming from the data sources need to be processed in order to make them available through the API.

  • The non-functional properties defining the quality of data and service.

  • The components cookbook: a script defining the modules composing the container as well as their deployment.
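The sketch below shows what such a blueprint could look like when expressed as a Python structure; it is heavily simplified, and the field names and values are assumptions rather than the actual DITAS blueprint schema.

```python
# An illustrative, heavily simplified VDC Blueprint (hypothetical fields).
vdc_blueprint = {
    "exposed_data_sources": [
        {"name": "traffic-sensors", "type": "time-series", "location": "edge"},
    ],
    "exposed_api": [
        {"path": "/traffic", "method": "GET", "returns": "application/json"},
    ],
    "processing": [
        # how raw data are transformed before being served through the API
        {"step": "aggregate", "window": "5min"},
        {"step": "anonymize", "fields": ["plate_number"]},
    ],
    "non_functional": {
        "data_quality":    {"timeliness_s": 300, "accuracy": 0.95},
        "service_quality": {"response_time_ms": 500, "availability": 0.99},
    },
    "cookbook": {
        # modules composing the container and how they are deployed
        "image": "example/vdc-traffic:0.1",
        "modules": ["api-gateway", "processing-engine", "monitoring-agent"],
    },
}
```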

As DITAS follows Service-Oriented Computing principles, the visibility principle requires publishing a description of a service to make it visible to all potential users. As a consequence, the data administrator publishes the VDC Blueprint. At this stage, no specific approach for VDC discovery (e.g., centralized registry, distributed publication) has been adopted, as this is an issue left for future work.

Once the blueprint is published, the DIA developers come into play. Their role is to search for the data that are relevant for the applications they are developing. As the information included in a VDC Blueprint covers both functional and non-functional aspects, a DIA developer relies on this information to select the most suitable VDC for their purposes. It is worth noticing that, depending on the nature of the DIA, the developer could select different VDCs referring to different data. A peculiar aspect of the DITAS approach concerns data utility, which is defined as the relevance of data for the usage context, where the context is defined in terms of the designer’s goals and system characteristics [20]. In this way, data utility considers both functional (i.e., which data are needed) and non-functional (e.g., data accuracy, performance) aspects.
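The following sketch illustrates how a DIA developer could rank published blueprints by a simple data utility score combining functional and non-functional aspects; the weights, fields, and scoring formula are assumptions used only to convey the idea, not the DITAS data utility model.

```python
def data_utility(blueprint: dict, requirements: dict) -> float:
    """Return 0 if functional needs are unmet, else a weighted quality score."""
    offered = {api["path"] for api in blueprint["exposed_api"]}
    if not set(requirements["needed_paths"]) <= offered:
        return 0.0  # functional aspect: the required data are not exposed
    quality = blueprint["non_functional"]
    score = requirements["w_accuracy"] * quality["data_quality"]["accuracy"]
    score += requirements["w_latency"] * (
        1.0 - min(quality["service_quality"]["response_time_ms"] / 1000.0, 1.0))
    return score


if __name__ == "__main__":
    example_blueprint = {
        "exposed_api": [{"path": "/traffic"}],
        "non_functional": {
            "data_quality": {"accuracy": 0.95},
            "service_quality": {"response_time_ms": 500},
        },
    }
    requirements = {"needed_paths": ["/traffic"],
                    "w_accuracy": 0.6, "w_latency": 0.4}
    print(round(data_utility(example_blueprint, requirements), 3))  # 0.77
```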

Finally, the developer designs and develops the DIA and deploys it on the available resources, which can be located on the edge or in the cloud. The initial deployment is a key element of the approach since, in this phase, all the possible resources on which the VDC can be executed must be known. As introduced in Sect. 12.2, the considered Fog environment implies that a DaaS can be provided using resources belonging to both the provider and the consumer. Without loss of generality, we can assume that the provider resources are always in the cloud, while the consumer resources are always on the edge. In this way, a VDC living in the cloud has more capacity and probably lives close to the data source to which it is connected. Conversely, a VDC living on the edge has the advantage of being closer to the user, thus reducing latency when providing the requested data. Deciding where to deploy the VDC depends on the resources required by the VDC (e.g., the amount of resources needed to process the data before making them available to the user might not be available at the edge), the network characteristics (e.g., whether the connection at the consumer side can support a high-rate transmission), and security (e.g., not all data can be moved to the consumer side, so even the processing cannot always be placed at the edge).

Once the DIA has been deployed, DITAS supports a flexible execution that initiates data and computation movement when necessary to ensure the fulfillment of the non-functional requirements.

3.2 Execution Phase

Before introducing the steps of the DIA life cycle related to the execution, it is worth introducing some of the elements composing the DITAS Execution Environment (DITAS-EE). As shown in Fig. 12.5, the DITAS-EE is built on top of a Kubernetes [21] cluster: given a VDC Blueprint, a Docker [22] container is generated and deployed based on its cookbook section. Furthermore, a given VDC Blueprint can be selected by many application developers for their own applications. As a consequence, the DITAS-EE has to manage several DIAs which operate with different VDCs. Moreover, as the same VDC Blueprint can be adopted in different applications, each of these applications includes instances generated from that VDC Blueprint, all connected to the same data sources.

Fig. 12.5 DITAS execution environment

To properly manage this concurrent access, for each VDC Blueprint the DITAS-EE includes a virtual data manager (VDM) that controls the behavior of the different instances of the same VDC Blueprint, so that they operate correctly on the data sources and no conflict arises when data and/or computation movements are enacted.
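The following sketch illustrates this coordination role in Python; the class name, the per-data-source lock, and the registration method are assumptions made for illustration, not the actual VDM implementation.

```python
import threading
from collections import defaultdict


class VirtualDataManager:
    """Toy coordinator for VDC instances generated from the same blueprint."""

    def __init__(self):
        self.instances = defaultdict(list)          # blueprint id -> VDC ids
        self._locks = defaultdict(threading.Lock)   # data source -> lock

    def register(self, blueprint_id: str, vdc_id: str) -> None:
        self.instances[blueprint_id].append(vdc_id)

    def enact_movement(self, data_source: str, action) -> None:
        """Serialize movement enactments per data source to avoid conflicts."""
        with self._locks[data_source]:
            action()


if __name__ == "__main__":
    vdm = VirtualDataManager()
    vdm.register("traffic-blueprint", "vdc-1")
    vdm.enact_movement("traffic-sensors",
                       lambda: print("moving aggregated data to the cloud"))
```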

Thanks to the abstraction layer provided by the VDC, applications deployed through the platform can access the required data regardless of their nature and location (cloud or edge). Due to the distributed nature of the applications to be managed, to the execution environment being distributed by definition, and to the different computational power offered by the devices, it might happen that only a subset of the modules can be installed on a specific edge device. For this reason, at deployment time, not only is the data-intensive application distributed over the cloud and edge federations, but the execution environment is also properly deployed and configured to support data and computation movement. The decision on where to locate both the application and the data required by the application is taken at design time but can be updated during the application execution, according to the detected state of the execution environment. The approach followed to support data and computation movement is introduced in the next section.

4 Data and Computation Movement in Fog Environments

Movement strategies provide solutions for moving data and computation in a Fog environment, taking into consideration all the factors that affect application execution and data usability, while trying to keep the QoS and the data quality at the levels required by the application designer.

Data and computation movement strategies are used to decide where, when, and how to save data, on the cloud or on the edge of the network, and where, when, and how to move the tasks composing the application, in order to create a synergy between edge-based and cloud-based approaches able to find a good balance between reliability, security, sustainability, and cost.

The driver for data and computation movement is the evaluation of the data utility [20]. When an application is deployed through the DITAS platform, the application designer expresses requirements on QoS and quality of data, which are used both to drive the data source selection and to select a proper computation and data location. These application requirements include both hard and soft constraints. When the evaluation of the data utility does not satisfy the application designer's requests, the VDM enacts the most suitable data and computation movement strategies to rebalance the posed requirements, such as reducing the latency or the data size and ensuring a given accuracy, while maintaining privacy and security if requested. Data and computation movements are executed to satisfy all the hard constraints and, as much as possible, the soft constraints and requirements expressed by the user, with the final objective of executing the requested functionality while also maximizing the user experience. Whereas computation movement requires a dynamic deployment of the data processing tasks, data movement could require only a transformation of the data format (e.g., compression or encryption) or could also affect the quality of the data (e.g., data aggregation).

Data and computation movement are managed over the entire life cycle of the application, from its deployment to its dismissal. During this time, the application and its data sources are monitored and evaluated in order to satisfy the hard and soft requirements expressed by the application developer. As the decision of when, how, and where data and computation movement must occur depends on the current situation in which the data-intensive application operates, the execution environment includes a distributed monitoring system.

The management of data and computation movement follows a life cycle composed of the steps of the monitor-analyze-plan-execute (MAPE) control loop, briefly discussed below (a compact sketch of the loop follows the list):

  • Monitor: A DIA is monitored (using the data utility and QoS monitor module) through a set of indicators providing information about both the application behavior and the state of the data sources. This set of indicators can be enriched with a dependency map relating the indicators to each other, giving a more refined knowledge of the execution environment.

  • Analyze: The result of the previous phase is used to compare the current situation with the required data utility values. If the data utility provided to the application does not satisfy the application requirements, an exception is raised. This analysis is one of the main tasks executed by the VDM.

  • Plan: According to the detected violation, some movement actions should be enacted. These actions are, in fact, data movement and computation movement strategies. To support the planning phase, we will study dependencies among the different data and computation movement techniques in order to identify their positive or negative effects on the indicators related to aspects such as reliability, response time, security, and quality of data that need to be measured during the execution of applications. Knowing these relations, it is possible to select the proper action for a violation. The impact on data utility derived from the enactment of the data and computation movement strategies will be analyzed and predicted by applying data mining techniques to logs obtained from previous executions. Referring to the DITAS-EE architecture, the task movement selector and the data movement selector are the two modules in charge of the planning.

  • Execute: Once the strategies have been selected, they can be enacted in order to fix the violations. For data movement, specific modules are executed to move the data from the source to the destination, whereas for computation movement the set of possible actions corresponds to the Kubernetes capabilities.
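As announced above, the following is a compact, self-contained Python sketch of this MAPE loop; the indicator names, the threshold, and the one-entry strategy catalogue are illustrative assumptions rather than the actual DITAS modules.

```python
import random
import time


def monitor() -> dict:
    """Collect indicators about application behavior and data source state."""
    return {"response_time_ms": random.uniform(100, 900), "accuracy": 0.97}


def analyze(indicators: dict, required: dict):
    """Return the name of the violated requirement, if any."""
    for key, limit in required.items():
        if indicators[key] > limit:
            return key
    return None


def plan(violation: str) -> str:
    """Pick a movement strategy expected to fix the violation."""
    catalogue = {"response_time_ms": "move_data_closer_to_consumer"}
    return catalogue.get(violation, "no_action")


def execute(strategy: str) -> None:
    print(f"enacting: {strategy}")


if __name__ == "__main__":
    required = {"response_time_ms": 500}   # hard constraint on response time
    for _ in range(3):
        indicators = monitor()
        violation = analyze(indicators, required)
        if violation:
            execute(plan(violation))
        time.sleep(0.1)
```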

We define a movement strategy as a modification in the placement of a computing task or a set of data. An abstract movement strategy is characterized by one movement action and an object category (i.e., data or computation). It is also characterized by the effects on the environment that the enactment of such a strategy will cause. This information can be retrieved from a knowledge base built from the observation of previous enactments using machine learning techniques such as reinforcement learning.

In order to enact a movement strategy, it is also necessary to specify the actual object of the movement and its initial and final location. We define this as a movement strategy instance. A movement strategy instance may be subject to constraints given by the object of the movement. If we consider data movement, a constraint can be related to privacy and security policies on the moved data. These policies are independent of the context and need to be applied anytime a movement action is applied to an object affected by them.

In order to enact a movement action, several alternative techniques may be used. A movement technique is a building block of a movement strategy. Strategies combine these building blocks according to the specific needs of an application. Similarly, computation movement techniques define how to distribute the tasks to be executed among the available nodes, taking into account the requirements of the task, the resources made available by a node, and general requirements at the application level. For instance, in the case of a movement strategy requiring data aggregation, one of a set of different movement techniques may be selected, each implementing a different aggregation algorithm.
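These three concepts can be summarized with the following sketch; the class layout, field names, and example values are assumptions chosen to mirror the definitions above, not the DITAS data model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MovementTechnique:
    """One concrete way of realizing a movement action (e.g., an aggregation)."""
    name: str
    apply: Callable[[object], object]


@dataclass
class MovementStrategy:
    """Abstract strategy: an action, an object category, and expected effects."""
    action: str                          # e.g., "move", "duplicate", "aggregate"
    object_category: str                 # "data" or "computation"
    expected_effects: Dict[str, float]   # e.g., {"latency": -0.3, "accuracy": -0.1}
    techniques: List[MovementTechnique] = field(default_factory=list)


@dataclass
class MovementStrategyInstance:
    """Strategy bound to a concrete object and to source/target locations."""
    strategy: MovementStrategy
    moved_object: str                    # e.g., a data set or a processing task
    source: str                          # e.g., "edge:plant-A"
    target: str                          # e.g., "cloud:eu-west"
    constraints: List[str] = field(default_factory=list)  # e.g., privacy policies
```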

The selection of the most suitable movement action for a given context should be driven by the expected utility improvement, which is computed on the basis of the detected violation and the known effects of the strategy on the environment.

Based on this definition of movement strategy, the primary goal of DITAS is to find a coherent mechanism for deciding which data/computation movement action is the best to enact, based on the application characteristics, the nature of the data, privacy and security issues, and the application's non-functional requirements. Creating rules and selecting parameters for an automatic selection of data movement actions in different contexts currently represents a major challenge. In Fig. 12.6, we show a preliminary model specifying the influences that need to be mapped between several elements of the environment. More specifically, through the analysis of the data collected by the monitoring system, events are raised by violations of the data utility; these events compose the context in the top layer of the figure. The events are linked to the goals in the middle layer, which are a representation of the user requirements composing the data utility evaluation. At the bottom level, we represent the available movement strategies together with their effects on the goals.

Fig. 12.6 Influences among context goals and movement strategies
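The following sketch illustrates, under strongly simplified assumptions, how such an influence model could drive strategy selection: monitored events are mapped to goals, each strategy declares its effect on the goals, and the strategy with the best expected improvement for the violated goal is chosen. All names and numbers are invented for illustration.

```python
# Hypothetical influence model: events -> goals, strategies -> effects on goals.
EVENT_TO_GOAL = {"latency_violation": "response_time", "stale_data": "timeliness"}

STRATEGY_EFFECTS = {
    "move_data_to_edge":         {"response_time": +0.4, "timeliness": +0.1},
    "aggregate_before_sending":  {"response_time": +0.2, "timeliness": -0.2},
    "move_computation_to_cloud": {"response_time": -0.1, "timeliness": 0.0},
}


def select_strategy(event: str) -> str:
    """Pick the strategy with the largest expected improvement for the goal."""
    goal = EVENT_TO_GOAL[event]
    return max(STRATEGY_EFFECTS, key=lambda s: STRATEGY_EFFECTS[s][goal])


if __name__ == "__main__":
    print(select_strategy("latency_violation"))  # -> move_data_to_edge
```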

5 Related Work

Since the 1990s, when interconnecting heterogeneous information systems managed by different owners became easier and the Web started managing significant amounts of information, the problem of delivering such information has become more and more relevant: the more the data are distributed, the more difficult it is to find the information needed. Thus, tools are required to guide users in this task. In this scenario, information logistics has emerged as a research field for optimizing data movement, especially in networked organizations [12]. As discussed in [23], information logistics can be studied from different perspectives, e.g., how to exploit the data collected and managed inside an organization to change its strategy, how to deliver the right information to decision makers in a process, or how to support supply chain management. In our case, according to the classification proposed in [23], we are interested in user-oriented information logistics, i.e., the delivery of information to the user at the right time, in the right place, and with the right quality and format [24]; thus, data movement becomes crucial. As a general framework, three sets of data movement techniques can be distinguished:

  • The first includes techniques that affect neither the content nor the structure of the moved data. In this case, most of the approaches are application-driven, i.e., the applications accessing the data heavily influence the adoption of the techniques [25,26,27,28].

  • The second set concerns techniques that do not modify the content of the data but only their format. Lossless compression (both spatial and temporal) as well as data encryption mechanisms fall into this category, with the objective of either reducing the amount of data or making the communication secure [29, 30] (a minimal sketch of this combination follows the list).

  • Finally, the third set includes techniques that operate on the transmitted data, aiming to improve the performance of data movement while maintaining a sufficient level of data quality [24, 31, 32].
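As announced above, the following sketch combines lossless compression and symmetric encryption as an instance of the second family of techniques; zlib is part of the Python standard library, Fernet comes from the third-party cryptography package, and the pairing and ordering of the two steps are illustrative choices rather than a prescribed design.

```python
import zlib
from cryptography.fernet import Fernet

key = Fernet.generate_key()
cipher = Fernet(key)

payload = b'{"segment": "A1-km12", "vehicles_per_hour": 1430}' * 100

compressed = zlib.compress(payload)     # reduce the amount of data to move
protected = cipher.encrypt(compressed)  # make the communication confidential

# On the receiving side the steps are reversed.
restored = zlib.decompress(cipher.decrypt(protected))
assert restored == payload
print(len(payload), "->", len(protected), "bytes on the wire")
```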

To support data movement in a Fog environment, which has to deal with heterogeneous devices, data virtualization becomes fundamental. Data virtualization [33] is a data integration technique that provides access to information through a virtualized service layer regardless of data location [34]. Data virtualization [35] allows applications to access data from a variety of heterogeneous information sources via a request to a single access point; it thus provides a unified, abstracted, real-time, and encapsulated view of information for query purposes and can also transform the information to prepare it for consumption by various applications. Data virtualization solutions add levels of agility (business decision agility, time-to-solution agility, resource agility) that are difficult to achieve with traditional ETL solutions.

Container-based virtualization is one of the two approaches to lightweight virtualization [36], which minimizes the use of processor and memory resources by sharing system calls with the host operating system. Managing data in containers can be done either by keeping the data with the container or by implementing a dedicated data layer [37]. Keeping the data with the container requires techniques that move the data together with the container; an example is ClusterHQ's Flocker [38], which ensures that when an application container moves, its data container moves with it. By implementing a dedicated data layer for the storage container, data services (databases, file systems) can instead be implemented on more persistent entities such as virtual machines and physical servers.

6 Concluding Remarks

Fog Computing is an emerging paradigm able to support the development, deployment, and execution of distributed applications. This chapter has focused on a specific type of application: data-intensive applications, which have to deal with the gathering, processing, provisioning, and consumption of data. Following Service-Oriented Computing principles, the chapter has introduced the approach provided by the DITAS project, which allows a flexible execution of data-intensive applications. Such flexibility is provided through data and computation movement actions that allow a data-intensive application to change, at run time, the way in which the processing is distributed, as well as to optimize how the data are distributed among the different nodes involved in the execution, which may belong to edge and cloud environments. Computation movement is ensured by the adoption of a containerized solution which creates self-contained modules, i.e., VDCs, that can be easily moved around the Fog environment. Data movement is supported by the execution of specific actions driven by the data utility, which defines the relevance of data for the usage context, where the context is defined in terms of the designer’s goals and system characteristics.

Since the DITAS platform is still under development, future work will focus on its validation in real applications. In particular, the approach will be tested in a real case study concerning an Industry 4.0 scenario, where data coming from sensors installed on machinery need to be quickly processed by exploiting both the computational power provided at the edge, which ensures reduced latency, and the computational power available in the cloud, which ensures significant scalability. In particular, the overhead introduced by the DITAS platform will be analyzed.