1 Introduction

Modern smart cities and smart territories [1] rely on wide-area infrastructures, populating a variety of environments with functionality-rich sensors. These smart environments include wide-area transportation management [2, 3] and large-scale smart parking systems [4, 5]. The emergence of smart environments validates large-scale sensor infrastructures as robust platforms for delivering innovative services to citizens.

Nevertheless, the successful adoption of these infrastructures critically relies on the ability to develop services. Currently, software development in this domain lacks programming models and methodologies to address key domain-specific challenges. In particular, masses of sensors produce large amounts of data that require to be analyzed efficiently to timely deliver high-value services to citizens and operators of smart environments. When considering tens of thousands of measurements, possibly accumulated over a period of time, processing of such data volumes becomes a critical issue. The pressure on processing only increases when the added values of the services rely on real-time or near-real-time analyses. In fact, the data volume to be processed and the velocity requirements of the applications to be developed may necessitate parallel processing [6]. For example, as cars rush into a city in the morning, drivers should receive up-to-date information about space availability in parking lots and estimations about its future trends, even if this involves processing massive amounts of data repeatedly. When efficiency is paramount, it is a key challenge to develop an orchestrating application that exploits properties about the sensors, optimizes the strategies to collect sensor measurements, and crunches large amounts of data.

Beyond allowing to harness large-scale sensor infrastructures, the scalability of data processing is becoming a key enabling factor for delivering personalized and/or community-aware services [7, 8]. Indeed, in such applications, not only do large amounts of sensor data have to be handled, but they must also be combined with massive data, contributed by user communities. These computations must be performed repeatedly for each user and still be delivered in a timely manner. Support for efficient parallel processing can thus also pave the way to the next level of smart city services for citizens, in terms of added value.

Existing approaches dedicated to big data processing provide limited ways to combine data processing strategies with the application logic. Apache Pig [9] and Hive [10] require developers to describe data processing in SQL-like query languages with limited support for user-defined functions. Language libraries, such as FlumeJava [11], allow developers to implement data processing via high-level language abstractions. These approaches provide data flow expressions and a set of rich data types to implement data processing. Developers still need to decide when and where data processing occurs, as well as how intermediate computations are combined. In the case of large-scale orchestration, applications may have to analyze sensor data a number of times using different algorithms, or combine them. These needs put an additional burden on developers since they have to introduce boilerplate code to separate library-specific code from the main application logic, interconnect and coordinate computations, store intermediate results, etc.

This paper proposes a design-driven approach to developing orchestrating applications for masses of sensors that integrates parallel processing of large amounts of data. In doing so, we extend our previous work on a design language dedicated to orchestrating sensors, named DiaSwarm [12], which did not address high-performance data processing. Our new approach provides the developer with declarations expressing when and where data processing occurs. The application design then compiles into a programming framework, based on the MapReduce programming model. This framework supports and guides the programming of the orchestration logic, while abstracting over the parallel processing of sensed data.

This article is an expanded version of a conference paper [13]. It provides details about the implementation of our approach, an example of personalized service enabled by our approach, and a review of the current limitations of the approach together with some corresponding language extensions.

1.1 Our contributions

1.1.1 High-level parallel processing model

Our approach allows the developer to program against a framework based on the MapReduce programming model [14, 15]. In doing so, the developer uses a well-proven approach to processing large datasets, based on a parallel implementation. We illustrate our approach with a case study of a parking management system.

1.1.2 A generative programming approach

The generated parallel-processing programming frameworks have a carefully structured data and control flow, which enables data processing to be implemented efficiently. Our compiler generates programming frameworks that rely on the MapReduce model, exposing structural parallelism of the implementation. This strategy allows to cope with large datasets collected from masses of sensors.

1.1.3 Implementation

Our approach is implementedFootnote 1 and takes the form of a plugin for the Eclipse IDE.Footnote 2 The plugin comprises a code generator, which currently produces programming support for the Apache Hadoop platform.Footnote 3

1.1.4 Validation

Our implementation is validated with an experiment that runs application computations over a large dataset of synthetic sensor readings. The experiment demonstrates that programming frameworks generated by our approach exhibit scalable behavior.

1.1.5 Enabling personalized services

We further illustrate the practical applicability of our approach through an example of personalized service for citizens of a smart city.

1.1.6 Exploring the design space

Finally, we assess the expressiveness of DiaSwarm by reviewing and reconsidering design choices. This results in language extensions.

2 Background and case study

In this section, we provide a brief introduction of the DiaSwarm language [12] dedicated to development of orchestrating applications. DiaSwarm is a declarative domain-specific design language, which follows the Sense/Compute/Control (SCC) paradigm promoted by Taylor et al. [16]. DiaSwarm provides high-level, declarative constructs to allow developers to deal with sensors and actuators at design time, prior to programming the application. Application design is processed by a compiler, which generates support for the developer that takes the form of a programming framework [17]. The generated programming framework reflects application design and covers domain-specific functionalities, such as service discovery, data gathering, component interaction and data processing. These dimensions are fully administered by the framework to allow developers to concentrate on the application logic.

Application design takes the form of a directed acyclic graph (DAG) comprising devices (i.e., sensors and actuators) and application components, namely, contexts and controllers. Context components receive data from sensors via device sources. They refine raw data into application values and may publish these values to controller components. Controllers determine the devices that need to be actuated, as well as the type of action that needs to be triggered.

2.1 Case study

We illustrate the salient features of DiaSwarm with a smart city application, which monitors the occupancy of parking lots to guide cars to available parking spaces. The application collects data from presence sensors, which are buried under the ground and determine availability of parking spaces via magnetic field variations. The application provides drivers with the number of available parking spaces for each parking lot in the city. This information is displayed on screens at the entrance of parking lots. The application also suggests parking lots to drivers entering the city to optimize the flow of traffic. Finally, the application determines the average occupancy level of each parking lot in 24 h. The occupancy level is provided to parking managers via messages.

Figure 1 presents a graphical view of the parking management application in SCC. The PresenceSensor device produces values via the presence source to the subscribed context components, namely ParkingAvailability, ParkingUsagePattern, and AverageOccupancy. The ParkingAvailability context computes the number of available parking spaces in parking lots and publishes these values at regular intervals to the ParkingEntrancePanel controller, which in turn triggers the update action to refresh the number of available parking spaces on entrance screens. Parking suggestions for drivers are computed by the ParkingSuggestion context, which is invoked every time the ParkingAvailability context publishes a value. In this case, the computation carried out by ParkingSuggestion context requires also data from the ParkingUsagePattern context. The resulting suggestions are published to the CityEntrancePanelController, which refreshes these suggestions on entrance panels. The average occupancy level functionality is designed in a similar fashion with the exception of providing computations over a 24-h period (i.e., AverageOccupancy context).

Fig. 1
figure 1

The graphical view of the parking management application

2.2 Preliminaries

Let us now briefly present the salient features of DiaSwarm declarations through fragments of the design of our case study, displayed in Fig. 2. Note that we omit details on controller components and actuators. The complete design for the parking management application and further information on DiaSwarm can be found on our website.Footnote 4

2.2.1 Service discovery

DiaSwarm service discovery is part of the design phase. The language provides application-specific high-level constructs for discovering objects in the large. The grouped by clause allows sensor data to be presented to applications through subsets of interest. In the case of the, ParkingAvailability context, parking spaces are gathered together in parking lots, as shown in line 3. Similarly, in line 10, the AverageOccupancy context groups presence values by parking lots and computes average occupancy over 24 h.

2.2.2 Data gathering

DiaSwarm provides three data delivery models, inspired by the domain of wireless sensor networks [18], namely periodic, event-driven and query driven. Data delivery declarations are called interaction contracts. Examples are listed in lines 2 and 9, where both ParkingAvailability and AverageOccupancy contexts require presence measurements to be provided every 10 min. Thus, according to these interaction contracts both context components will be activated every 10 min with presence values. Furthermore, the event-driven model provides data to context components upon an event of interest (e.g., intrusion). The query-driven model allows a context to request data from devices and other contexts.

2.2.3 Programming frameworks

To enforce domain-specific functionalities (e.g., service discovery) during programming, Java programming frameworks are produced by a compiler from DiaSwarm designs. These frameworks provide an abstract class for each component, which in turn requires developers to implement components by subclassing every abstract class.

2.3 Data processing

Although high level, the DiaSwarm declarations suggest data processing models. Specifically, an application is reactive and consists of chains of component activations. A chain is executed when its initial activation condition holds, which is always related to a sensor, and depends on its delivery model: a sensor publishes data spontaneously or is sampled periodically. The execution of a chain ends if one or more actuators are invoked or a component does not publish any value. Additionally, when a component declaration groups values (e.g., grouped by parkingLot), it will process a sequence of values, indexed by the grouping attribute (i.e., parkingLot). For example, in the ParkingAvailability component, the processing will receive values from all the presence sensors, indexed by parking lot identifiers (i.e., ParkingLotEnum). Additionally, this construct allows values to be accumulated over a period of time, as illustrated by the AverageOccupancy context (line 8). The declaration in line 10 allows presence values, not only to be grouped by parkingLot, but also to be accumulated over a 24-h period (keyword every).

Fig. 2
figure 2

Excerpt of the parking management application design in DiaSwarm

3 Exposing parallelism

The large amount of data collected from sensors calls for efficient processing strategies. We now examine how an application design influences the way data are processed. This study allows us to propose extensions to DiaSwarm and novel treatments of declarations to generate efficient parallel processing of large-scale datasets.

Fig. 3
figure 3

An implementation of the ParkingAvailability context with MapReduce

Our aim is to put in synergy design and programming by leveraging design declarations to expose parallelism and allow efficient processing strategies to be implemented. An ideal case study is the grouped by directive because it partitions a large set of gathered data and exposes a processing strategy that matches the MapReduce programming model. Indeed, this programming model is dedicated to processing large datasets in a massively parallel manner [14, 15]. It requires processing to be split into two phases: Map and Reduce. Following our approach, data processing needs to be reflected in the design phase. This is done by extending the grouped by directive with an optional clause that specifies what types of values are produced by both the Map and Reduce phases. This is illustrated in Fig. 2, where the ParkingAvailability declaration includes a MapReduce clause that declares the Map phase to produce Boolean values and the Reduce phase to produce Integer values.

The DiaSwarm compiler generates a programming framework that requires the developer to provide an implementation for both the Map and Reduce phases of the data processing. As shown in Fig. 3, this is done by implementing map and reduce methods declared in the generated MapReduce interface. In conformance with the MapReduce model, the Map function is passed a key and a value, which correspond to the parking lot identifier (i.e., the attribute of the grouped by directive) and an availability status, provided by the corresponding sensor. The emitMap method is invoked to produce each key/value pair result of the Map phase. The framework-generated code groups the results of the Map phase into a list that is then passed to the Reduce phase. This phase sums up the set of values associated with a given intermediate key and, subsequently, emits the availability of a parking lot (emitReduce). The data resulting from the MapReduce computation are presented to the developer in the form of a map (line 21). The onPeriodicPresence method (line 21 to 30) wraps data resulting from the MapReduce process into the availabilityList sequence (line 26), which is returned to subscribed components (i.e., ParkingEntrancePanelController, ParkingSuggestion).

Although our example involves simple processing, in practice, our design-driven generative approach reduces programming efforts by automatically generating application-specific MapReduce programming frameworks. Furthermore, the generated code keeps the development process straightforward since it prevents specificities of the MapReduce implementation (job scheduling/configuration/execution, distributed file system, APIs, etc.) to percolate into the application logic.

Fig. 4
figure 4

The generated support for integrating Apache Hadoop

4 Generating a programming framework

Our design-driven development approach facilitates the processing of large datasets collected from sensor infrastructures by providing the developer with a customized framework, following the MapReduce programming model. In this section, we show how generative programming is used to produce support for combining an orchestrating application with an actual implementation of MapReduce, namely Hadoop.

Apache Hadoop is an open source implementation of the MapReduce paradigm, which has gained increasing attention over the last years and is currently being used by a number of companies, including IBM, LinkedIn, Facebook and Google [19]. In our approach, our compiler generates a MapReduce program that relies on the Hadoop framework. Furthermore, this MapReduce program defines default configuration parameters that enable a job to be executed in Hadoop.

In the next three subsections, we explain how Hadoop jobs are automatically generated, and how they are integrated in the control and data flow of the generated framework. The subsequent section discusses the possible integration of big data processing backends other than Hadoop, for supporting continuous stream processing in addition to batch processing.

4.1 Setting up Hadoop jobs

Let us describe first how Hadoop jobs are automatically generated, by examining the code generated for the ParkingAvailability context, shown in Fig. 4. The ParkingAvailabilityJob class defines a Hadoop MapReduce program, which comprises the definition of both the map and reduce methods along with code related to the job configuration and execution. Both the Map function and the Reduce function are implemented by overriding the map and reduce methods of the respective Mapper and Reducer interfaces. Typically, when using the Hadoop MapReduce library, the definition of the map and reduce methods resides in the MapReduce program. In this case, however, the implementation of these operations has already been provided by the developer in the ParkingAvailability class. The MapReduce program invokes the user-defined map and reduce methods via the ParkingAvailabilityParser class, which keeps an instance of the ParkingAvailability context. ParkingAvailabilityParser interprets input data of the MapReduce program as corresponding DiaSwarm types and invokes the required map/reduce method. Consequently, results from the user-defined map/reduce method are translated to the MapReduce program and submitted via its output collector.

Figure 5 shows the ParkingAvailabilityJob class, which defines the MapReduce program for the ParkingAvailability context. The compiler generates a minimal MapReduce program for every context declared as MapReduce at design time. The type of input data for a generated MapReduce program is defined by the input format, which defaults to TextInputFormat (line 22). In our approach, sensor data are stored in the JSON format. In our case study, each presence status delivered to the application is converted to JSON and occupies precisely one line in the resulting dataset. Furthermore, each presence entry is defined by the timestamp of the event, device attributes (i.e., id, parking lot) and the presence source. TextInputFormat fits such usage since it splits the input dataset to provide the Map function with one line of text (i.e., one JSON entry) at a time. In a MapReduce program, any key or value type implements the Writable interface, which allows Hadoop to serialize objects for transmission over the network [20]. To facilitate the development of MapReduce programs, Hadoop already provides Writable wrapper classes for the majority of Java primitives (e.g., boolean \(\rightarrow\) BooleanWritable). In addition, developers may provide custom data types by defining classes implementing the Writable interface. At this stage, design declarations are of great importance since they allow the compiler to interpret key and value types of the resulting MapReduce program. For instance, as shown in Fig. 2, the ParkingAvailability context declares the output value type of the Map function as Boolean (line 4). As a result, the compiler matches the Boolean data type with the corresponding BooleanWritable wrapper class (Fig. 5, line 6). Moreover, an enumeration is interpreted as a string and matched with the Text wrapper class (Fig. 5, line 6). Finally, design declarations using complex data types result in the generation of a custom wrapper class, which implements the Writable interface and reflects the entire structure of the data type.

4.2 Managing control flow

The control flow of the generated framework depends upon the declared interaction contracts between sensors (devices) and the application logic (contexts and controllers). In particular, the contexts include those declared as MapReduce jobs, which are automatically generated as shown above. Depending on the interaction contracts specified, the following scheduling strategies are chosen:

  • If an “every” clause is specified with a period T, the corresponding context is invoked with this periodicity.

  • Otherwise, if “when periodic” is specified with a period T, the context is invoked with this periodicity.

  • Otherwise, if “when provided” is specified, the context is invoked any time the sensor produces a value.

  • Otherwise, “when required” remains the only possible option. In this case, the context is invoked only when explicitly requested by a higher-level context declaring a “get” on the current context.

In our case study, the ParkingAvailability context declares that data must be gathered from presence sensors in a 10-min time window, according to a periodic delivery model (Fig. 2, line 2). Data processing takes place when the time window elapses; that is, every 10 min, for our case study. At runtime, this job is executed with respect to the gathered sensed data and produces a result. The orchestrating application recovers the result, which is passed to the context via its callback method (e.g., onPeriodicPresence for ParkingAvailability).

4.3 Managing the data flow

Managing the data flow in the generated framework involves (1) supplying data from sensors in suitable data structures, according to context declarations (e.g., “group by” clauses), and (2) buffering data if needed, to interface between sensor delivery models and the scheduling strategies defined above.

The various clauses in context interaction contracts are processed as follows by the compiler:

  • If “when required” is specified in a contract, there are no sensor values involved. Rather, such a contract declares that the value produced by the context is kept available for subsequent requests from higher-level contexts, or by controllers. There is no buffering of older values produced by the context: only the last value produced is available to client components. In such a contract, it is not possible to specify a “grouped by” clause, nor “every” or “map ...reduce”.

  • If “when provided” is specified, the context will receive all the values produced by the sensor or lower-level context; no value is lost.

  • If “when periodic ...\(\langle T \rangle\)” is specified, the context will receive values sampled with a periodicity of T from the sensor or lower-level context.

  • If “grouped by a” is specified, data must come from a sensor device, and a must be one of the device attributes. In this case, the sensor values are indexed by the value of attribute a. If a “map ...reduce” clause is also specified, key-value pairs \(\langle k, v\rangle\) are supplied to the map phase, where k is the value of attribute a for the sensor that produced value v. If no “map ... reduce” clause is specified, the context receives pairs \(\langle k\), list(\(v)\rangle\), where the v values were produced by all the sensors whose attribute a is equal to k.

  • If “every \(\langle T \rangle\)” is specified, data must come from a sensor device. In this case, values from the sensors (gathered as specified by its event-driven or periodic delivery model) are accumulated during a period T and then passed together in a single context invocation.

In our case study, the AverageOccupancy context receives values sampled from all presence sensors every 10 min, indexed by parking lot, and accumulated over periods of 24 h. Data processing takes place every 24 h, and invokes a MapReduce job to efficiently cope with the size of the batched data.

4.4 Other data processing methods

Nowadays, the field of big data is attracting much attention from research and industry. The tool development efforts devoted to dealing with rapidly emerging sources of big data result in an abundance of open-source projects [21]. Apache Hadoop is a widely used tool to deal with large-scale datasets because it provides a reliable and scalable solution, maintained by a large community of developers. Hadoop is a batch processing tool, typically used to analyze log files of large-scale systems, collected over a long period of time. The order of magnitude of these systems may range from hundreds of gigabytes to terabytes and, possibly petabytes. Apache Spark [22] is an alternative large-scale, data processing tool, which is gaining popularity due to its promise to outperform Hadoop by 10x [23]. Spark is an in-memory, data processing framework, which builds upon fault tolerant abstractions, manipulated using a rich set of operators, called resilient distributed datasets (RDDs) [24]. In contrast with batch processing tools, Apache Storm [25] primarily targets the processing of unbounded streams of data. Storm is an example of a CEP [26] system, where the data flow through a network of transformation entities. An application topology forms a directed acyclic graph, where stream sources (spouts) flow data to sinks (bolts); it implements a single transformation on the provided stream. In the context of large-scale orchestration, the power of batch processing tools can be leveraged to analyze long-term datasets for trends in the usage of the city’s infrastructure (e.g., parking lots) and to identify structural degradation (e.g., buildings, bridges). Stream processing tools, on the other hand, are best-suited to deal with high-frequency sensor readings, which typically involve tracking applications (e.g., vehicle position, parking place availability). In the future, we intend to extend the parallel data processing compiler to integrate both Spark and Storm, allowing developers to choose the right tool for their project.

5 Experimental evaluation

To assess our approach, we have conducted a series of tests to examine the overall behavior of the MapReduce programming model for processing large amounts of sensor data. To do so, we developed a prototype of the parking management system, with Hadoop as the target platform, and analyzed the scalability of our approach using various datasets. In addition, we evaluated the design of the application and observed how specific design choices may impact the overall performance of an orchestrating application.

5.1 Experimental setup

The experimentation focuses on the average parking occupancy feature of our case study. The AverageOccupancy context processes sensor data synthesized for a 24-h period, calculates the average occupancy of a parking lot, and notifies the parking manager via a Messenger device.

5.1.1 Machines

The experiment was carried out on a cluster of 12 nodes running within a private Eucalyptus [27] cloud. Each node in the cloud corresponds to a m2.xlarge type virtual machine instance with 2 CPUs, 2GB of RAM and 10GB of disk space. Every instance ran the DataStax Enterprise 4.6.1 [28] image, which is a big data platform leveraging tools such as Apache Hadoop and Apache Spark.

5.1.2 Datasets

We generated synthetic datasets to simulate a city’s sensor infrastructure for the parking management system. Each dataset contains sensor data, indicating parking space occupancy, which is emitted every 10 min over 24 h (i.e., 144 measurements per sensor). We generated datasets for different sensor infrastructures, ranging from 10,000 to 200,000 sensors per dataset, thus testing the MapReduce program with datasets including up to 28,800,000 input records. The values for presence sensors are generated randomly, not according to any particular distribution. We did not attempt to simulate a realistic occupation of parking spaces, since the computation time in our prototype application is independent from the distributions of occupation times, and we focus here on evaluating just the scalability of the generated framework.

Fig. 5
figure 5

An example of the generated Hadoop MapReduce program for the ParkingAvailability context

5.2 Experimental results

5.2.1 Scalability

Figure 6 shows the performance of our parking management program. We compare its execution time with respect to 3 cluster setups—one, six, and twelve nodes—and an increasing input dataset size. As can be expected, the execution time of the one-node setup increases the fastest, compared to the six- and twelve-node setups. The six- and twelve-node setups perform at par for the smallest dataset sizes (from 10,000 to 50,000 sensors) because their computing power is under-used. As the size of the datasets increases, the performance of these two setups gradually separate, showing better performance for the twelve-node setup. These preliminary results show that our compiler generates MapReduce implementations that attain expected scalability. Furthermore, these results demonstrate that declarations at the design level can benefit performance by driving compilation strategies, such as parallelization in our case study. This is achieved by introducing high-level insights (MapReduce constructs) in DiaSwarm.

Fig. 6
figure 6

Performance comparison between different cluster setups

5.2.2 Optimization through design

Beyond significantly improving the execution time of an orchestrating application, Hadoop opens up further optimization opportunities at the design level. For instance, in our case study, the AverageOccupancy context processes a dataset of presence values to produce the average occupancy of each parking lot for the last 24 h. A closer look at the application design reveals that the computation provided by the AverageOccupancy context could be achieved by leveraging the computation of the ParkingAvailability context. The computed availability of parking spaces could thus be provided to the AverageOccupancy context at regular intervals, defined by the data delivery contract (i.e., \(\langle 10\,\hbox {min}\rangle\)) of the ParkingAvailability context. As a result, the AverageOccupancy context would use the provided data to calculate an average over the period of 24 h.

The suggested design adjustments are depicted in Fig. 7. As can be noticed, the design of the application remains straightforward. More importantly, this design prevents sensor readings from being processed multiple times: the AverageOccupancy context factorizes the computations performed by the ParkingAvailability context. This caching strategy reduces the total time and resources the application requires for data processing. In fact, as shown in Fig. 7, the computation performed by the AverageOccupancy context no longer involves processing of a large dataset on a cluster (hence the MapReduce clause is omitted).

This major optimization also has a direct impact on application upkeep costs, since nowadays companies delegate processing of large datasets to cloud computing platforms (e.g., Amazon Web Services) with a time-of-use pricing model.

Fig. 7
figure 7

The ParkingAvailability context factorizing the computation performed by AverageOccupancy

6 Enabling new services

The experimental results reported in the previous sections show that the integration of the MapReduce programming model in DiaSwarm and its Hadoop backend fulfill the promise of handling large amounts of data coming from massive sensor infrastructures. This section demonstrates on a concrete application scenario how the scalable data processing of our approach enables the development of new services involving personalization and community awareness. The example application is a community-aware extension of our parking management application.

The ParkingManager application described in Sect. 2 includes a context called ParkingSuggestion for displaying parking suggestions on city entrance panels. These suggestions are generic in that they are visible to all the drivers entering the city and contain a selection of parking lots having the best availability at a given moment. However, the added value of parking suggestions can be considerably enhanced by providing personalized suggestions to each driver entering the city based on their intended destination and estimated time of arrival. This important improvement can be done by integrating the ParkingManager application with a community-based navigation system such as WazeFootnote 5 or Google Maps.Footnote 6

These community services can be integrated in DiaSwarm applications in the form of a software sensor device called ParkingCommunityRadar, defined in Fig. 9. This device produces data of type Expectation when a trip is (re)computed in a community-based GPS navigation application used by a driver, or when the estimated time of arrival for an ongoing trip changes significantly. An instance of this software sensor is associated to each parking lot, as indicated by its parkingLot attribute. For privacy reasons, the destination reported to the ParkingCommunityRadar is not the ultimate destination of the driver (e.g., precise GPS coordinates), but rather the intended parking lot for leaving the car. The device also features a SuggestParking action which allows the application to send a personal parking suggestion to a specific car driver. Drivers are identified based on the user identifier in the navigation application, or based on the license plate number, for instance.

Fig. 8
figure 8

The graphical view of the community-aware parking management application

Using this software sensor, personalized suggestions can be sent to drivers by adding the PersonalParkingSuggestion context to the ParkingManager application, as shown in Fig. 8. The new context is declared in Fig. 9.

Fig. 9
figure 9

The PersonalParkingSuggestion context

The PersonalParkingSuggestion context receives data from all the ongoing trips and sends personal parking suggestions only when the intended parking lot is predicted to be full at the estimated time of arrival. In this case, an alternative nearby parking is suggested for which better availability is predicted. The availability predictions reuse the ParkingUsagePattern context described in Sect. 2, which contains a predictive model based on the parking usage history. We assume that this model provides predictions at a granularity of 10 min, but coarser-grained models could be accommodated using interpolation.

The interaction contract of this context specifies that destinations are collected in an event-driven model, but processed only every 10 min. This allows to check the future availability of the parking lots (according to the predictive model) by cumulating parking requests for the near future, from ongoing trips. Thus, the destinations accumulated over the last 10 min are first partitioned, according to the estimated time of arrival in future intervals of 10 min. For each such interval, the predicted availability of each parking lot is compared to the number of estimated arrivals; if the result shows a shortage of available parking spaces, cars in excess are sent an alternative parking suggestion. The cars to be redirected are the ones with the latest estimated time of arrival. Indeed, assuming that the estimated times and the predicted availabilities are accurate, the redirected cars are precisely those that will find a saturated parking lot. The other cars are not notified in any way, saving the attentional resources of the drivers.

According to the declarations in the context, a MapReduce job is scheduled every 10 min to cope with a large potential number of ongoing trips, while ensuring timely notifications for possible redirections. The map phase rounds estimated arrival times to the closest 10-min interval. The reduce phase computes the cumulated demand on a parking lot for each time slice. Based on the MapReduce result, the context compares the demand with the predicted availability for each parking lot to produce a (possibly empty) list of parking redirection suggestions. The output of the context is published to the CommunityRadarController component, which sends the corresponding notifications to the concerned drivers via the ParkingCommunityRadar devices of the overloaded parking lots.

This enhanced version of the parking management application provides a personalized service to drivers, preventing parking lots to be overloaded. Detection of potential parking overload is based on community data of ongoing trips, provided by drivers accepting to communicate their destinations in exchange for more reliable guidance to find available parking spaces. However, this feature requires efficient computation support for large data, ensured in our approach by MapReduce design annotations.

7 Extensions

The presentation of our approach and its evaluation show that we offer a convenient and efficient way to express large-scale computations over massive sensor infrastructures, at the scale of the Internet of Things. Furthermore, this processing power enables more personalized services to be delivered to users, without compromising service responsiveness. This section takes a step back to address the current limitations of DiaSwarm in terms of expressiveness. We review the underlying language design rationale and introduce language extensions to lift these limitations. Some of these extensions are implemented and released in our available prototype.Footnote 7

7.1 Specifying intermediate keys

As shown in Sect. 3, the MapReduce programming model is exposed at a high level of abstraction to developers. Indeed, they only have to declare that a context such as ParkingAvailability is to be implemented in the MapReduce model and provide the types of values produced by the map and reduce phases (as in Fig. 2). Then, the developers specify the map and reduce computations (as in Fig. 3), but are abstracted away from implementation issues, specific to a MapReduce backend engine, such as implementing and configuring Hadoop jobs.

Fig. 10
figure 10

The general MapReduce programming model

However, this high level of abstraction currently hides not only implementation details of the backend, but also some of the power of the MapReduce model itself. Specifically, the MapReduce model allows to use two different sets of keys for indexing data in the map and reduce phases, as shown in Fig. 10. Currently, the DiaSwarm language allows to specify only a single key set—an attribute of sensor devices—, to group the incoming data and its processing, along the map and reduce phases. This allows to use natural partitionings of sensors, according to attributes such as their location. We have chosen this strategy as it fits the major common case of large-scale sensing, while imposing minimal effort on developers. Nevertheless, more complex MapReduce computations could be specified if DiaSwarm allowed to specify a different key for the intermediate values, so as to exploit parallelism along another partitioning dimension.

For instance, an application for computing a real-time histogram of the target temperatures in all the homes in a city equipped with a heating, ventilation and air-conditioning (HVAC) system could group the readings geographically by street blocks for the map phase, but would use the discrete temperature value (an integral number of Celsius or Fahrenheit degrees) as a key in the reduce phase to compute the frequency of each value.

To cope with this limitation of DiaSwarm, we extended the language syntax to allow specifying the intermediate key of the reduce phase. These keys could be used in the generated programming framework with very small changes of the Hadoop backend.

7.2 Parallelizing higher-level contexts

DiaSwarm allows designing MapReduce contexts taking the input from many instances of a sensor device, such as the ParkingAvailability context in Fig. 2 taking its input from a large set of PresenceSensor device instances. Currently, it is not possible to specify a map/reduce clause in DiaSwarm for a higher-level context, that is, one which takes input from another context, rather than from a sensor device. Indeed, declaring a MapReduce context requires grouping incoming data by a sensor device attribute, which does not exist when data originates from a lower-level context. This choice has been taken during the design of the DiaSwarm language for two reasons:

  • contexts directly connected to a device produce refined information that is usually much more concise than the large raw dataset captured by sensors (this semantics is also implied by the “reduce” phase of a MapReduce context)—as a consequence, higher-level context tend to operate on smaller datasets;

  • the number of actuators in an application is typically much smaller than the number of sensors, eliminating another potential need for computing large datasets in higher-level contexts (such a large dataset would be necessary if a massive amount of actuator instances would have to be actuated differently by the application).

However, in some complex applications, the first assumption may be violated, as some lower-level contexts may also produce large datasets—perhaps not computed by a MapReduce, but rather accumulating some form of big data that cannot be summarized without loosing its predictive value. Furthermore, the second assumption may not be fulfilled in applications involving a large number of actuators—this is indeed a natural trend in personalized applications. When either situation arises, it could make it worth to also parallelize some of the higher-level contexts in an application.

For instance, if the parking suggestions of the ParkingManager application in Fig. 1 would have to be shown, not only at city entrances, but also on many public displays all over the city, these suggestions would have to be localized with respect to nearby parking lots and to their predicted usage patterns. In this situation, it would be a stringent requirement to parallelize the ParkingSuggestion context along parking lots or public displays, for example.

This current limitation could be removed by allowing to group the output of array-valued contexts using for instance a field of the indexed data structure (in our case, the parking lot field in the array entries produced by the ParkingAvailability context). Also, allowing to specify a map/reduce clause on a controller component could also open up many possibilities. This latter feature would require to allow grouping actuator devices along one of their attributes, similar to how sensors are currently handled in DiaSwarm.

7.3 Grouping by a computed attribute

The fact that sensor readings can only be grouped by a sensor device attribute also implies other limitations in the practical applicability of current DiaSwarm language. More precisely, grouping values by a device attribute works well when this attribute is of a large enumerated type, which is the case in many smart city applications, such as the list of parking lots in a city, or offices in a building, etc. However, other cases may not fit with this model. For instance, a particular component of a device attribute (such as the city code in the license plate number of a car) could be more suited to cluster computations on sensor readings. Also, when mobile sensor are used, with their current location expressed as GPS coordinates, it makes more sense to cluster the readings with respect to regions delimited as GPS intervals, instead of precise GPS positions.

This limitation could be removed by allowing in a grouped-by construct, not only a device attribute name, but also a general expression involving a device attribute, such as grouped by department(plateNo) or grouped by region(gpsLocation).

7.4 Grouping heterogeneous sensor readings

In DiaSwarm, the MapReduce computation declared by a context can only process uniform sensor data, that is, originating from all the different instances of a sensor device. The context may of course get values from other “secondary” sensor devices, but these values are only taken into account after the map and reduce phases, for computing the final value produced by the context. For instance, Fig. 11 shows a context computing the average pollution produced by a car in a parking lot, dividing the pollution sensed with a CO sensor at the end of the day by the average number of cars present during that day (see line 9). The pollution value in this case is not available to the MapReduce computation, but will only be passed as an extra argument to the onPeriodicPresence method in Fig. 3.

Fig. 11
figure 11

Current syntax of DiaSwarm for grouping multi-sensor measurements

Some applications could benefit from performing parallel computations on heterogeneous sensor data. For instance, a more accurate way to compute the average pollution in a parking lot would be to average all the instant pollution values. This would require sensing the CO level at the same time as when occupancy is computed (i.e., every 10 min), accumulating these simultaneous values over the computation period (24 h), compute all the instant pollution values in the map phase, and compute their average in the reduce phase.

Removing this limitation would involve extending the syntax of DiaSwarm to allow specifying the “get” clause before the “grouped by” clause, thus expressing that these sensor values should be grouped together. It would also require that the generated programming framework samples the value of each secondary sensor each time a primary sensor is sampled, so as to accumulate complete snapshots of heterogeneous sensor readings.

7.5 Optional map and reduce clauses

When a context is declared as being computed with a MapReduce operator, the developer always has to specify a map and a reduce user-defined methods. However, in many common cases, either the map or the reduce are the identity function. This is not really a limitation of the current DiaSwarm language, because it does not forbid to implement such applications. Rather, some unneeded effort can be avoided by making both the map and reduce clauses optional, with the effect of automatically generating identity transformation. To do so, we extended DiaSwarm to allow either phase to be omitted in the declaration.

8 Discussion

This section discusses the applicability of our approach to other scenarios, and some of its current limitations.

8.1 Applicability

The ParkingManager application used as our case study allowed us to illustrate the approach and its use on a concrete scenario. We deliberately chose an example involving straightforward computations for clarity reasons, enabling to show (1) concise and easy to understand snippets of the generated framework (Fig. 5) and (2) placeholders to be filled by developers (Fig. 3). Of course, computing parking availability by summing up unused spaces is not a compelling application for parallel processing. The extension to a personalized service discussed in Sect. 6 gave a more accurate view of realistic data-intensive applications. Real personalized services related to parkings which have been proposed include dynamic pricing schemes, where prices are continuously computed and updated with respect to sensing data streamed from the infrastructure, in order to optimize parking usage [29]. The same kind of dynamic pricing has been used for other scarce resources in smart cities such as shared mobility systems (pools of shared electric vehicles or bikes) in order to optimize their geographical redistribution [30]. These are concrete examples of real-time computations based on massive sensor deployments that could benefit from our domain-specific parallelization approach.

More data-intensive smart city services needing parallel computing will gradually appear following massive deployments of sensors and actuators, for such domains as traffic, weather, or energy consumption monitoring.

8.2 Limitations

One limitation of our approach is that it does not address the physical placement of computations on network nodes. In our current implementation, all computations are performed in the cloud, after data are gathered from all the sensors necessary to a given context (as specified in a “when provided/periodic” clause). The possibilities available in many sensor networks for in-network processing or distributed computing are not currently exploited. In principle, the graph of application components might be used to automatically push in the network some contexts that are “close” to sensors. Another approach would consist of extending DiaSwarm to allow explicit hints to be formulated in a declarative way. Optimizations of the underlying sensor network, such as optimized routing, could be triggered at different phases, e.g., at deployment time or during runtime. We have proposed a vision including such ideas elsewhere [31], but they are not currently implemented in our prototype.

9 Related work

In this section, we examine existing approaches that address the development of applications orchestrating sensors. We consider approaches from domains where orchestration of sensors is a common concern. Furthermore, we highlight the differences between our approach and large-scale data processing support.

9.1 Internet of things (IoT)

Patel et al. [32] propose a multistage, model-driven approach, dedicated to the development of IoT applications. This approach provides support at different stages of the development process. At design time, the approach offers a set of customizable modeling languages for the specification of an application. The approach is complemented by code generation and task-mapping techniques for the deployment of node-level code onto devices. Even though this approach is aimed to facilitate the development process through guidance, Patel et al. do not provide details regarding the size of sensed data that are gathered and processed. They do not discuss what support is generated to facilitate the programming process. This approach does not address how masses of sensors are handled nor does it present performance measurements to assess how it scales up for large datasets.

9.2 Pervasive computing

The domain of pervasive computing offers a number of approaches targeting the development of orchestrating applications. PervML [33] is a model-driven development approach that provides a conceptual framework for context-aware applications. The various aspects of a pervasive computing application are modeled by different types of UML diagrams. Dey et al. [34] propose the Context Toolkit that provides the programmer with building blocks to mediate between the contextual aspects of the environment and the application. Olympus goes beyond middleware in providing a programming framework dedicated to the development of pervasive computing systems [35]. Because it is based on a domain-specific framework, Olympus raises the level of abstraction and facilitates the development of applications. DiaSuite takes these approaches further by introducing a design language dedicated to the Sense/Compute/Control paradigm [36, 37]. A design is used to generate a dedicated programming framework that guides, restricts, and supports the implementation phase.

All the above-mentioned approaches have been designed for the orchestration of objects in the small (i.e., offices, buildings, etc.). They do not address challenges arising with large-scale infrastructures and do not provide strategies to tackle data-intensive processing.

9.3 Wireless sensor networks (WSN)

Gupta et al. [38] propose sMapReduce, a programming pattern inspired by the MapReduce programming model for mapping application behavior onto a sensor network and enabling complex data aggregation. sMapReduce divides the network-level user program into sMap and Reduce functions; this strategy, respectively, associates a behavior to sensor nodes and executes data aggregation over the network. Compared to our approach, sMapReduce remains lower-level since it provides network-level programming abstractions and introduces the network topology in computations.

Often, programming applications for WSNs is done at a low level, requiring the developer to have extensive knowledge about the underlying layers (network, hardware, OS). Mottola and Picco [39] surveyed a number of programming approaches for WSNs aimed to facilitate the programming of layers underlying applications; these approaches target sensor nodes, communication operations, routing strategies, etc. These works are complementary to ours in that they provide high-level abstractions that can be used by our compiler to target frameworks for WSNs. However, they do not provide support dedicated to dealing with large datasets produced from massive-scale sensor infrastructures.

9.4 Large-scale data processing

Apache Pig [9] and Apache Hive [10] are widely used as high-level platforms for analyzing large-scale datasets. These platforms provide SQL-like declarative query languages (i.e., PigLatin & HiveQL) to express data analysis programs. These tools are well-suited for offline data analysis, but require some effort for running scripts from application code (e.g., setting up a connection with a JDBC server). Sawzall [40] used by Google is a high-level scripting language for automating analyses on large datasets on top of the MapReduce execution model. Sawzall is not publicly available but is reported to improve the programming significantly, compared to C++ programming of MapReduce. High-level language libraries, such as FlumeJava [11], provide high-level abstractions dedicated to parallel processing; they provide support for user-defined functions, compared to SQL-like approaches.

Compared to the above-mentioned supports, our approach integrates, at the design level, two domain-specific fundamental dimensions: large-scale orchestration of sensors and large-scale data processing. The integrated nature of our approach allows developers to easily combine results from various computations. The design-driven nature of our approach is supported by high-level declarations, exposing such domain-specific information as service discovery and data delivery. Declarations are analyzed to determine data and control flow information, which in turn, is used to generate efficient, parallel-data processing frameworks.

10 Conclusion

We have proposed a design-driven approach to developing orchestrating applications for masses of sensors that integrates parallel processing of large amounts of sensed data. Our new approach provides the developer with design declarations expressing when and where data processing occurs. A compiler takes an application design as input and produces a programming framework based on the MapReduce programming model. The generated framework supports and guides the programming of the orchestration logic, while abstracting over the parallel processing of sensed data.

We have demonstrated that our approach creates synergy between design and programming, allowing seamless introduction of high-performance computing strategies, as illustrated by the MapReduce programming model. We illustrated our approach with a case study of a parking management system. This case study was used to conduct an experiment on Apache Hadoop, demonstrating how our design-driven approach can be leveraged to parallelize the processing of large datasets and obtain significant speedups.

In the future, we intend to support the processing of unbounded streams of data, typical of sensors. Our declarative approach will allow us to design orchestrating applications that mix the processing of both large datasets and unbounded data streams, allowing us to abstract away these aspects.