1 Introduction

Siegel et al. [35] argue that scalability is needed to support the continued expansion of the Internet of Things. Therefore, performance engineering studies are very important for understanding tradeoffs between security, availability, and response time of various types of IoT applications.

Workload characterization is a fundamental and necessary step in carrying out any performance engineering study [26]. The workload of a system is defined as the set of all inputs received by the system from its environment during one or more time windows. The characterization of the workload entails determining the nature of its basic components (e.g., transactions, I/O requests, IoT device requests) as well as a quantitative and probabilistic description of the workload components in terms of the arrival process, event counts, and service demands (e.g., arrival rate of requests and interarrival time distributions, distribution of the number of IoT device signals received, distribution of the file sizes returned by an HTTP request) [26].

General methods for workload characterization have been discussed in [11, 12, 26]. Specific applications of these techniques to a variety of domains were developed by many researchers (see examples in Sect. 5). However, there is a need for workload characterization studies for IoT applications.

The recent development of Internet of Things (IoT) and edge/fog computing demands models for this new environment. Our prior work includes the development of an analytic model, called FogQN, based on queuing networks [37] and an autonomic controller that uses FogQN to dynamically determine the optimal breakdown of processing between fog and cloud servers [38].

Any effort to model fog and cloud computing calls for workload characterization studies of IoT workloads, and an understanding of the characteristics of these workloads can in turn be used to perform capacity planning studies. These two topics are the main contributions of this paper. More specifically, we (1) describe the methodology we used to analyze IoT traces; (2) describe and analyze three publicly available IoT datasets: NY city taxi trips, GPS trajectories of taxis in Beijing, and Chicago taxi trips; and (3) present a capacity planning study based on the workload characterization of the NY city taxi trips. Our workload characterization includes counts of events, i.e., IoT device signals, at various time scales (e.g., hour of the day, day of the week) and a characterization of the interarrival time of signals received from IoT devices.

The rest of this paper is organized as follows. Section 2 describes the general data collection and analysis methodology used in this paper. Section 3 has one subsection for each of the datasets we analyzed. Each subsection describes the dataset and presents the results of the workload characterization for that dataset. Section 4 provides an example of how a queuing model can be used to answer what-if questions using the workload of NY city taxi trips. Section 5 discusses related work. Finally, Sect. 6 presents concluding remarks and future work.

2 General Data Collection and Analysis Methodology

The data collection and analysis methodology presented here can be applied to a variety of IoT workloads. This paper analyzed several publicly available IoT datasets. Some existing datasets are from applications in which data is sent by a set of sensors at regular intervals (e.g., every 5 min) in a synchronous way. We did not consider these datasets because they are not very interesting from the point of view of workload analysis. The applications we considered in our study have IoT devices that are independent of each other and send signals at irregular intervals (e.g., signals sent by a taxi cab whenever a passenger is dropped off).

Our analysis methodology consisted of the following steps:

  1. Data is aggregated from all the files that make up the dataset.

  2. The aggregated data is cleansed by removing any invalid and duplicate data, and any outliers.

  3. The cleaned up data is sorted based on the timestamp of the records.

  4. The sorted data is filtered based on characteristics such as days, hours, month, latitude/longitude of the IoT device.

  5. The filtered data is characterized by computing event counts by hour of the day on a daily and monthly basis, and by day of the week.

  6. The distribution of the interarrival time of signals generated by IoT devices is characterized. We used Quantile-Quantile (Q-Q) plots and Cumulative Distribution Functions (CDF) to that effect [21].
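The steps above can be illustrated with a minimal Python sketch. It is not the tooling used in this study: the record layout (pairs of device identifier and ISO-8601 timestamp), the function name `characterize`, and the `max_gap` outlier threshold are assumptions made for illustration. Aggregation across files (step 1) and geographic filtering (step 4) are omitted.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def characterize(records, max_gap=None):
    """records: iterable of (device_id, iso_timestamp) pairs (hypothetical layout)."""
    # Step 2: drop exact duplicates and records whose timestamp cannot be parsed.
    events = []
    for device_id, stamp in set(records):
        try:
            events.append(datetime.fromisoformat(stamp))
        except (TypeError, ValueError):
            continue
    # Step 3: sort by timestamp.
    events.sort()
    # Step 5: event counts by hour of the day and by day of the week.
    by_hour = Counter(e.hour for e in events)
    by_weekday = Counter(WEEKDAYS[e.weekday()] for e in events)
    # Step 6 input: interarrival times in seconds, optionally dropping outliers.
    gaps = [(b - a).total_seconds() for a, b in zip(events, events[1:])]
    if max_gap is not None:
        gaps = [g for g in gaps if g <= max_gap]
    return by_hour, by_weekday, gaps
```

The interarrival times returned by such a function are the input to the distribution fitting of step 6.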

A Q-Q plot is a graphical tool that helps determine if the data points in a given data set come from the same distribution as a given theoretical distribution. A Q-Q plot is a scatter plot that plots two sets of quantiles (from the dataset and from the theoretical distribution) against each other. If both sets of quantiles come from the same distribution, the points in the Q-Q plot form a roughly straight line. We experimented with several candidate theoretical distributions for each dataset and performed a linear regression on the points of each Q-Q plot. The distribution that had a coefficient of determination \(R^2\) closest to 1 was chosen as the best fit theoretical distribution for the dataset. Candidate distributions must be defined over non-negative values only, because an interarrival time cannot be negative. For that reason we selected the lognormal, Weibull, and Gamma distributions. Note that the Weibull distribution has the exponential distribution as a special case (when its shape parameter equals 1).
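The Q-Q fitting criterion can be sketched in a few lines of Python. This is an illustrative sketch only: the plotting positions \((i - 0.5)/n\) are an assumption (the text does not state which positions were used), the function names are hypothetical, and the exponential quantile function serves merely as an example candidate distribution.

```python
import math

def qq_r_squared(sample, quantile_fn):
    """R^2 of the least-squares line through the Q-Q points."""
    ys = sorted(sample)                      # empirical quantiles
    n = len(ys)
    # Assumed plotting positions; they keep probabilities strictly inside (0, 1).
    xs = [quantile_fn((i - 0.5) / n) for i in range(1, n + 1)]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # For simple linear regression, R^2 equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)

def exponential_quantile(mean):
    """Inverse CDF of an exponential distribution with the given mean."""
    return lambda p: -mean * math.log(1.0 - p)
```

Calling `qq_r_squared` once per candidate distribution and keeping the candidate with the value closest to 1 mirrors the selection rule described above.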

Table 1 presents the expressions for the probability density function (pdf) and the expressions used to compute the parameters of the three considered distributions as a function of \(\bar{X}\), S and \(C = S / \bar{X}\), the mean, standard deviation and coefficient of variation of the interarrival times, respectively, computed from the datasets.

Table 1. Features of the lognormal, Weibull, and Gamma distributions.

The theoretical distribution quantile data is generated using the inverseCumulativeProbability method in the Java Apache Commons Math3 distribution package [2] with parameters computed using the equations in Table 1.
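As a hedged illustration of how quantiles can be generated from \(\bar{X}\) and C, the Python sketch below uses the standard moment-matching expressions for the lognormal distribution, \(\sigma^2 = \ln(1 + C^2)\) and \(\mu = \ln \bar{X} - \sigma^2/2\). That these coincide with the entries of Table 1 is an assumption, although they appear consistent with the values reported later: \(C_a = 2.88\) in Sect. 4 yields \(\sigma \approx 1.49\), close to the \(\sigma = 1.494\) given for the NY data. Python's `statistics.NormalDist.inv_cdf` stands in here for the Java `inverseCumulativeProbability` call; the function names are hypothetical.

```python
import math
from statistics import NormalDist

def lognormal_params(mean, cv):
    """Moment matching: (mu, sigma) from the sample mean and coefficient of variation."""
    sigma2 = math.log(1.0 + cv * cv)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)

def lognormal_quantile(p, mu, sigma):
    """Inverse CDF of the lognormal distribution for 0 < p < 1."""
    return math.exp(mu + sigma * NormalDist().inv_cdf(p))
```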

3 IoT Datasets

In this section, we describe and analyze three IoT datasets: NY city taxi trips, GPS trajectories of taxis in Beijing, and Chicago taxi trips.

3.1 New York City Taxi Trip Data

The New York City taxi trip data is provided by Illinois Data Bank, which is operated by the University of Illinois at Urbana Champaign. This dataset [15] contains records of four years (2010–2014) of taxi operations in New York City including 697,622,444 trips. The data is stored in the CSV format, organized by year and month. Each month’s data is stored in a separate file. Each row in the file represents a single taxi trip. Each trip records the pickup and drop-off dates, times, and coordinates, and the metered distance reported by the taximeter. For this analysis, we only considered the drop-off date and time, and the drop-off latitude and longitude fields. We assumed that a fog node is at Grand Central Terminal, whose latitude and longitude coordinates are (40.7527, −73.9772), and that it serves all the IoT devices (taxis) that are within a one-mile radius. This means that signals received from the taxis at drop-off locations that are within a one-mile radius are served by the Grand Central Terminal fog node. Therefore, we selected all the records that are within a one-mile radius from the fog node for this analysis. We cleaned up the data by removing duplicate and invalid entries and used the cleaned up data to generate interarrival times. We then removed the outliers (interarrival times greater than 2000 s) from the interarrival times dataset.
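The one-mile selection can be sketched as follows. The text does not state how distances were computed; the haversine great-circle formula and the Earth radius constant used here are assumptions, and the function names and the record layout (latitude, longitude pairs) are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8          # assumed mean Earth radius
FOG_NODE = (40.7527, -73.9772)       # Grand Central Terminal, from the text

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def within_radius(dropoffs, center=FOG_NODE, radius_miles=1.0):
    """Keep the (lat, lon) drop-off points that lie within the radius of the fog node."""
    return [p for p in dropoffs
            if haversine_miles(p[0], p[1], center[0], center[1]) <= radius_miles]
```

The same function applies to the other two datasets by passing the coordinates of the corresponding fog node as `center`.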

Figure 1(b) shows the variation of the number of taxi signals by hour of the day for Sunday, February 7, 2010 and Monday, February 8, 2010. It is apparent that taxi cabs are utilized more on Mondays (weekday) than on Sundays (weekend), with the exception of 12:00 am through 5:00 am. This may be because more people in New York use cabs on weekdays to move around. The number of taxi signals in the early hours of Sunday exceeds the number of taxi signals during the same time on Monday because people are more likely to go out on Saturday nights, and they utilize taxi cabs to get back home during the wee hours on Sunday. However, at the same time on Monday, most people are at home resting for the next work day. Also, the number of taxi signals is higher during the morning (5:00 am to 9:00 am) and evening rush hours (4:00 pm to 6:00 pm) on a Monday because between these peaks most people are more likely to be working in their offices.

Next, we analyzed the number of taxi signals for the entire month of February, 2010 grouped by hour of the day as shown in Fig. 1(a). The figure shows that the number of taxi signals is lower during non-working hours compared to those of working hours. Also, there is a clear rise in the number of signals during morning and evening rush hours from 5:00–9:00 am and 4:00–7:00 pm, respectively.

Next, we studied the variation of the number of taxi signals by days of the week and aggregated the data for each day of the week of February, 2010 as shown in Fig. 2. The figure shows that the lowest signal counts are recorded on Sundays.

Fig. 1. (a) Left: NY Grand Central Terminal taxi signal counts aggregated by hour of the day for the entire month of February, 2010. (b) Right: NY Grand Central Terminal taxi signal counts by hour of the day for Sunday, February 7, 2010 (weekend) and Monday, February 8, 2010 (weekday).

Fig. 2. NY Grand Central Terminal taxi signal counts aggregated by days of the week for February, 2010.

We now turn our attention to the characterization of interarrival times of taxi signals using Q-Q plots and CDFs as explained in Sect. 2. To determine the best fit distribution, the quantiles of interarrival times of taxi signals were plotted against those of various theoretical distributions (i.e., lognormal, Weibull and Gamma). Table 2 shows the parameters used for each distribution and the corresponding \(R^2\) value. The lognormal distribution has the best fit for the data with an \(R^2\) value equal to 0.941. The corresponding Q-Q plot is shown in Fig. 3(a). The CDF plots of taxi signal interarrival times and the lognormal theoretical distribution are shown in Fig. 3(b). The two CDFs match very closely. Based on the \(R^2\) value from the Q-Q plot and on the CDF plots, we can conclude that the data best fits the lognormal distribution.

Table 2. Fitting February 8, 2010 NY City taxi signal interarrival time data.

Fig. 3. (a) Q-Q plot (left) and (b) CDF plots (right) using NY Grand Central Terminal February 8, 2010 taxi signal interarrival times data and theoretical lognormal distribution data with \(\mu \) = −1.630 and \(\sigma \) = 1.494.

3.2 Microsoft T-Drive Trajectory Dataset

The Microsoft T-Drive Trajectory dataset [41] is provided by Microsoft for research purposes. This dataset contains the GPS trajectories of 10,357 taxis (one file per taxi) during the period of February 2–8, 2008 within Beijing. We ignored the data for February 2 and February 8 because they are incomplete. Each file of this dataset contains the trajectory of one taxi. The total number of points in this dataset is about 15 million and the total distance of the trajectories reaches about 9 million kilometers. We assumed that the fog node is located at Tiananmen Square, whose latitude and longitude are (39.9055, 116.3976), and that this node will serve the IoT devices (i.e., taxis) within a one-mile distance. We then selected all the records that are within a 1-mile radius from that node and used that data to generate the interarrival times of the signals. We then removed the outliers from the interarrival times data.

Figure 4(b) shows the variation of the number of taxi signals by hour of the day for Sunday, February 3, 2008, and Monday, February 4, 2008. It is apparent that taxi cabs are utilized less during the night hours than during the day. Also, more taxis are utilized during evening hours on weekends than on weekdays.

Fig. 4. (a) Left: Beijing Tiananmen Square taxi signal counts aggregated by hour of the day for February 3–7, 2008. (b) Right: Beijing Tiananmen Square taxi signal counts by hour of the day for Sunday, February 3, 2008 (weekend day) and Monday, February 4, 2008 (weekday).

Next, we analyzed the number of taxi signals from February 3–7, 2008 grouped by hour of the day as shown in Fig. 4(a). The figure shows that the number of taxi signals is lower during night hours than during the day. Figure 5 shows the variation of the number of taxi signals by day of the week from February 3–7, 2008. Among the weekdays, the highest number of taxi signals is observed on Monday, and the count decreases through the rest of the week. The second highest number is observed on Sunday, possibly because Tiananmen Square is a popular place for visitors and there are more visitors on weekends than on weekdays.

Fig. 5. Beijing’s Tiananmen Square taxi signal counts aggregated by days of the week.

Next, we characterized the interarrival times of taxi signals using Q-Q plots and CDFs as explained in Sect. 2. To determine the best fit distribution, the quantiles of interarrival times of taxi signals were plotted against those of various theoretical distributions (i.e., lognormal, Weibull and Gamma). Table 3 shows the parameters used for each distribution and the corresponding \(R^2\) value.

The lognormal distribution has the best fit for the data with an \(R^2\) value equal to 0.986. The corresponding Q-Q plot is shown in Fig. 6(a). The CDF plots of taxi signal interarrival times and the lognormal theoretical distribution are shown in Fig. 6(b). The two CDFs match very closely. Based on the \(R^2\) value from the Q-Q plot and on the CDF plots, we can conclude that the data best fits a lognormal distribution.

Table 3. Fitting February 5, 2008 Tiananmen Square taxi signal interarrival time data.

Fig. 6. Q-Q plot (left) and CDF plots (right) using Beijing Tiananmen Square February 5, 2008 taxi signal interarrival times data and theoretical lognormal distribution data with \(\mu \) = −0.130 and \(\sigma \) = 1.111.

3.3 Chicago Taxi Trips Dataset

The Chicago taxi trips dataset provided by the City of Chicago’s open data portal [1] contains information on taxi trips in Chicago reported to the City of Chicago. We exported February 2015 data in CSV format using their API. Each record in the file represents a single taxi trip and includes pickup and drop-off dates, times, and coordinates, and trip duration (in seconds). The pickup and drop-off times are rounded to the nearest 15 min and the trip duration is rounded to the nearest minute, meaning that the trip durations are in multiples of 60 s. For this analysis, we only considered the trip end time (trip start time + trip duration) and the drop-off latitude and longitude fields. We assumed that the fog node is at Millennium Park, whose latitude and longitude are (41.8826, −87.6226), and that it serves all the IoT devices (taxis) that are within a one-mile radius. Therefore, we selected all taxi trip records whose drop-off location is within a one-mile radius from the fog node for this analysis. We cleaned up the data by removing records with missing data and used the clean data for taxi trip count analysis. To compute the interarrival times, we grouped the taxi signals reported each minute and computed the interarrival times by distributing them uniformly within that minute.
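The last step can be illustrated with a short Python sketch. The text does not specify whether "uniformly" means evenly spaced or uniformly at random; the sketch below assumes even spacing, and the function names and the input layout (integer epoch seconds, each a multiple of 60) are hypothetical.

```python
from collections import Counter

def spread_within_minute(minute_stamps):
    """Place the k signals reported in a minute at offsets 0, 60/k, 2*60/k, ... seconds."""
    counts = Counter(minute_stamps)
    arrivals = []
    for minute_start in sorted(counts):
        k = counts[minute_start]
        step = 60.0 / k                      # assumed even spacing within the minute
        arrivals.extend(minute_start + i * step for i in range(k))
    return arrivals

def interarrival_times(arrivals):
    """Differences between consecutive arrival instants, in seconds."""
    return [b - a for a, b in zip(arrivals, arrivals[1:])]
```

For example, three signals reported in one minute followed by one signal in the next minute yield arrival instants 20 s apart under this assumption.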

Figure 7(b) shows the variation of the number of taxi signals by hour of the day for Sunday, February 22, 2015 and Monday, February 23, 2015. It is apparent that taxi cabs are utilized more on Mondays (weekday) than on Sundays (weekend), with the exception of 12:00 am through 6:00 am. This may be because more people in Chicago use taxis on weekdays to move around than on weekends. The number of taxi signals in the early hours of Sunday exceeds the number of taxi signals during the same time on Monday because more people are likely to go out on Saturday nights than on Sunday nights, and they utilize taxis to get back home in the early hours of the next day. Also, the number of taxi signals is higher during the morning (6:00 am to 9:00 am) and evening rush hours (3:00 pm to 6:00 pm) on a Monday (weekday) because people are more likely to use taxis to go to work and go back home during these times.

Next, we analyzed the number of taxi signals for the entire month of February, 2015 grouped by hour of the day as shown in Fig. 7(a). The figure shows that the number of taxi signals is lower during non-working hours compared to those of working hours. Also, there is a clear rise in the number of signals during morning and evening rush hours from 5:00 am to 9:00 am and 3:00 pm to 6:00 pm, respectively.

Fig. 7. Chicago Millennium Park taxi signal counts. (a) Left: aggregated by hour of the day for the entire month of February 2015, (b) Right: by hour of the day for Sunday, February 22, 2015 (weekend) and Monday, February 23, 2015 (weekday).

Next, we studied the variation of the number of taxi signals by days of the month and aggregated the data for each day of the month of February as shown in Fig. 8(a). The figure shows that the signal counts are higher on weekdays than on weekends and the lowest signal counts are seen on Sundays every week.

Next, we studied the variation of the number of taxi signals by day of the week and aggregated the data for each day of the week of February 2015 as shown in Fig. 8(b). The figure shows that the weekday counts are higher than the weekend counts and increase from Monday to Friday. Also, the lowest signal counts are recorded on Sundays.

Fig. 8. Chicago Millennium Park taxi signal counts. (a) Left: for each day in February 2015 (b) Right: aggregated by days of the week in February 2015.

We then characterized the interarrival times of taxi signals using Q-Q plots and CDFs as explained in Sect. 2. To determine the best fit distribution, the quantiles of interarrival times of taxi signals were plotted against those of various theoretical distributions (i.e., lognormal, Weibull and Gamma). Table 4 shows the parameters used for each distribution and the corresponding \(R^2\) value.

The \(R^2\) values for the lognormal and Weibull distributions are very close. However, the lognormal distribution has the best fit for the data with an \(R^2\) value equal to 0.9621. The corresponding Q-Q plot is shown in Fig. 9(a) and the plots for the CDF of interarrival times and the lognormal theoretical distribution are shown in Fig. 9(b). The two CDFs match very closely. Based on the \(R^2\) value from the Q-Q plot and on the CDF plots, we can conclude that the data best fits a lognormal distribution, even though a Weibull distribution would also be a good fit.

Table 4. Fitting February 23, 2015 Chicago taxi signal interarrival time data.

Fig. 9. Q-Q plot (left) and CDF plots (right) using the Chicago Millennium Park February 23, 2015 taxi signal interarrival times and theoretical lognormal distribution data with \(\mu \) = 0.241 and \(\sigma \) = 1.439.

4 Workload Characterization Use in Capacity Planning

As indicated above, workload characterization is an essential step for capacity planning purposes. Consider the following what-if question: How many fog servers are required to support a given load with an average response time below a certain value? We show here how we can answer this type of question using the NY City taxi workload. Let n be the number of fog servers that handle signals received from taxis within a one-mile radius of a given location. All arriving signals join a single queue and are dispatched to the first available fog server when they reach the head of the line.

The average response time of a taxi signal was computed using the approximate G/G/n queuing equation given below [26]

$$\begin{aligned} T \approx E[S] + \frac{C (\rho , n)}{n (1 - \rho ) / E [S]} \times \frac{C_a^2 + C_s^2}{2} \end{aligned} \tag{1}$$

where E[S] is the average processing time of a taxi signal, \(\rho = \lambda E[S] / n\) is the utilization of the set of n fog servers that receive a collective average arrival rate of \(\lambda \) taxi signals/sec, \(C_a\) is the coefficient of variation (i.e., the ratio of the standard deviation to the mean) of the interarrival time, \(C_s\) is the coefficient of variation of the service time, and \(C (\rho , n)\) is the Erlang-C formula given by

$$\begin{aligned} C (\rho , n) = \frac{(n \rho )^n / n!}{(1 - \rho ) \sum _{j = 0}^{n-1} (n \rho )^j / j! + (n \rho )^n / n! }. \end{aligned} \tag{2}$$

Because the utilization \(\rho \) must be less than 1, we have that \(\lambda < n / E[S]\), i.e., the average arrival rate must remain below n / E[S]. Our data showed that the maximum rate of signals received from taxis within a one-mile radius from Grand Central Terminal on February 8, 2010 was approximately 4 signals/sec. We used the G/G/n equations above to compute the variation of the average signal response time as a function of the average arrival rate \(\lambda \) for five values of n (see Fig. 10). We used the following numerical values for Fig. 10: E[S] = 0.2 s, \(C_a = 2.88\), \(C_s = 0.94\) (from 2/8/2010 data). As expected, the figure shows that the maximum arrival rate of signals that can be handled increases in proportion to the number of fog servers. For example, when \(n = 1\), the arrival rate the system can handle has to be less than 5 signals/sec, whereas for \(n = 5\) it has to be less than 25 signals/sec. Additionally, the average response time decreases as n increases for a given arrival rate. For example, at an arrival rate of 4.5 signals/sec the average response time with one server is 9.13 s, whereas with 5 servers the average response time is 0.2 s. If we want the average response time not to exceed 1 s for an average arrival rate of 10 signals/sec, we need at least 3 fog servers.
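The what-if analysis can be sketched in Python as follows. This is a minimal sketch, not the tool used to produce Fig. 10: the function names are hypothetical, the number of servers n is used in the denominator of Eq. (1), and the defaults are the numerical values stated above. Because it evaluates the formulas directly, its outputs need not match values read from Fig. 10 to the last digit.

```python
import math

def erlang_c(rho, n):
    """Eq. (2): probability that an arriving signal has to wait."""
    a = n * rho
    top = a ** n / math.factorial(n)
    bottom = (1 - rho) * sum(a ** j / math.factorial(j) for j in range(n)) + top
    return top / bottom

def response_time(lam, n, es=0.2, ca=2.88, cs=0.94):
    """Eq. (1): approximate G/G/n average response time in seconds."""
    rho = lam * es / n
    if rho >= 1:
        return math.inf                      # unstable: arrival rate >= n / E[S]
    wait = erlang_c(rho, n) / (n * (1 - rho) / es)
    return es + wait * (ca ** 2 + cs ** 2) / 2

def min_servers(lam, target, es=0.2, ca=2.88, cs=0.94):
    """Smallest n whose average response time does not exceed the target."""
    if target <= es:
        raise ValueError("target must exceed the average service time E[S]")
    n = 1
    while response_time(lam, n, es, ca, cs) > target:
        n += 1
    return n
```

With these inputs, `min_servers(10, 1.0)` returns 3, in agreement with the conclusion drawn above for an arrival rate of 10 signals/sec and a 1 s response time target.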

Fig. 10. Average response time vs. arrival rate for \(n=1, 2, 3, 4, 5\).

5 Related Work

Workload characterization studies have been conducted for various types of applications and systems. Some examples include: e-commerce [25], auction sites [5], WWW [24], networking [28, 30], live streaming media [39], spam traffic [19], storage systems [36], data centers [32], cloud computing [23], grid computing [14], memory systems [8], and database systems [16]. The work in [27] quantifies a Poisson process approximation for IoT aggregate arrival processes. The studies above have shown that different domains have their own specific workload characteristics. Our paper fills a gap in terms of understanding and characterizing IoT workloads.

The vision and challenges of edge computing were discussed in [9, 34]. There are several IoT and fog/edge computing surveys: a survey of mobile edge computing was presented in [3]; a survey of architecture, enabling technologies, security and privacy, and IoT applications was presented in [22]; and Ngu et al. presented a survey on IoT middleware [29]. Cruz et al. presented a reference model for IoT middleware [13]. The work in [33] presents an IoT architecture based on transparent computing to build scalable IoT platforms. Transparent computing enables users to select services on-demand, without being concerned with the installation and management of services.

Similarly to [38], the work in [40] aims at reducing the response time of IoT applications by offloading the load of fog-capable devices to the cloud. Another work along the same vein is [10]. Fan and Ansari [17] presented an application aware scheme to allocate IoT-based workloads to edge servers in order to minimize the response time of IoT applications. The work in [4] proposes a method for reducing latency and device energy consumption using the fog, which is based on computational offloading and network utility optimization. The work in [18] presents a vision of human-centered edge-device based computing, known as Edge-centric Computing and the research challenges associated with its implementation. The work in [7] proposed a new technique called Home Edge Computing, a three-tier edge computing architecture that provides data storage and processing near the users (home server) to achieve ultra-low latency.

The work in [20] analyzed a motion dataset to characterize the kinetic energy that can be harvested by an IoT node and developed energy allocation algorithms for such nodes. The work by Pereira et al. [31] discusses an experimental evaluation of latency in IoT service composition with mobile gateways and assesses the capabilities and limitations of a standard machine-to-machine middleware. IoT devices with security flaws are attractive targets for attacks. The work in [6] discusses HoneyScope, a network-centric approach to protect vulnerable IoT devices by creating virtualized views of the network and nodes.

None of the studies cited above presents a comprehensive workload characterization of actual IoT applications.

6 Concluding Remarks and Future Work

Understanding and quantitatively characterizing the workload generated by IoT devices is key to being able to analyze the performance of edge/fog computing environments. Our study analyzed three datasets that contain information generated by taxis in three big cities. Our workload characterization, which can be applied to other IoT workloads, included counts of events, i.e., IoT device signals, at various time scales (e.g., hour of the day, day of the week) and a characterization of the interarrival time of signals received from IoT devices.

Our results indicated that the interarrival time of IoT signals for all three datasets can be very well approximated by a lognormal distribution. We also observed that the count of events for the three taxi-related datasets can be well explained by the expected daily routines of inhabitants of large cities. We also showed that workload characterization results can be used for capacity planning studies of edge computing environments.

In the future, we intend to apply our characterization methodology to IoT datasets that deal with other types of IoT devices. We are also investigating the sensitivity of our results with respect to the location of the fog node, and how it may affect the probability distribution and parameters of the request interarrival times.