Introduction

The network infrastructure is a key element for the cloud's performance [13]. When the network is slow, cloud-hosted services may be affected too [14]. OpenStack clouds allow administrators to customize the network configuration. A typical configuration divides the network into three security domains: public, guest, and management [18]. This configuration aims to guarantee basic traffic isolation and security across the cloud network, which is necessary to prevent cloud administrative operations from negatively impacting the users' network performance. For instance, creating or shelving VM instances may cause heavy administrative network traffic, which needs to be separated from the user domain [9].

Cloud administrators need to plan the cloud infrastructure correctly to avoid performance problems and bottlenecks. To support infrastructure planning, OpenStack enables customizing the distribution of services within the data center. The servers and all service modules can be placed following the administrator's objective (e.g., high availability, consolidation, and load balancing). In this context, this paper carries out a network traffic analysis and characterization, which gives insights on how common VM management tasks (e.g., creating, pausing, and shelving) may affect the administrative network of an OpenStack-based cloud. This work is an extended version of [7], a format that provides space for better discussing fundamental concepts and the experimentation process, and for approaching new experiments. Therefore, we highlight the contributions of this work to understanding how one could carry out a network traffic characterization in an OpenStack deployment, which is a crucial step for resource planning in the cloud. In this extended version, we include a dataset [8] summarizing the results of our experiments, and we also investigate the impact of the VM's flavor on the network traffic.

This paper addresses the lack of information regarding how user-generated tasks (e.g., creating a VM instance) may impact the innermost network domain of OpenStack. The main contributions of this work are: (i) the characterization of management network traffic based on VM-related tasks (e.g., creating and shelving); (ii) experimental results considering multiple OS images and flavors; and (iii) a linear regression to estimate network traffic (useful to cover scenarios not covered by experiments and for bandwidth management).

This work is organized as follows. “OpenStack infrastructure” defines the network-related concepts of OpenStack clouds, and “Related work” discusses the related work. “Characterization methodology” presents the characterization method, while “Experiments and results” details the testbed, experimentation processes, and results. “Analysis” discusses the analysis, and “Considerations and future work” presents our considerations.

OpenStack Infrastructure

OpenStack controls a large pool of computation resources, acting as an operating system for the cloud [19]. To do so, OpenStack divides the management services into core and optional modules. Core modules are the essential ones for operating the cloud. For example, networking functionalities are held in the Neutron module, while the Nova module holds computing services. These modules interact with each other atop the data center network [21]. Inter-service communication is commonly performed through a message queuing service, but REST requests can also be used.

The message queuing services are essential for the cloud to operate in a distributed manner, providing efficient inter-process communication [20]. OpenStack supports RabbitMQ, Qpid, and ZeroMQ solutions. ZeroMQ (https://zeromq.org/) works with direct peer-to-peer communication through TCP sockets, while RabbitMQ (https://www.rabbitmq.com/) and Qpid (https://qpid.apache.org/) implement the Advanced Message Queuing Protocol (AMQP). Traditional OpenStack deployments use RabbitMQ. Figure 1 exemplifies how different services may interact/communicate to execute VM-related tasks.
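For illustration, the snippet below sketches the AMQP publish pattern over RabbitMQ's default port (TCP/5672) using the pika library. The host name and queue name are placeholders, not actual OpenStack identifiers; OpenStack itself layers oslo.messaging on top of this pattern.

```python
# Minimal AMQP publish over RabbitMQ (TCP/5672) using pika.
# "controller" and "compute.task" are illustrative placeholders.
import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="controller", port=5672))
channel = connection.channel()
channel.queue_declare(queue="compute.task")  # idempotent declaration
channel.basic_publish(exchange="",           # default direct exchange
                      routing_key="compute.task",
                      body="create instance i-42")
connection.close()
```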

Fig. 1 Services interaction related to VM operation

The data center (DC) network design and configuration for OpenStack clouds may change according to the cloud administrator's demands. Although there are several ways to configure a DC network for OpenStack, a few common points must be considered. The OpenStack documentation states the division of the network traffic into security domains: public, guest, and management (Fig. 2) [17]. Moreover, some of the core OpenStack modules are:

  • Horizon (dashboard): used for cloud overview and management;

  • Nova (compute): handles mostly instance-related-tasks, e.g., initialization, scheduling, and deallocation of VMs;

  • Neutron (network): provides network connectivity all over the cloud;

  • Glance (image manager/storage): manages the storage and retrieval of VM and container images;

  • Swift (object storage): responsible for the storage and retrieval of unstructured objects;

  • Cinder (block storage): provides persistent block storage for running instances; and

  • Keystone (identity): responsible for authentication and authorization services.

Fig. 2 Standard OpenStack networking setup [9]

The Public Domain comprises the Application Programming Interface (API) and External networks. The External network provides Internet access to VMs, while the API network is used to access OpenStack APIs. The Guest network is the one inside the Guest Security Domain, used for VM communication within the cloud deployment; and the Management Domain is the most internal security domain, reachable only within the data center. The Management Domain is mainly composed of the Management network, although it could also include a Storage network. OpenStack components' communication, as well as the access to VM images and volumes, for example, is held over the Management Security Domain.

OpenStack provides users with several VM configuration options. The VM flavor describes the basic set of specifications of the VM. For example, one can define the storage volume for the OS, the RAM configuration, and the number of virtual CPUs. The flavor configuration must be carefully thought out, considering the OS and the resources necessary for the machine to run properly. In turn, flavors can be tailored to CPU-intensive or RAM-intensive applications. The side effects of an imprecise flavor configuration may not be exclusive to the VM itself: the data center network can also be impacted, since storage services, which will be holding snapshots, for example, are decoupled from the compute nodes. For example, if the user requests more disk space than the VM actually needs, disk-related operations will waste computing and networking resources.
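As a hedged illustration of how an administrator might define such a tailored flavor, the sketch below uses the openstacksdk Python bindings; the cloud name and flavor specification are assumptions, not values from our testbed.

```python
# Sketch: defining a RAM-intensive flavor with openstacksdk.
# "mycloud" refers to a clouds.yaml entry; all names and sizes are illustrative.
import openstack

conn = openstack.connect(cloud="mycloud")
flavor = conn.compute.create_flavor(
    name="ram.heavy",  # hypothetical flavor name
    vcpus=2,
    ram=16384,         # MB; oversizing RAM/disk wastes resources, e.g., on shelving
    disk=20)           # GB
```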

Related Work

The cloud infrastructure analysis is often seen from the user's perspective [1, 2, 4, 25], overlooking the internal operations and behavior of the cloud provider. There is a lack of information regarding how user-generated tasks (e.g., VM launch) may impact the behavior of the management network [9]. Besides, cloud performance can be evaluated by analyzing its behavior during usage [3].

This paper offers an analysis and characterization approach for understanding the network traffic in the provider's management network regarding VM-related tasks performed by the user (e.g., creating, stopping, and shelving VM instances). This network traffic understanding helps cloud administrators to better design all the cloud architecture elements (e.g., network topology and bandwidth). In this sense, we defined five criteria to compare this work to the other works in this area (Table 1).

Table 1 Related work comparison

In [9], RabbitMQ traffic remained uncharacterized and there was a significant amount of miscellaneous (MISC) network traffic. Our previous work [7] showed the use of linear regression to predict the total network traffic volume produced by some user tasks for VM management, and also reduced the significant amount of MISC traffic. This paper is an extended version of [7], better detailing the experimentation process, presenting a new experiment with VM flavors, and deepening fundamental concepts. Finally, among the related work, [24] is the most similar to our proposal. However, the authors focused only on the network traffic generated by creating and destroying multiple VM instances in geo-distributed collaborative clouds, without separating traffic between services, identifying the time to perform operations, or counting the number of calls for each OpenStack service.

Characterization Methodology

Traffic characterization and analysis are techniques employed to understand and solve performance issues in computer networks [6]. Generally, these techniques involve two steps: (i) measurement: collecting data flowing through the network; and (ii) traffic analysis: studying the measured data. Analyzing the network traffic is an important step to identify/classify relevant characteristics, although it can be limited by the employed measurement phase. Moreover, measuring traffic may involve employing tools to capture data traveling across the network (e.g., TCPdump). Depending on how measurement is performed, it can be classified as Active, when the monitoring approach impacts the system being monitored or induces specific situations, or Passive, in which the monitoring does not influence the system [28].

Among the classification techniques (port-based, statistical, pattern matching, and protocol decoding) commonly used to classify Internet traffic [5, 10], a port-based approach fits well when characterizing the OpenStack management network. In this context, the running services are supposed to use well-defined ports (e.g., Nova API, the compute service, uses port TCP/8774). Knowledge of these well-defined ports is also important when defining firewall rules. However, with a naive port-based approach, the traffic generated by inter-service communication is masked as RabbitMQ network traffic (as we concluded in [9]), since it uses RabbitMQ's port (TCP/5672) and not the application port itself.
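For concreteness, a minimal sketch of such a port-based classifier follows; the port-to-service mapping uses OpenStack's default service ports (deployments may remap them), and the function name is ours.

```python
# Port-based classification sketch using OpenStack default service ports.
WELL_KNOWN_PORTS = {
    8774: "nova-api",     # compute API
    9292: "glance-api",   # image API
    9696: "neutron-api",  # networking API
    5000: "keystone",     # identity API
    5672: "rabbitmq",     # AMQP: inter-service traffic is masked here
}

def classify(src_port: int, dst_port: int) -> str:
    """Label a TCP flow by whichever endpoint matches a known service port."""
    return (WELL_KNOWN_PORTS.get(dst_port)
            or WELL_KNOWN_PORTS.get(src_port)
            or "MISC")
```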

To properly address inter-service communication, we focused on mapping established connections to RabbitMQ. Once we know which services are communicating over RabbitMQ and which TCP ports they are using, the port-based approach remains valid. Thus, we upgraded the first port-based approach by running lsof (list open files) and mapping connections to RabbitMQ, which helps identify the processes connected to RabbitMQ during the network traffic collection. Moreover, we adopted an active measurement of the consumer operations on a VM instance. Since we found no information to serve as a baseline for operations on VM instances, we chose the Active approach and defined a sequence of operations, called here the induced VM lifecycle.

The induced lifecycle is composed of VM-related tasks that cause the instance to pass through a set of state changes. For example, when the user shuts off a VM, the operation/task is STOP, and the resulting state of the VM is STOPPED. The operations/tasks in the induced lifecycle are: (1) CREATE; (2) SUSPEND; (3) RESUME; (4) STOP; and (5) SHELVE. Therefore, the VM instance is (1) created, and then its activity is (2) suspended, (3) resumed, (4) stopped (shutoff), and (5) shelved. The induced lifecycle starts with the VM instance creation and ends when the VM is shelved (meaning that it is stored for further use). Figure 4 depicts the induced lifecycle as well as the set of state changes involved in the process.

Tracking the state of a VM in real time is a complicated task, since it requires three different pieces of information: (i) ongoing tasks; (ii) current situation/status; and (iii) power (e.g., ON or OFF, RUNNING or SHUTDOWN). OpenStack maps ongoing tasks (i) as the TASK_STATE, indicating what is happening to the VM (e.g., SUSPENDING, RESUMING, and DELETING). The TASK_STATE indicates a state transition, named after the action being executed [27]. The current situation/status (ii) is mapped as the VM_STATE, indicating a stable non-transition state (e.g., PAUSED, STOPPED, and SHELVED) [27]. Finally, the power (iii) is mapped as the POWER_STATE, reflecting a snapshot of the hypervisor state, revealing whether the machine is still running and whether there was a failure (e.g., RUNNING, SHUTDOWN, and FAILED).
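These three fields can be inspected programmatically; the sketch below uses openstacksdk's Server resource, whose attribute names mirror the states above (the cloud entry and server identifier are placeholders).

```python
# Sketch: reading TASK_STATE, VM_STATE/status, and POWER_STATE of a server.
import openstack

conn = openstack.connect(cloud="mycloud")        # placeholder cloud entry
server = conn.compute.get_server("server-uuid")  # placeholder UUID
print(server.status)       # stable state, e.g., ACTIVE or SHELVED
print(server.task_state)   # transition, e.g., suspending (None when stable)
print(server.power_state)  # hypervisor snapshot, e.g., 1 == RUNNING
```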

Fig. 3 OpenStack VM's states and transitions. DELETED and ERROR states are allowed to be reached from any other states [22]

The VM states depicted in Fig. 3 refer to the VM_STATE, representing the stable state. OpenStack has a total of 12 possible VM states [16]. However, by analyzing the operations of users on our private OpenStack cloud, we found out that the vast majority of our users typically have their VMs in only the 6 states comprised by the induced lifecycle (Fig. 4). Moreover, we often use the term VM state to refer to the VM_STATE (stable non-transition state).

Fig. 4 Induced VM lifecycle

Summarizing Fig. 4 (a scripted sketch of the lifecycle follows the list):

  • 1: Operation CREATE initializes the VM instance (the instance goes from state INITIALIZED to ACTIVE);

  • 2: Operation SUSPEND suspends the instance’s activity (once the operation is done, the VM state goes from ACTIVE to SUSPENDED);

  • 3: Operation RESUME starts VM’s activity from where it stopped (the VM state goes from SUSPENDED back to ACTIVE);

  • 4: Operation STOP performs a shutoff (the VM state goes from ACTIVE to STOPPED); and

  • 5: Operation SHELVE stores the instance for further use (the VM state goes from STOPPED to SHELVED and, once the hypervisor releases the VM’s image, the final state hit is SHELVED_OFFLOADED).
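Under the assumption of openstacksdk bindings and placeholder identifiers, the induced lifecycle can be scripted roughly as follows; waits between the state transitions are elided for brevity.

```python
# Sketch of the induced lifecycle (CREATE -> SUSPEND -> RESUME -> STOP ->
# SHELVE) via openstacksdk; all IDs are placeholders.
import openstack

conn = openstack.connect(cloud="mycloud")
server = conn.compute.create_server(
    name="lifecycle-vm",
    image_id="image-uuid", flavor_id="flavor-uuid",
    networks=[{"uuid": "network-uuid"}])
server = conn.compute.wait_for_server(server)  # (1) CREATE -> ACTIVE

conn.compute.suspend_server(server)  # (2) ACTIVE    -> SUSPENDED
conn.compute.resume_server(server)   # (3) SUSPENDED -> ACTIVE
conn.compute.stop_server(server)     # (4) ACTIVE    -> STOPPED
conn.compute.shelve_server(server)   # (5) STOPPED   -> SHELVED(_OFFLOADED)
```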

Experiments and Results

In this section, we describe our experimentation methodology and results. Essential data are available through the dataset published on Zenodo [8].

Experiment Setup

CloudLab (https://www.cloudlab.us/) offers a flexible and isolated environment for research on cloud computing and, thus, was chosen as our testbed for deploying OpenStack, Stein release. CloudLab provided us with servers containing 256 GB of RAM and two 2.4 GHz processors each. The OpenStack m1.small flavor, composed of 1 vCPU, 2 GB RAM, and 20 GB storage, was set as the default flavor. All instances were interconnected by a 1 Gb/s network link. Figure 5 displays the deployment setup adopted over the two-node topology used in the experiments. The two-node topology is enough to configure and separate (using VLANs) all the network domains. Since this topology places several modules/services on the controller node, the loopback interface is also relevant for monitoring.

Fig. 5 Deployment setup adopted over two nodes on the CloudLab testbed environment

The experiments are divided into: (i) OS image changing; and (ii) VM flavor changing. Experiment (i), OS image changing, tells us how the VM-related tasks behave (in terms of administrative traffic generated) according to the OS running in the machine. Experiment (ii), VM flavor changing, tells us whether the flavor choice itself has any network traffic impact. Both experiments use QCOW2-based OS images alongside KVM as the hypervisor (the default option in the CloudLab environment). Adopting any other hypervisor (e.g., Xen), image file format, or OS image version does not impact the method or the experiments; nevertheless, for experiment replication, one should understand that using different image file formats may slightly change some procedures in the VM provisioning. The experiments rely upon the VM-related tasks described in the induced lifecycle (CREATE, SUSPEND, RESUME, STOP, and SHELVE), discussed in "Characterization methodology" and depicted in Fig. 4. Basically, Experiment (i) consists in performing the induced lifecycle against VMs running 10 different OSs, and Experiment (ii) consists in performing the induced lifecycle against VMs running the same OS but with different flavor configurations.

Experiment (i), image changing, uses ten different QCOW2-based OS images for the VM instances:

  • FreeBSD version 12.0, 454 MB image;

  • GNU/Linux Fedora Cloud version 31-1.9, 319 MB image;

  • GNU/Linux Fedora Cloud version 32-1.6, 289 MB image;

  • GNU/Linux Ubuntu Server version 18.04 LTS (Bionic Beaver), 329 MB image;

  • MS Windows Server version 2012 R2, 6150 MB image;

  • GNU/Linux CirrOS version 0.4.0, 15 MB image;

  • GNU/Linux CentOS version 7, 898 MB image;

  • GNU/Linux CentOS version 7, 1300 MB image;

  • GNU/Linux Debian version 10, 550 MB image; and

  • GNU/Linux Ubuntu Server version 20.04 LTS (Focal Fossa), 519 MB image.

Experiment (ii), flavor changing, runs Ubuntu Bionic Beaver VMs created in four different flavors:

  • m1.small: 1 vCPU, 20 GB Disk, and 2048 MB RAM;

  • m1.medium: 2 vCPUs, 40 GB Disk, and 4096 MB RAM;

  • m1.large: 4 vCPUs, 80 GB Disk, and 8192 MB RAM; and

  • m1.xlarge: 8 vCPUs, 160 GB Disk, and 16384 MB RAM.

Automation Tools and Experiment Flow

To automate the experiments, we developed the OpenStack Network Monitor (ONM),Footnote 1 a tool that helps in measuring and analyzing OpenStack network traffic. ONM is divided into two main functions: (i) monitoring; and (ii) traffic analysis. Its operation can be customized: one can fully parameterize the VMs (e.g., image and flavor) and specify a full set of VM-related tasks to be executed against the machines (we set the operations from the induced lifecycle) while the traffic monitoring runs (TCPdump-based). Moreover, ONM performs an analysis of the captured traffic, resulting in a database with relevant information about the traffic (e.g., the service/module and operation generating the traffic, size, and flow). Our tool also supports working with a VM image cache, although it was not used in the experiments described here. The scheme in Fig. 6 depicts the experimentation flow.
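The capture step can be reproduced without ONM; the sketch below is our own simplification, not ONM's implementation, and shells out to tcpdump on the management interfaces (interface names are assumptions).

```python
# Simplified capture step: one tcpdump process per management interface.
import subprocess

def start_capture(iface: str, out_pcap: str) -> subprocess.Popen:
    """Write all TCP traffic seen on iface to a .pcap file."""
    return subprocess.Popen(["tcpdump", "-i", iface, "-w", out_pcap, "tcp"])

captures = [start_capture(i, f"{i}.pcap") for i in ("eth1", "lo")]
# ... perform the induced lifecycle here, then stop the captures:
for proc in captures:
    proc.terminate()
```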

Fig. 6 Experimentation flow. The user informs the management network interfaces to ONM, and the tool performs the induced lifecycle for each OS image. At the end of the process, a database is created holding all the useful information about the network traffic. The database is used for further study of the data (e.g., creating tables and plots)

ONM implements a moduleFootnote 2 that focuses on characterizing the network traffic from RabbitMQ ("Characterization methodology" introduces the challenges of characterizing RabbitMQ traffic). Since a naive port-based approach does not fit here, ONM also monitors RabbitMQ's port (TCP/5672) through lsof. In this way, all the established connections to the RabbitMQ port can be properly mapped, resulting in an efficient port-based approach. Figure 7 shows what a default "lsof -i:5672" output looks like. The third column tells us which service established a connection to TCP 5672, and the last column tells us the source TCP port.

Fig. 7 lsof (list open files): Linux command used to map connections established to TCP 5672 (RabbitMQ port)
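A sketch of this mapping step is shown below; it assumes a default lsof layout in which the process name appears in the COMMAND column and the connection endpoints in the final NAME column (Fig. 7 may arrange columns slightly differently).

```python
# Map source TCP port -> process name for connections to RabbitMQ (TCP/5672).
import subprocess

def rabbitmq_peers() -> dict:
    out = subprocess.run(["lsof", "-i:5672", "-n", "-P"],
                         capture_output=True, text=True).stdout
    peers = {}
    for line in out.splitlines()[1:]:       # skip the header row
        cols = line.split()
        name, endpoint = cols[0], cols[-1]  # e.g., host:47712->host:5672
        if "->" in endpoint:
            src = endpoint.split("->")[0]
            peers[int(src.rsplit(":", 1)[1])] = name
    return peers
```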

Summarizing the experimentation flow from Fig. 6: we use scripts to identify the management network interfaces, configure the environment, and configure ONM to run against these interfaces and the selected images. In the sequence, ONM performs the induced lifecycle and measures the network traffic. The captured traffic is saved as a ".pcap" file, which is used as input for the traffic analysis. Once the analysis is done, ONM returns a database with only the relevant information, such as packet source/destination, service, and timestamp.
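As a self-contained illustration of the analysis step (again our simplification, not ONM's code), the sketch below tallies bytes per destination TCP port from a capture file using the dpkt parser; combined with a port-to-service map like the one sketched in "Characterization methodology", these totals become per-service traffic volumes.

```python
# Tally captured bytes per destination TCP port from a .pcap file.
# Assumes an Ethernet link layer; loopback captures need a different link type.
from collections import Counter
import dpkt

def bytes_per_port(pcap_path: str) -> Counter:
    totals = Counter()
    with open(pcap_path, "rb") as f:
        for _ts, buf in dpkt.pcap.Reader(f):
            eth = dpkt.ethernet.Ethernet(buf)
            ip = eth.data
            if isinstance(ip, dpkt.ip.IP) and isinstance(ip.data, dpkt.tcp.TCP):
                totals[ip.data.dport] += len(buf)
    return totals
```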

Results

Table 2 brings a summary of the data collected for all OSs in Experiment (i). The observed metrics are: (i) elapsed time of the operation execution; (ii) total network traffic generated by the operation; and (iii) total number of API calls identified in each operation. Each operation (CREATE, SUSPEND, RESUME, STOP, and SHELVE) was executed 30 times for each OS image.

Table 2 Data summary of the analyzed metrics

It is worthwhile to mention that VM-related tasks are basically handled at the Compute node, by the hypervisor and the VM itself. For example, a SUSPEND operation just removes the VM from memory and releases the vCPUs, but the image file remains on the Compute node. Also, a STOP operation, for example, depends on the OS running on the VM. Each OS may implement a shutoff system call in its own way, resulting in the most varying scenario. Thus, it is common that this kind of operation does not imply heavy network traffic, since there is no significant OpenStack participation in handling the operation itself, other than delegating it to the responsible parties (Compute node/hypervisor). However, operations such as CREATE and SHELVE rely on the active participation of OpenStack modules (which will be responsible for holding the VM image or snapshot), resulting in traffic being captured for analysis. In other words, the experiments are intended to measure the participation of OpenStack (in terms of administrative traffic generated) in the VM-related tasks.

In Table 2, the CREATE and SHELVE operations have the greatest impact on the volume of network traffic. This happens because the image needs to be transferred from Glance, in the Controller node, to the Compute node (Fig. 5). Likewise, in SHELVE, the snapshot taken in the Compute node needs to be transferred back to Glance. Table 2 also shows that the total traffic of CREATE and SHELVE is not largely spread among the 30 observations (between 0.02% for Windows Server and 2.17% for CirrOS). The remaining VM-related tasks (SUSPEND, RESUME, and STOP) do not imply intensive network traffic, only API calls and local operations in the Compute node. To focus on the OpenStack participation, we split up the metrics by OpenStack service. Table 3 shows the traffic by service, and Table 4 presents the API calls by service.

Table 3 Traffic volume (MB)/service (mean ± SD)
Table 4 API calls/service (mean ± SD)

As Table 4 shows, the number of measured API calls may vary according to the implementation of the induced lifecycle and the VM configurations; e.g., additional network configurations would cause an increased number of Neutron-API calls. We automated the experiments using the OpenStack Python APIs to handle VM-related tasks, as well as OpenStack Connection.Compute [15], but one could find another way to do so. The variation of API calls obtained was between 1.3% (CentOS) and 12.75% (Ubuntu Bionic Beaver), although the CREATE operation of MS Windows Server has a standard deviation (SD) a bit higher than the other SD values for all the operations and OS images. The highest SD value among all operations was measured for CirrOS and MS Windows Server in the SUSPEND operation, 36.2% and 34.9%, respectively.

From Table 3, it is evident that Glance, the module responsible for managing the VM images, accounts for most of the network traffic, and that the CREATE operation (for all images) produces an amount of Glance traffic close to the image size. For instance, the Glance traffic measured for the CREATE operation using MS Windows Server is around 6615.876 MB, while the image size is 6150 MB, confirming the transmission of the image through the network. Therefore, one can estimate the amount of administrative traffic as the total measured minus the image size. Proceeding with the MS Windows Server CREATE example: 6645.582 MB (Table 2) − 6150 MB (image size) = 495.582 MB of administrative network traffic. The SHELVE operation follows the same logic, although the file transferred through the network is a snapshot, not an OS image. The remaining operations do not produce massive network traffic and run in a few seconds.

Figures 8 and 9 provide a complementary evaluation of the network behavior during the CREATE and SHELVE operations. Figure 8 shows boxplots of the data flow per second (MB/s in log10+1 scale). The boxplot helps visualize how spread the data are by dividing its "body" into four quartiles (dots represent outliers). For instance, the boxplots for CirrOS and CentOS 7 in the CREATE operation look symmetric (equal proportions around the median), suggesting an approximately normal distribution. However, some identified outliers suggest peaks in the network traffic, possibly indicating the time when the image is transferred from one node to another through the network.

Fig. 8 Box plot of traffic (MB) per second for each OS image

On the other hand, one may observe a positively skewed distribution in the boxplots for MS Windows CREATE and SHELVE (Fig. 8). In this case, the mean is higher than the median. Also, high values (high data flow in the network) are so frequent that the outliers are eliminated. Thus, the network operates at a high rate during most of the operation, indicating the process of sending around 6 GB of data corresponding to the OS image. Additionally, the cumulative distribution provided in Fig. 9 reinforces the growing network traffic during about 45-50% of the operation.

Fig. 9 CDF plot of traffic per second for each OS image

Figure 9 shows that, for most OSs (apart from MS Windows Server and FreeBSD 12), the network behavior is constant for around 80% of the CREATE operation and 75% of the SHELVE operation. What happens with FreeBSD 12 is similar to the previously analyzed scenario of MS Windows Server. FreeBSD 12 has an execution time for SHELVE of around 15 s (Table 2). The only OS with a lower execution time for this operation is CirrOS (about 7 s for a 15 MB image). The SHELVE operation takes more than 30 s for all the other OSs; MS Windows takes about 137 s, for instance (Table 2). In addition, Fig. 8 shows that FreeBSD also registers a high frequency of high values (positively skewed distribution in the SHELVE operation), although within a shorter period. Figure 9 confirms the analysis, showing constant values during 68% of the operation (meaning a high data rate during 32% of the execution time for SHELVE).

We set up a linear regression model to study the relationship between the image size and the total traffic created by the operation. The linear regression model allows us to understand the growth of the network traffic as a function of the image size. Therefore, the image size is the predictor variable, and the network traffic is the target/response variable. Figure 10 shows the linear regression models for the CREATE and SHELVE operations, \(y = 40.700864 + 1.002707x\) and \(y = 425.4478 + 0.9401x\), respectively, where \(y\) stands for the response variable (network traffic volume) and \(x\) for the predictor (image size in MB). We employed nine OS images for fitting and one other image to compare the value predicted by the model to an actual measured value. The OS image used in the predicted vs. actual comparison was chosen randomly.
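Applied directly, the fitted coefficients give a quick traffic estimate; a minimal sketch, using the coefficients reported below:

```python
# Estimate total traffic (MB) from image size (MB) with the fitted models.
def predict_traffic(image_size_mb: float, operation: str = "CREATE") -> float:
    if operation == "CREATE":
        return 40.700864 + 1.002707 * image_size_mb
    return 425.4478 + 0.9401 * image_size_mb  # SHELVE

print(predict_traffic(6150))  # ~6207 MB expected for a 6150 MB image (CREATE)
```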

Fig. 10 Linear regression models for the CREATE and SHELVE operations. Image size is the predictor and network traffic is the target/response variable

We found good accuracy responses for the CREATE operation (Fig. 10): a Min Max Accuracy (MMX) of approximately \(93\%\) and a Mean Absolute Percentage Error (MAPE) of approximately \(7\%\). We adopted a confidence level of \(90\%\); the identified values of intercept and slope were 40.700864 and 1.002707. The slope coefficient suggests a strong relationship between image size and network traffic (Pr value of \(4e-14\)). Pr shows the probability of observing extreme values leading to coefficients of value 0 (the null hypothesis); if Pr is low enough, the null hypothesis can be discarded. Regarding the intercept coefficient, the relationship between image size and network traffic is not as strong, despite still being valid (Pr value of 0.0139). The R-squared and p value were strongly significant, 0.9998 and \(3.997e-14\), with a residual standard error of 32.3 MB on 7 degrees of freedom.

Regarding the linear model for the SHELVE operation (Fig. 10), we did not achieve high levels of accuracy: MMX = \(61\%\) and MAPE = \(39\%\). A \(90\%\) confidence level was adopted; we identified intercept and slope values of 425.4478 and 0.9401. The slope coefficient suggests a strong relationship between image size and network traffic (Pr value of \(1.43e-06\)), and the intercept coefficient suggests a valid relationship (Pr value of 0.0208). The R-squared and p value were 0.9697 and \(1.425e-06\), both significant in this context, with a residual standard error of 366.5 MB on only 7 degrees of freedom. Overall, both linear models provide a direction of what to expect from the network traffic volume when performing the CREATE and SHELVE operations. Moreover, even with satisfactory results for the context, it is evident that a larger dataset could lead the models to a better statistical validation.

Another scenario worth investigating is the network traffic generated when the OS image remains the same but the VM's flavor is changed, which is our Experiment (ii). As mentioned in "OpenStack infrastructure", the flavor of the VM specifies a basic set of configurations for the machine. Therefore, this scenario helps in understanding whether the flavor choice may impact the network traffic volume. Table 5 summarizes the results of Experiment (ii), comparing the metrics Total Traffic (in MB) and API Calls for instances of VMs created under the flavors m1.small, m1.medium, m1.large, and m1.xlarge.

Table 5 Summary of results for Experiment (ii). The network traffic does not change according to the flavor

From Table 5, it is evident that the traffic volume does not change according to the flavor itself. However, one could measure the traffic volume after applying some memory and/or disk load to the VM. Applying load would actually make use of the resources allocated by the flavor and, perhaps, the traffic volume would increase according to the flavor, mostly for the SHELVE operation. However, this is outside the scope of this experiment.

Analysis

Altogether, the experiments are designed to associate the image size with the resultant network traffic for each operation in the induced lifecycle (Fig. 4). In addition, the experiments also allow us to measure the baseline management traffic, which considers only the network traffic strictly necessary for OpenStack to handle the operation requests (excluding tasks such as image transfer through the network). Therefore, the method is replicable regardless of specific characteristics of the OS images, such as the version. On the other hand, the more images with different sizes, the more accurate the results.

From Experiment (ii), we confirm that varying the VM flavors does not significantly change the traffic resulting from the operations in the induced lifecycle. Flavors serve as templates for the instances created from them, setting configuration parameters such as the number of vCPUs, available RAM, and disk space. Therefore, such specifications do not significantly affect the network load. However, creating a snapshot from a running VM instance with allocated resources (e.g., memory and disk) could lead to increased traffic in the network, reflecting the snapshot's transmission, which could be investigated in future work.

The SUSPEND, RESUME, and STOP operations do not result in heavy network traffic. These operations mostly rely on system calls and tasks performed on the hypervisor. Therefore, the output network traffic depends mainly on the elapsed time. This can be confirmed in Table 2, since different VMs yield a similar amount of traffic per second. On the other hand, the CREATE and SHELVE operations produce the most significant amount of network traffic. The network traffic measured was classified according to the service it belongs to. Also, the inter-service communication traffic, previously masked as RabbitMQ traffic only, is now mapped to its respective services using the lsof tool. There is still a small amount of MISC traffic (Table 3), which is related to MySQL, since several OpenStack modules were contemplated in the classification. Moreover, the number of API calls depends on how the operations are performed (e.g., implementation using the Python APIs or the CLI). Also, each operation does not require a constant number of API calls; in fact, this may vary depending on configurations such as the number of network interfaces on the VM instance.

Creating or shelving VM instances demands a considerable amount of bandwidth, depending on the size of the images. Therefore, content caching is desirable so that only the management traffic remains. Such operations may cause the network to clog up if content caching is unavailable and the network is not well designed (e.g., lack of resources, minimal topology). Usually, content caching combined with a dedicated storage network is the preferred approach. However, such means do not exclude the necessity for resource planning, which avoids under/over-provisioning of resources and provides reliability whenever content caching is impossible (e.g., first-time creation, VM instance replicas for increased reliability). Therefore, estimating the resultant traffic for VM creation and shelving is still the most viable approach for resource planning and network design.

Considerations and Future Work

The present work contributes a method for characterizing the network traffic in OpenStack's management domain. We perform network monitoring based on some of the most common operations on VM instances, such as creating and stopping them. Our data are available through a summarized dataset published on Zenodo [8]. Additionally, we analyze the impact of such operations on the network and provide a linear regression to predict the resultant network load for creating and shelving instances based on their OS images.

The network traffic characterization allows administrators to understand certain behaviors and plan resource allocation. It is especially challenging to characterize the inter-service communication traffic on OpenStack, since such interaction is masked under RabbitMQ traffic. Thus, this paper also presents an alternative for identifying and mapping services communicating over RabbitMQ. Finally, we also point out the possibility of future work on measuring how a snapshot of a working VM with allocated and in-use resources could affect the network traffic and performance.