Scientific interdisciplinary research in various fields involves the cooperation of several geographically separated research groups and access to scientific data collections and high-performance computing resources, which requires an information, computing, and communication infrastructure that accounts for the specifics of each project. Building such an infrastructure in the traditional way (a local server cluster, or a data processing center (DPC) with computer virtualization tools and a data transmission network) is associated with a number of problems:

• It requires significant financial and material investments and highly qualified IT specialists. The problem becomes even more complicated if experiments in the project are conducted by independent research teams that use different working methods, often have different internal business processes and hardware and software preferences, and may be located far from each other. In addition, experiments in different subject areas can take different amounts of time.

• At the initial stage, the requirements for the project's infrastructure are known only approximately and, as a rule, are overestimated, which leads to a loss of investment efficiency.

• Work becomes more difficult when scientific data are distributed and are used by different teams at the same time.

• Data needed by one team may belong to another; without a specialized infrastructure management system, such issues are difficult to regulate.

• Different research teams within the same project may use different tools and software for processing, collecting, and storing data. Creating or mastering new tools is usually unacceptable to them; therefore, it must be possible to bring already implemented software into the information and computing environment of the project.

High-Performance Computing (HPC) has always been the primary tool for numerical experiments and modeling. Computing resources for them are provided by supercomputers and specialized server clusters. However, an analysis of the data published by the TOP500.org list of supercomputing systems [1] suggests that the number of scientific applications is growing faster than the number of supercomputers and high-performance computing units. At the same time, we see a rapid rise in the popularity of cloud computing, with Data Center Networks (DCNs) used to increase the computing power of cloud computing platforms. The EGI association [2] is a good example of this.

Supercomputers (HPC-S) and DPC cloud environments built on server clusters (HPC-C) differ in computational capability and utilization. Most applications run faster on a supercomputer than on a server cluster in a data center. However, the total time to obtain the result of a computational experiment, that is, the waiting time in the queue plus the execution time, may be longer on the HPC-S supercomputer than on the HPC-C server cluster in the cloud. We will call this total delay the program time in the system (the PTS criterion).
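To make the PTS criterion concrete, the sketch below compares the two platforms for a single program; the queue and execution estimates are invented numbers used only for illustration, not measurements from the project.

```python
# Illustrative only: the queue/execution estimates below are invented numbers.

def pts(wait_in_queue_s: float, execution_s: float) -> float:
    """Program Time in System (PTS) = waiting time in the queue + execution time."""
    return wait_in_queue_s + execution_s

# Hypothetical estimates for one MPI program:
hpc_s = pts(wait_in_queue_s=7200.0, execution_s=600.0)   # supercomputer: long queue, fast run
hpc_c = pts(wait_in_queue_s=300.0, execution_s=1500.0)   # cloud cluster: short queue, slower run

target = "HPC-S" if hpc_s <= hpc_c else "HPC-C"
print(f"PTS(HPC-S) = {hpc_s:.0f} s, PTS(HPC-C) = {hpc_c:.0f} s -> run on {target}")
```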

The mission of the MC2E project was to investigate how an environment should be organized to make it possible to create an effective information and computing infrastructure meeting the listed features of a specific interdisciplinary project. One of the still little-studied problems of this mission is the integration of two rather different high-performance computing environments, supercomputers (HPC-S) and cloud environments (HPC-C), which differ in many respects: the level and methods of resource management, virtualization techniques, the composition and form of the parameters specifying a program execution request, scheduling, and resource allocation policy. Resource allocation in an HPC-S environment is characterized by reservation and isolation methods, meaning that different computations cannot share the same physical resource. HPC-C environments, by contrast, rely on provisioning on demand. A cloud environment offers more flexibility and convenience for working with resources than HPC-S, providing virtualized resources customized for specific purposes. These differences between the HPC-S and HPC-C platforms make it difficult to switch tasks between them automatically if either becomes heavily loaded; to change the target platform, researchers need to spend time and resources retuning their software.

Another challenge on the path to integrating HPC-S and HPC-C is how to choose, in a heterogeneous integrated environment of the two platforms, the right computer for executing an MPI program, that is, a program that uses the MPI (Message Passing Interface) library, a tool for interaction between the parallel branches of the same program. In other words, for each MPI program in the queue, it is necessary to decide where its execution will be more efficient from the point of view of the PTS criterion. Dealing with the problem of integration, it is also necessary to substantiate the hypothesis that the joint use of physical resources in an HPC-C environment by several MPI programs running simultaneously will reduce the total execution time; that is, this time will be less than the sum of the sequential execution times of each.

Another difficulty lies in the ability of the environment to aggregate the resources of the data transmission network between DPCs in the DCN (Data Communication Network). The key problem here is the implementation of the Bandwidth on Demand (BoD) service. This service must allocate, upon request, channels between two or more DPCs with the appropriate quality, that is, capable of transmitting a certain amount of data in a certain time interval through a data transmission network, for example, the Internet. Note that such a service does not imply a dedicated channel between interacting computers. Moreover, it should be created dynamically by combining existing network resources (transmission lines, switches, etc.).

After the introductory theses, let us present the main results of the MC2E project: principles of building the structure of the MC2E environment; an experimental study of the influence of a network in a DPC on the sharing of processors in the clouds and the possibility of joint use of these resources by MPI programs; new approaches to predicting the execution time of programs on a certain set of computers to determine the most efficient place for program execution; approaches to the implementation of the BoD service. For a detailed report on the project, see [3].

THE MC2E STRUCTURE

Let us list the basic principles of organizing the MC2E environment:

• The infrastructure is a combination of locations equipped with computers, storage, and network resources, called federates.

• The federation controls all resources (central processing unit, memory, data network, software) provided by the federates.

• The resources of one federate can simultaneously be used by different projects.

• Federated resources can be virtualized.

• Resources at the user level are characterized by a high level of abstraction, and using them should not require high user qualifications.

• The results of experiments should always be stored and can be used by others to reproduce or continue the experiment.

• The federation provides data processing services; in other words, a federation is a computer.

Figure 1 shows the organization of the MC2E environment with the distribution of responsibilities among the international project participants.

Fig. 1. MC2E structure.

An example of a federated computing resource can be a high-performance computer or a DPC. Each federation has its own policy that governs the allocation of resources among project participants. The idea of building an information and computing infrastructure as a federation of heterogeneous computing facilities has already been used in many existing projects. Some of them were intended for experiments in the field of computer networks. For example, the GENI project (Global Environment for Network Innovation) [4], initiated by the US National Science Foundation (NSF), is a virtual laboratory for network experiments on an international scale. Today, over 200 US universities are connected to the GENI environment. Another NSF-supported project, FABRIC [5], offers an adaptive programmable infrastructure for computer science research. Similar but less-well-known projects include Ophelia [6] and Fed4Fire [7]. The former was implemented with the support of the EU, while the latter was supported by 17 companies from eight European countries. There are also others that provide an environment for computational experiments regardless of the applied field [8–10]. However, all these projects have several significant disadvantages:

• poor integration between HPC-S and HPC-C, which does not allow automatic selection of an effective computing resource;

• the inability to build a chain of services for conducting experiments;

• the impossibility of automatically transferring existing software to a new environment;

• resource scheduling that does not account for the scaling of service performance;

• lack of DCN control and management services such as monitoring and BoD services.

The basic technologies for the MC2E project in the development of a virtual infrastructure for interdisciplinary research were software-defined networking (SDN) and network function virtualization (NFV) [11]. These technologies make it possible to increase the level of abstraction of resources and ensure their consistent optimization and automation of infrastructure management. Instead of separate resources, users get virtual infrastructures fully equipped with the necessary resources—computing power, communication channels, and storage systems—with guaranteed performance and quality of service (QoS) based on a service level agreement (SLA).

In fact, the MC2E project used two cloud environments: Docklet [12] and Cloud Conductor (C2) [13]. The goal of Docklet is to provide a personalized workspace in the cloud [14]. Docklet is based on the LXC virtual cluster [15]. Docklet users interact directly in their workspace, using the browser to develop, debug, and test their software with the help of the tools that the environment provides. Docklet makes it possible to build customized small virtual datacenters by creating virtualized clusters and then providing experimenters with a customized workspace in the cloud. Users only need a modern browser to access their federated workspace from anywhere on the Internet, anytime.

The architecture of the C2 platform is based on the reference implementation of the ETSI NFV MANO model [13]. C2 provides full support for the life cycle of virtualized services (VSs): initialization, configuration, execution, and deinitialization, which is carried out using so-called templates in the TOSCA language [13]. All that is needed to install and start a VS is a TOSCA template, which includes a description of the structure of the cloud application; the application management policy; and the image of the operating system and scripts for starting, stopping, and configuring the application that implements the VS. The TOSCA template is packaged by the service administrator as a zip or tar archive.
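For illustration only, here is a schematic Python rendering of the kind of information such a template carries; the field names are invented and do not reproduce the actual TOSCA schema or the C2 template format, where the template is written in TOSCA and packed into a zip or tar archive together with the image and scripts.

```python
# Schematic only: field names are illustrative, not the real TOSCA/C2 schema.
import json

vs_template = {
    "description": "structure of the cloud application (virtualized service)",
    "topology": {
        "node": "mpi-worker",
        "flavor": {"vcpu": 1, "ram_mb": 1024},
        "image": "ubuntu-16.04-mpi.qcow2",            # operating system image
    },
    "management_policy": {"scaling": "manual", "restart_on_failure": True},
    "lifecycle_scripts": {                             # starting, stopping, configuring the VS
        "configure": "scripts/configure.sh",
        "start": "scripts/start.sh",
        "stop": "scripts/stop.sh",
    },
}

print(json.dumps(vs_template, indent=2))
```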

Thus, a federated environment has the following advantages:

• the simplicity and speed of allocation, customization, and scaling of resources;

• application development as a chain of services based on the NFV technology;

• the unification of infrastructures of different research groups and the coordination of their access policies;

• automated resource scheduling to fulfill user requests based on access policy and SLA requirements;

• a templating language for describing applications that allows one to abstract away from low-level system details;

• a decentralized resource accounting system for mutual settlements between project participants;

• ample opportunities for tracking and monitoring experiments;

• a high efficiency of network virtualization thanks to the technology of software-defined networks (SDNs), which makes it possible to configure virtualized network channels for each specific experiment;

• a common specification language required for porting existing research software to the MC2E environment.

Let us list the main components or subsystems of the MC2E environment:

• meta-cloud, which orchestrates custom applications and handles their distribution and scheduling among the federates;

• interface, which provides users with a unified API for submitting their applications and provides the federation administrator with means of managing and controlling federation resources;

• network, which regulates the use of network resources and provides the BoD service;

• monitor, which monitors and records the resource consumption of all federates in the MC2E;

• quality of service and administrative control and management, which ensure compliance with the resource use policy based on user requirements (SLA) and guarantee the reliability of the resources.

These components are described in detail in [16]. The general process for executing a user request in MC2E is as follows (a schematic sketch of these steps is given after the list):

(1) Using a single MC2E interface, the user sends the application and data to the front-end server.

(2) The front-end server calls the meta-cloud scheduler and monitor to select a federate to run the application.

(3) The meta-cloud analyzes the queue and predicts application execution times and data transfer times for all available federates.

(4) Based on the forecast, the meta-cloud chooses a federate that minimizes the total time the application spends in the system (the PTS criterion is the waiting time in the queue + the execution time).

(5) The meta-cloud contacts the network control loop to establish a channel to the federate selected in step 4 and sends the application and its data there.

(6) The federate launches the application and returns the results to the user.

(7) In the event of a failure, the QoS monitoring and management subsystem redirects the application, via the meta-cloud, to another federate.
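A minimal sketch of steps 2–4, the part of the flow where the federate is chosen by the PTS criterion; the federate names, time estimates, and helper functions are assumptions introduced for illustration and are not the actual MC2E interfaces.

```python
# Illustrative sketch of steps 2-4; names and numbers are invented.

def predict_pts(app: str, federate: dict) -> float:
    """Step 3: forecast waiting, execution, and data-transfer times for a federate."""
    return federate["queue_s"] + federate["exec_s"][app] + federate["xfer_s"]

def choose_federate(app: str, federates: list) -> dict:
    """Step 4: pick the federate minimizing the predicted time in the system."""
    return min(federates, key=lambda f: predict_pts(app, f))

federates = [
    {"name": "HPC-S federate", "queue_s": 7200, "xfer_s": 120, "exec_s": {"cg.C.64": 600}},
    {"name": "HPC-C federate", "queue_s": 300, "xfer_s": 480, "exec_s": {"cg.C.64": 1500}},
]

best = choose_federate("cg.C.64", federates)
print("submit to", best["name"])   # steps 5-6: lay the channel, transfer the application, run it
```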

THE CLOUD AS A HIGH-PERFORMANCE COMPUTING ENVIRONMENT

Even though the computing speed in cloud environments is lower than in specialized server clusters or supercomputers [17], this technology is becoming increasingly popular and is being chosen as a platform for high-performance computing due to its low cost and ease of access. Several articles [18, 19] show that one of the main bottlenecks in HPC-C performance is the delay in the data transmission network (DTN) of the data center. Supercomputers use fast DTNs based on specialized interconnects [20, 21], whereas HPC-Cs mainly rely on conventional Ethernet. DTN delays can lead to underutilization of processors (CPUs) by applications with a high data exchange rate, since such applications can experience long “pauses” in computation while waiting for data transfers over the network between application branches.

One of the important results of the MC2E project is the experimental substantiation of the hypothesis that the performance of HPC applications may degrade only slightly when CPU cores are shared. This hypothesis was tested within the MC2E project using the HPC benchmark suite NAS Parallel Benchmarks (NPB) [22] in a cloud environment of a mini data center with a QEMU/KVM hypervisor and 64 virtual machines (VMs) (Ubuntu 16.04, 1 vCPU, 1024 MB RAM) running MPI version 3.2. The master server hosted 16 VMs, and the remaining servers hosted eight VMs each. The average latency between different virtual machines was about 400 μs. The bandwidth was 18.2 GB/s within one server and 5.86 GB/s between different servers.

A series of experiments was conducted to investigate the effect of the network bandwidth on CPU utilization. Five NPB MPI programs with 2, 4, 8, 16, 32, and 64 processes (one process per virtual machine) were launched in turn in a network with three channel bandwidths: 100 MB/s, 1 GB/s, and 10 GB/s. The experiments showed that CPU utilization degrades almost linearly as the number of MPI processes grows. This happened because different MPI processes were executed on different VMs, and data were transmitted between different servers. CPU utilization also dropped when the program was running on the same physical server (2, 4, and 8 CPUs). This drop in CPU utilization is exactly what makes it possible to share the cores of the same CPU between different programs. This series of experiments, with graphs, is presented in detail in [16].

Another series of experiments examined the ability of different HPC applications to share cores of the same CPU. The experimental methodology was as follows: five pairs of identical NPB MPI programs with the same number (2, 4, 8, 16, 32, and 64) of processes in each were studied sequentially. To assess the ability of a program to share CPU resources, the following metric was used:

$$\text{Queue metric} = \frac{T_{\text{pure}}^{1} + T_{\text{pure}}^{2}}{\max\!\left(T_{\text{sharing}}^{1}, T_{\text{sharing}}^{2}\right)},$$

where \(T_{\text{pure}}^{i}\) (i = 1, 2) is the runtime without resource sharing and \(T_{\text{sharing}}^{i}\) (i = 1, 2) is the runtime when the two programs were using the same cores of the same CPU.

Obviously, if the metric value is greater than 1, the execution of two programs launched at the same time will take less time than when they are launched sequentially. The experiments showed that even in an environment with a slow network (100 MB/s), one can get up to 20% faster execution. However, not all programs can efficiently share physical resources. This series of experiments is described in detail in [16].
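As a minimal illustration of how the metric is applied (the timings below are placeholders, not measured values):

```python
# Queue metric from the formula above; the timings are placeholders.

def queue_metric(t_pure_1: float, t_pure_2: float,
                 t_sharing_1: float, t_sharing_2: float) -> float:
    """> 1 means co-scheduling the two programs beats running them one after another."""
    return (t_pure_1 + t_pure_2) / max(t_sharing_1, t_sharing_2)

m = queue_metric(t_pure_1=100.0, t_pure_2=110.0, t_sharing_1=160.0, t_sharing_2=175.0)
print(f"queue metric = {m:.2f} -> {'co-schedule' if m > 1.0 else 'run sequentially'}")
```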

PREDICTING PROGRAM EXECUTION TIME

Recall that one of the main criteria for the efficiency of a cloud computing environment for HPC applications is the time spent by the program in this environment (PTS criterion). This value depends on the algorithms for resource allocation in the cloud environment (mapping of virtual computers to physical ones) and the discipline of servicing the program queue, considering the heterogeneity of physical computers.

From the analysis of the user request execution process presented at the end of the “MC2E Structure” section, we can see that stages 3 (metaplanning) and 4 (local planning) are the optimization points for the PTS criterion. An important part of the MC2E project was the research and development of an algorithm for selecting the most efficient computer in the federation for a specific program by the criterion of minimum execution time, an important component of the PTS criterion. To this end, several algorithms were studied and developed for predicting the execution time of a program on a certain set of computers, taking into account the history of program executions on different computers (for a detailed description of the algorithms, see [23]).

The problem of predicting the execution time of a program on a certain computer is well known and classical. For example, the execution time of a program on a particular computer, as well as its waiting time in a queue, can be predicted from the history of its runs on this computer [24–27], for which one can use many extrapolation algorithms, e.g., [28, 29], regression [30], or more complex algorithms, in particular, an ensemble of decision trees (random forest) [31]. The main disadvantage of these algorithms is that they can only be applied to the same computer. Yet the essence of this task in MC2E was to predict the execution time of a program on a certain set of computers. Of course, the above-mentioned algorithms can be used to estimate the execution time of programs on several computers. However, this requires the history of launches of each program on each computer in the set, and this information is unavailable.

The choice of the computer is carried out as follows. One of the well-known algorithms [24–27] estimates the execution time of a program from the histories of its execution on each computer from a certain set. An example of such a history can be the program execution trace [32]. Based on the data obtained, one can either build a schedule for a group of programs or, following a greedy strategy, send each program to a computer where it has the minimum execution time. However, this requires all the execution histories of all programs on each of the computers in the set under consideration. As a rule, there is no such information.

One of the important and new results of the MC2E project is the development of a method for predicting the program execution time, which makes it possible to weaken the requirement for data on all execution histories of all programs on each of the computers in the set under consideration: to predict the program execution time on computers from a certain set, only the execution histories of this program on some of them are necessary (for a detailed description of the method, see [23]). In other words, there is no need to run every program on every computer.

The main idea of the new approach to predicting the program execution time is that the problem under consideration is similar to the one solved by recommendation systems, or recommender systems (RSs) [33]. An RS is a subclass of information filtering systems that seek to predict the rating or preference a user would give to some object [34]. Examples of such objects are films, books, or other goods. Thus, a recommender system restores and predicts the relationships between users and products based on individual user preference ratings.

An RS operates on a rating matrix whose rows (or columns) correspond to movies, books, or goods and whose columns (or rows) correspond to users. This matrix is usually sparse because there are so many users and items that no user can physically evaluate all of the items in question. The system tries to predict the preferences of each user for each item based on the individual user ratings for some of the items. In other words, the RS must fill in the empty cells of the rating matrix. In these terms, consider the following analogy: users are computers, items are programs, and user ratings are program execution times. Thus, the computers “evaluate” the programs, and the lower the rating (execution time), the better.

As a result, the problem of predicting the program execution time was reduced to the problem of filling the empty cells of the “Program–Computer” matrix built for a given set of programs and a given set of computers. Each cell of this matrix holds the execution time of a specific program with a specific data set on a specific computer.

Clearly, the prediction accuracy depends on the number of known execution histories of programs on the computers from the set. We investigated two approaches to this problem. The first was based on grouping computers by the Pearson correlation [35]; it was shown in [23] that this approach is expedient in the case of a densely filled (at least 95% of cells) “Program–Computer” matrix.
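The following is a minimal sketch of the grouping idea for a dense matrix: compute pairwise Pearson correlations between the rows of execution times and group strongly correlated computers. The numbers and threshold are invented, and the actual algorithm in [23] is more elaborate.

```python
import numpy as np

# Dense "Program-Computer" matrix: rows are computers, columns are programs,
# entries are execution times (invented numbers).
times = np.array([
    [120.0,  80.0, 300.0,  45.0],
    [130.0,  85.0, 310.0,  50.0],
    [ 60.0,  40.0, 150.0,  25.0],
])

corr = np.corrcoef(times)          # pairwise Pearson correlation between computers (rows)

THRESHOLD = 0.99                   # invented grouping threshold
pairs = [(i, j) for i in range(len(times)) for j in range(i + 1, len(times))
         if corr[i, j] >= THRESHOLD]
print("strongly correlated computer pairs:", pairs)
```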

The second approach, developed for a sparse “Program–Computer” matrix, relies on decomposing it into vector representations of computers and programs, so-called embeddings. Embeddings are elements of a relatively small space into which larger vectors can be transformed. The embedding technique simplifies machine learning for big-data cases such as sparse vectors representing words. Ideally, it partially reflects the semantics of the data by placing semantically similar objects close to each other in the embedding space [36]. In [23], it is shown in detail how program and computer embeddings are used to predict the execution time of a program on a particular computer. The technique of decomposing the “Program–Computer” matrix to compute the embeddings is given in [37].
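Below is a minimal sketch of the decomposition idea under simplified assumptions: learn dimension-1 embeddings of computers and programs (plus biases) from the observed cells only, by plain stochastic gradient descent, and use their product to fill the empty cells. The data, loss, and hyperparameters are invented; the actual decomposition technique is described in [23, 37].

```python
import numpy as np

rng = np.random.default_rng(0)

n_computers, n_programs, dim = 4, 6, 1       # dimension-1 embeddings, as in the text
# Observed cells of the "Program-Computer" matrix: (computer, program, log execution time).
observed = [(0, 0, 4.8), (0, 2, 5.7), (1, 0, 4.9), (1, 3, 3.9),
            (2, 1, 4.1), (2, 4, 6.0), (3, 2, 5.2), (3, 5, 4.4)]

C = rng.normal(scale=0.1, size=(n_computers, dim))   # computer embeddings
P = rng.normal(scale=0.1, size=(n_programs, dim))    # program embeddings
bc = np.zeros(n_computers)                           # computer biases
bp = np.zeros(n_programs)                            # program biases
mu = np.mean([t for _, _, t in observed])            # global mean

lr, reg = 0.05, 0.01
for _ in range(2000):                                # plain SGD over the observed cells
    for i, j, t in observed:
        err = t - (mu + bc[i] + bp[j] + C[i] @ P[j])
        bc[i] += lr * (err - reg * bc[i])
        bp[j] += lr * (err - reg * bp[j])
        C[i], P[j] = (C[i] + lr * (err * P[j] - reg * C[i]),
                      P[j] + lr * (err * C[i] - reg * P[j]))

# Fill an empty cell: predicted (log) execution time of program 5 on computer 0.
print(float(mu + bc[0] + bp[5] + C[0] @ P[5]))
```

With dim = 1, each computer and each program receives a single scalar embedding, which is what induces the total order discussed below.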

The proposed approach to predicting the program execution time requires only minimal knowledge about the program, which is routinely collected on all modern computers. Another important advantage of the approach is the ability to decompose the “Program–Computer” matrix so as to obtain embeddings of dimension 1 for both programs and computers. This makes it possible to impose a total order on both computers and programs, which greatly simplifies the selection of a computer with an effective execution time. For more details on testing the prediction methods, see [23].

“BANDWIDTH ON DEMAND” (BoD) SERVICE

As was noted at the beginning of this article, one idea investigated in the MC2E project was the problem of dynamically creating a channel between data centers on demand with a given communication quality. An environment designed for interdisciplinary research should have flexible and powerful mechanisms for allocating, scheduling, and administering network resources. Otherwise, the overhead costs of network resources in the data center network will be very high since the load on this resource is sporadic. Naturally, the question arose about the possibility of creating a BoD service capable of forming a channel of proper quality between data centers on demand, that is, transmitting a certain amount of data over a certain time interval through the TCP/IP transport network. It should be emphasized that such a service does not imply a dedicated channel between the interacting parties. Moreover, it should be created dynamically by aggregating existing network resources.

Article [38] details the protocols for aggregating network resources and algorithms for creating a BoD service. Only a logical diagram of the approach proposed in the article cited is presented here. The creation of a BoD service can be divided into two main parts: the identification and aggregation of routes in the data center network and the distribution of data flows between them according to the BoD quality-of-service requirements for each flow.

The words building and aggregating routes mean that, to transfer the data flow between data centers, different routes are used at the same time and several physical lines are aggregated into a logical one. There are many protocols and technologies that make this possible [38]. However, they all leave the problem of quality of service unsolved.

To build routes, it is necessary to solve several problems.

First, how many and what routes are needed to meet the BoD quality-of-service requirements? The routes should have minimal overlap in links and switches/routers. This limitation stems from the way congestion control algorithms work at the transport layer of the network. The maximum number of edge-disjoint paths between two vertices of a graph is given by Menger’s theorem [39], which states that the largest number of edge-disjoint routes from a vertex u to a vertex v equals the smallest number of edges whose removal separates u from v (a minimum u–v edge cut).
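As a minimal sketch of how this number can be computed, the snippet below counts edge-disjoint routes between two data centers as a maximum flow with unit edge capacities, which is exactly the quantity given by Menger's theorem; the toy topology is invented, and this is not the route-selection algorithm used in the project.

```python
from collections import defaultdict, deque

def edge_disjoint_route_count(links, s, t):
    """Number of edge-disjoint s-t routes = max flow with unit edge capacities (Menger)."""
    cap = defaultdict(lambda: defaultdict(int))
    for u, v in links:                       # undirected links, capacity 1 in each direction
        cap[u][v] += 1
        cap[v][u] += 1
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:     # BFS for an augmenting path
            u = queue.popleft()
            for v in list(cap[u]):
                if v not in parent and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        v = t
        while parent[v] is not None:         # push one unit of flow along the path found
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1

# Toy DCN topology: two edge-disjoint routes exist between data centers A and D.
links = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D"), ("B", "C")]
print(edge_disjoint_route_count(links, "A", "D"))   # -> 2
```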

Route intersections are not always critical. For example, if a physical line at an intersection has sufficient capacity to meet the quality-of-service requirements of all flows passing through it, then this line, although shared by several routes, is not a critical point. It may also happen that the network topology offers no disjoint alternative routes between the source and the destination. However, if there are high-bandwidth physical links, the network topology graph can be transformed so that the edge corresponding to such a line is replaced by several edges of lower bandwidth. An alternative is to search for routes with the fewest intersections, as is done by the Min Cost Max Flow (MCMF) algorithm [40].

Upon finding the value k, that is, the number of nonoverlapping routes, the network topology graph is processed by a special algorithm to determine k routes between the source and the destination. It turned out that not every algorithm is suitable for this; for example, the greedy algorithm [41] does not solve this problem. The MCMF algorithm [42] turned out to be the right choice: it reduces the original problem to finding the maximum network flow. As a result, a set of routes with sufficient resources to provide the BoD service is identified.

The second question is how to distribute the resources of these routes when transmitting an application flow, that is, how to solve the AFLD (Application Flow Load Distribution) problem. It was divided into three parts: resource estimation, resource allocation, and the implementation of the computed allocation over the routes found. The solution of the first part answers the question of whether the available bandwidth of the identified routes is sufficient to provide the required quality of the BoD service. If it is, then the solution of the second part answers the question of how to distribute the load of the application data flow among these routes. The third part deals with multipath protocols, that is, protocols that use several routes at the same time. These protocols split the application flow into multiple transport subflows, each of which uses its own route.

There are two main approaches to multipath routing at the transport level: static and dynamic. MPTCP [42] is a static approach, which assumes an a priori allocation of a certain number of transport subflows (that is, routes) among which segments of the application flow are distributed. A dynamic approach, such as FDMP [43], allocates a route for a subflow dynamically at the request of the transport agent, depending on how well the total bandwidth of the current set of subflows matches the quality-of-service requirement.

To solve the AFLD problem, a mathematical model was developed for multipath data transfer on demand between data centers under the following assumptions:

• to use the BoD service, a pair of data centers must conclude a contract, which stipulates the maximum allowable volumes of transmitted data and the maximum allowable time for this, the quality of communication, etc.;

• under the same contract, it is impossible for two or more requests for the BoD service to appear simultaneously;

• the probability distribution of the flow of service requests under each contract is known.

Based on this model, the AFLD problem was formulated as a discrete-time integer linear programming (ILP) problem [16]. Its solution provided answers to the following questions: Is the available bandwidth on the identified routes sufficient to satisfy all request flows for a given set of contracts over a certain time interval? How should the bandwidth of each route be allocated among the on-demand flows?
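To convey the flavor of the formulation, here is a schematic single-interval version of such an ILP; the notation is illustrative, and the full discrete-time model with contracts and request flows is given in [16]. Let R be the set of identified routes with capacities \(c_r\), F the set of on-demand flows with required rates \(d_f\), \(x_{f,r} \ge 0\) the integer amount of bandwidth of route r allocated to flow f, and \(y_f \in \{0,1\}\) an indicator that request f is admitted:

$$\max \sum_{f \in F} y_f \quad \text{subject to} \quad \sum_{f \in F} x_{f,r} \le c_r \;\; (r \in R), \qquad \sum_{r \in R} x_{f,r} \ge d_f\, y_f \;\; (f \in F).$$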

A prototype of the above approaches to the BoD service was built on Juniper VMX routers using VPN technology and was tested between the data centers of the Faculty of Computational Mathematics and Cybernetics of Moscow State University and the data centers of Peking University [3]. In April 2021, a pilot project implementing the BoD service between data centers in Moscow and Novosibirsk was successfully completed [44].

* * *

The international project MC2E is aimed at investigating methods for constructing an environment for academic interdisciplinary research. The MC2E environment was built as a federation of local computing environments called federates. Each federate is a high-performance cluster—either a data center or a supercomputer. The advantages of the proposed method include the following:

• a high level of resource management and flexible options for defining virtual environments;

• the opportunity to use the existing software of the researchers;

• high-quality scheduling and efficient resource use according to the PTS criterion;

• freeing the user from routine system administration tasks, as well as defining a unified way to describe the life cycle of a virtualized service in a data center (or in an HPC cluster).

In the course of the project, the hypothesis was experimentally substantiated that it is possible to reduce the average execution time of programs in the cloud. Experiments have shown that, for certain classes of applications, one can reduce the time by up to 20%.

A new solution to the problem of choosing a suitable computer for an MPI program in a heterogeneous environment based on the PTS criterion has been developed. To this end, a new approach to predicting the execution time of an MPI program on a computer has been proposed, even if the program has never been executed on that computer. Two algorithms have been constructed and analyzed: one based on grouping computers using the Pearson correlation (for a dense “Program–Computer” matrix) and one based on the decomposition of such a matrix, which makes it possible to obtain vector representations (embeddings) of programs and computers (for a sparse “Program–Computer” matrix).

Note that the proposed approach to predicting the program execution time requires only a minimal, generally available set of data on the program execution history. Its other important advantage is that, as a result of the matrix decomposition, embeddings of dimension 1 are obtained both for programs and for computers. This makes it possible to establish a total order both on a given set of computers and on a given set of programs, which greatly simplifies the choice of a computer with an effective execution time. As a hypothesis, the project formulates the assumption that this method of predicting the execution time can be applied not only to MPI programs.

In addition, in the course of the MC2E project, a solution for building the BoD service was proposed and investigated. This approach is divided into the problems of route aggregation and application flow distribution. Various options for solving the route aggregation problem were studied, depending on the network topology and the capabilities of the network equipment. The problem of distributing application flows among the aggregated routes was formulated and solved as an ILP problem.

It would be naive to believe that the results presented fully cover all the problems arising in the creation of information and computing environments for interdisciplinary research. Let us list just a few of those that remained outside the scope of the MC2E project: automation of launching programs on various computers in an environment that contains both HPC-C and HPC-S resources; coordinated management of data and network resources in real time; monitoring and analytics, administration, and security in such environments [45]; accounting of consumed resources; settlements between the members of the federation; and the use of edge computing technologies [46].