1 Introduction

Modern distributed storage systems have to deal with a growing number of files [18, 39, 51, 54], and with an increasing use of huge directories holding millions or billions of entries accessed by thousands of clients at the same time [2, 7, 18, 39]. To manage both problems (or, at least, the growing number of files), some file systems use a small cluster of specialized metadata servers [13, 45, 49, 53], while others plan to provide a similar service shortly [44].

Our Fusion Parallel File System (FPFS) uses Object-based Storage Devices+ (OSD+) [3] to implement such a metadata cluster. OSD+s are improved Object-based Storage Devices (OSDs) that, in addition to handling data objects as traditional OSDs do, can also manage directory objects. Directory objects are a new type of object able to store file names and attributes, and to support metadata-related operations, such as the creation and deletion of regular files and directories. Using these OSD+ devices, an FPFS metadata cluster is as large as its corresponding data cluster, and effectively distributes metadata among as many nodes as OSD+s comprising the system. OSD+s are implemented through a thin software layer on top of existing mainstream computers, which leverages many features of the underlying file system. Thanks to this approach, OSD+s have a small overhead and provide high performance. Indeed, FPFS is able to create, stat and delete thousands of files per second with a few OSD+ devices [3]. FPFS also supports huge directories by dynamically distributing them among several OSD+s in the cluster. The OSD+s storing a distributed huge directory work independently of each other, thereby improving the performance and scalability of the file system.

Despite its great performance, FPFS shares with other distributed file systems one of their limiting factors: the interconnect. The network latency and the overhead introduced by the processing of messages and packets limit the number of metadata operations per second that a server can dispatch. Interconnect characteristics also affect data operations but, since applications can issue large data transfers, bandwidth is the main limiting factor there. Therefore, to increase the metadata performance, we can use a better interconnect or reduce the network processing overhead.

In this paper, we propose to reduce the processing overhead per message by sending several metadata operations to a server in a single request that we call a batch operation (or batchop for short). Batching is not a new idea, and it has been used extensively in network systems (see Sect. 6). However, to the best of our knowledge, this is the first time that it is applied in the domain of parallel file systems. Batchops leverage the directory objects of FPFS and embed hundreds to thousands of entries of the same directory (i.e., the same directory object) into a single message to perform a given operation on all of them. Applications use batchops through new POSIX-like functions (openv, statv, etc.), following the idea that many exascale challenges need to be faced with APIs beyond POSIX [12].

We also show how batchops are integrated into FPFS and, especially, into its distributed huge directories. These directories make the design and implementation of batchops difficult because a regular directory can become distributed during a batch request, and because we should take advantage of the distribution of huge directories to implement batch operations on them efficiently.

Batchops are possible in FPFS because its namespace is distributed based on directories, which usually contain related files. Therefore, batchops are particularly useful for applications that need to create, get the status information of, or delete thousands or millions of entries in the same directory. For instance, applications that use a directory as a light-weight database [42], and operations like ls -l or rm -fr, can significantly benefit from batchops. But batchops are also useful for parallel file systems that need to migrate or distribute directories (hence, moving a large number of directory entries), as FPFS does. Note, however, that file systems that distribute their namespaces by means of other strategies, such as file hashing [9, 29, 52, 55], make operations like batchops difficult, if not impossible.

Batch operations make a much more efficient use of the network, shifting the bottleneck from the network to the servers in many cases. With batchops, FPFS improves its metadata performance by saving network delays and round-trips, and by reducing the number of messages, which, in turn, can mitigate a possible network congestion.

The present work contributes an extensive set of experimental results for batchops on FPFS when using different backend file systems (Ext4 and ReiserFS) and devices in a Linux environment. Specifically, since a metadata service’s performance largely depends on the number of IOPS supported by the underlying storage [18], we have compared results obtained by hard disks with those achieved by “seek-free” SSD devices.

Results show that batchops can increase the performance of FPFS by at least 50 % when creating a single shared directory, achieving a 100 % improvement in some cases. For the stat operation, the improvement provided by batchops is always around 25 %. Finally, when deleting files, the backend file system determines performance to a large extent: Ext4 is the one that best leverages batchops, with an improvement of 60 %, while ReiserFS obtains a 23 % improvement when using this kind of operation. In absolute terms, batchops allow FPFS to create, stat and delete around 200,000, 300,000 and 200,000 files per second, respectively, with just 8 SSD-OSD+ devices (i.e., OSD+ devices supported by SSD drives) and a regular Gigabit interconnect. Unfortunately, other available parallel file systems, such as Ceph [53], Lustre [36] or OrangeFS [49], do not support batch or bulk operations, so we have not compared FPFS with them.

The rest of the paper is organized as follows. Section 2 describes the FPFS architecture. Section 3 details how batchops are designed and implemented. Results are provided in Sects. 4 and 5. Related work is described in Sect. 6. Finally, Sect. 7 concludes the paper.

2 Architecture of FPFS

Generally, parallel file systems have three main components: clients, data servers and metadata servers. Data servers are usually OSD [31] or OSD-alike devices that export an object interface. Metadata servers, however, frequently implement customized interfaces and permanently store metadata in private storage devices [8] or in objects allocated in the data servers themselves [53].

Unlike these file systems, FPFS [3] uses a single kind of server that acts as both a data and a metadata server (see Fig. 1), thereby enlarging the metadata cluster's capacity, which becomes as large as the data cluster's. To merge data and metadata servers into a single one, FPFS uses OSD+ devices, which are enhanced OSD devices capable of managing not only data (as common OSDs do) but also metadata. These new devices also reduce the complexity of the storage system, since no distinction between two types of servers is made. In addition, having a single cluster increases the system's performance and scalability.

Fig. 1. FPFS's overview. Each OSD+ supports both data and metadata operations

2.1 OSD+

OSDs implement not only low-level block allocation functions, but also more complex tasks by taking advantage of their intelligence [41, 53]. OSD+s leverage this intelligence too, taking it a step further by delegating metadata management to storage devices.

Traditional OSDs deal with data objects that support operations like creating and removing objects, and reading from and writing to a specific position in an object. Our design extends this interface to define directory objects, capable of managing directories. OSD+ devices support metadata-related operations like creating and removing directories and files, or getting directory entries. In addition to the usual operations on directories, OSD+s also provide functions to internally deal with metadata operations that may involve the collaboration of several OSD+s (e.g., renames, directory permission changes, and links).

Currently, there exist no commodity OSD-based disks, so mainstream computers exporting an OSD-based interface through emulators [1], or other software elements, are usually used. Internally, a local file system stores the objects; we take advantage of this by directly mapping operations in FPFS to operations in the local file system.

Each OSD+ is composed of a user-space multithreaded process and a conventional file system. The user-space multithreaded process uses the file system as storage backend. Figure 2a shows the layers that compose an OSD+. The local file system must be POSIX-compliant and support extended attributes (used by our implementation). The Linux syscall interface is used to access the local file system.

Fig. 2. (a) OSD+ layers. (b) Mapping of an FPFS namespace to a 4-OSD+ cluster

2.2 Clients

Clients are the processes accessing FPFS. For fast prototyping and evaluation, the current implementation entirely runs clients in user-space. There exists an FPFS library (libfpfs) that clients use to issue requests. This approach is similar to that used by PVFS2/OrangeFS [49] and other file systems [44].

FPFS establishes communications between clients and OSD+s via TCP/IP connections and request/reply messages. Each OSD+ launches one thread per client to attend to its requests and to perform operations on the local disk on its behalf. When an operation involves several OSD+s, the OSD+ contacted by a client carries out the operation transparently to the client (see Sect. 2.4).

2.3 Namespace distribution

FPFS distributes directory objects (and so the file-system namespace) across the metadata cluster to make metadata operations scalable with the number of OSD+s, and to provide a high-performance metadata service. For the distribution FPFS uses the deterministic pseudo-random function CRUSH [53]:

$$\begin{aligned} oid = CRUSH(hash(dirname)). \end{aligned}$$
(1)

CRUSH receives the hash of a directory’s full pathname as input and returns the ID of the OSD+ containing the corresponding directory object as output. This allows clients to directly access any directory without performing a path resolution.
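To make the mapping concrete, the following minimal C sketch resolves a directory to its OSD+ following Eq. 1. It is an illustration only: hash_path() is a simple stand-in hash, and crush_select() replaces the actual CRUSH computation (which walks a cluster map) with a plain modulo; none of these names belong to the FPFS code.

    #include <stdint.h>

    /* FNV-1a string hash, used here as a stand-in for hash(). */
    static uint64_t hash_path(const char *dirname)
    {
        uint64_t h = 1469598103934665603ULL;
        for (const unsigned char *p = (const unsigned char *)dirname; *p; p++) {
            h ^= *p;
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* Stand-in for CRUSH: the real algorithm walks the cluster map,
     * while this sketch simply maps the key onto one of num_osds devices. */
    static int crush_select(uint64_t key, int num_osds)
    {
        return (int)(key % (uint64_t)num_osds);
    }

    /* Eq. 1: oid = CRUSH(hash(dirname)) */
    int fpfs_dir_object_osd(const char *dirname, int num_osds)
    {
        return crush_select(hash_path(dirname), num_osds);
    }

Because the placement function is deterministic, every client computes the same OSD+ for a given pathname without contacting any central server.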

We choose CRUSH because it guarantees a probabilistically balanced distribution of objects through the system. However, FPFS does not depend on a particular distribution function, so other functions are also possible [33].

Hash-partition strategies present different scalability problems on cluster resizing, permission changes, and renames. FPFS addresses the first problem through CRUSH, which minimizes migrations and imbalances when adding and removing devices. FPFS manages renames and permission changes via lazy techniques [9]. Fortunately, these operations are infrequent for directories [3, 9], so they will not impact the overall performance.

2.4 Directory objects

A directory object is implemented as a regular directory in the local file system of its OSD+. In this way, any directory-object operation is directly translated to a regular directory operation. The full pathname of the directory supporting a directory object is the same as that of its corresponding directory in FPFS. Therefore, the directory hierarchy of FPFS is imported within the OSD+s by partially replicating its global namespace.

Internally, an OSD+ uses three types of directories, differentiated through extended attributes. These directory types can be seen in Fig. 2b, which shows how an FPFS directory hierarchy is mapped to a 4-OSD+ cluster. The first type (attribute o) is assigned to directory objects stored in the OSD+, i.e., objects that CRUSH and their full pathnames have assigned to the OSD+. The second type (attribute h) refers to empty directories created in a directory object; they represent subdirectories and allow FPFS to preserve the complete file-system hierarchy to provide standard directory semantics (e.g., scan). The third type (no attribute) is for directories used to support the paths of the directories implementing objects.

For each regular file that a directory has, the directory object stores its attributes, and the number and location of the data objects that store the content of the file. In our current implementation, these "embedded i-nodes" [19] are i-nodes of empty files. The number and location of a file's data objects are stored in extended attributes of its associated empty file. The exceptions are the size and modification time attributes of the file, which are stored at its data object(s), so the directory object does not keep this information.
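As an illustration of how such metadata can be attached to the empty file backing an entry, the sketch below stores a (hypothetical) layout record as a Linux extended attribute with setxattr(2); the attribute name user.fpfs.layout and the fpfs_layout structure are assumptions made for this example, not FPFS's actual on-disk format.

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/xattr.h>

    /* Hypothetical layout record: which data objects hold the file's content. */
    struct fpfs_layout {
        uint32_t num_objects;   /* number of data objects of the file        */
        uint32_t osd_ids[4];    /* OSD+s storing those objects (fixed-size)  */
    };

    /* Attach the layout to the empty file that backs the directory entry. */
    int fpfs_store_layout(const char *entry_path, const struct fpfs_layout *l)
    {
        if (setxattr(entry_path, "user.fpfs.layout", l, sizeof(*l), 0) != 0) {
            perror("setxattr");
            return -1;
        }
        return 0;
    }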

Therefore, an FPFS client first contacts the OSD+ storing the directory object of a target file to obtain its data layout. With that information, the client can send read/write operations to the OSD+s storing the corresponding data objects. The same procedure is followed by other parallel file systems.

Implementing directory objects by means of regular directories in a local file system has, at least, two important advantages. The first one is that the implementation is simpler and its overhead smaller, since most of the functionality is provided by the underlying file system. The second one is that, when a metadata operation is carried out by a single OSD+ (creat, unlink, etc.), the backend file system itself ensures its atomicity and POSIX semantics. Only for operations like rename or rmdir, which usually involve two OSD+s, do the participating OSD+s need to deal with concurrency and atomicity themselves, through a three-phase commit protocol (3PC) [46], without client involvement.

2.5 Huge directories

FPFS also implements management for huge directories, or hugedirs for short. These are directories with millions or billions of entries accessed by thousands of clients at the same time. Hugedirs are common for some HPC applications, such as those that create a file per thread/process [17, 35], and those that use a directory as a light-weight database (e.g., checkpointing) [40]. Hugedirs can become a bottleneck; therefore, they should be handled properly.

To efficiently manage hugedirs, FPFS proposes a dynamic distribution of their entries among multiple OSD+s [4]. FPFS considers a directory huge when it stores more than a given number of files. Once this threshold is exceeded, the directory is shared out among several nodes. The threshold can also be 0, thereby distributing a directory right from the start. This is useful for directories known to be huge.

The subset of OSD+s supporting a hugedir is composed of a routing OSD+ and a group of storing OSD+s. The former contains the routing directory object and is in charge of providing clients with the hugedir's distribution information. The latter have the storing directory objects that store the directory's content, and are the OSD+s contacted by clients aware of the directory's distribution. The routing OSD+ can also be part of the storing group if it keeps any of the directory's content. For small directories, the routing and storing objects are the same; hence, a directory object can play both roles.

A client unaware of the distribution of a directory contacts its routing object using Eq. 1, as it does with any other regular directory object. As a reply, the client receives the distribution list with the IDs of the routing and storing OSD+s. Then, the client retries the operation, but replaces the previous directory-level function with a file-level counterpart:

$$\begin{aligned} oid = osd\_set[hash(filename) \% osd\_count], \end{aligned}$$
(2)

where osd_set is the list of storing objects, and osd_count is the size of that list. As index, we use the value returned by a hash applied to the file name. The result is the storing OSD+ holding the file entry.
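A literal rendering of Eq. 2 in C is shown below; hash_name() is the same kind of stand-in hash used for the Eq. 1 sketch, and osd_set is the distribution list received from the routing OSD+.

    #include <stdint.h>

    /* FNV-1a string hash, a stand-in for the hash used by FPFS. */
    static uint64_t hash_name(const char *filename)
    {
        uint64_t h = 1469598103934665603ULL;
        for (const unsigned char *p = (const unsigned char *)filename; *p; p++) {
            h ^= *p;
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* Eq. 2: oid = osd_set[hash(filename) % osd_count] */
    int fpfs_hugedir_osd(const char *filename, const int *osd_set, int osd_count)
    {
        return osd_set[hash_name(filename) % (uint64_t)osd_count];
    }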

The distribution list of a hugedir is cached by the client accessing the directory. FPFS uses timestamps to detect when this cached information gets out of date due to the rename or deletion of a hugedir. In that case, clients clean up the cached information for the directory and retry the operation following Eq. 1. If the directory gets huge again, clients will receive a new distribution list and will change to Eq. 2.

3 Batch operations

In this section, we describe how FPFS increases the rate of metadata operations by sending several operations to a server in a single request. These new requests, which we call batch operations or batchops for short, are particularly useful for applications that concurrently handle thousands or millions of files.

A batchop embeds hundreds or thousands of entries of a directory in a single request to perform a given operation on all of them. A batchop is sent as a single network message. The message for a batchop includes the operation type, the directory name, the list of directory entries, and the operation parameters, whereas a regular operation includes the operation type, a full pathname, and the operation parameters. For creation operations, a batchop also includes a semantics field that indicates how to act on failures. Our current implementation considers two options: perform-all-operations and stop-on-failure. The first option tells a server to perform the operation for all the entries regardless of their outcomes. The second option tells a server to stop on the first failed operation. Both decisions are local to a server. Therefore, when a batchop is performed by several OSD+s (see Sect. 3.1), each one will locally apply the given semantics to the operations it has to carry out.

As an example, Fig. 3a shows the format of a regular create operation, while Fig. 3b depicts a batch create operation. The directory name is specified separately in a batch operation, since it is the same for all the entries.
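To fix ideas, the following C structure is one plausible layout for a batch create request; the field widths and the fixed-size name buffers are assumptions made for illustration, not the actual FPFS wire format.

    #include <stdint.h>

    enum fpfs_semantics { FPFS_PERFORM_ALL = 0, FPFS_STOP_ON_FAILURE = 1 };

    struct fpfs_batch_create_req {
        uint32_t op_type;       /* operation type (batch create)             */
        uint32_t semantics;     /* perform-all-operations or stop-on-failure */
        uint32_t mode;          /* operation parameters (creation mode)      */
        uint32_t num_entries;   /* how many names follow                     */
        char     dirname[256];  /* directory name, shared by all entries     */
        char     names[][256];  /* one file name per entry                   */
    };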

Fig. 3. Request messages for create operations

Once the message is received on the server’s side, the server performs the operation for all the specified entries over the corresponding directory by taking into account the semantics parameter. Operation results are batched as well, and the server sends this information back when all the operations are completed or the first operation fails, depending on the semantics parameter. Therefore, the semantics not only informs servers about how to perform the batchop, but also informs clients about the reply they will receive.

A reply message for a batchop includes three fields (see Fig. 4b): operation type, #errno and list of errnos. #errno is the number of performed operations, which is also the number of elements in the list of errno values. list of errnos contains the returned errno value for each performed operation (actually, the errno is returned as a negative number). The first value in this list corresponds to the operation on the first file in the batch request, the second value to the second file, and so on.

The reply of some batchops contains additional fields. This is the case for stat, where an operation returns not only the operation result but also information about the requested files. Therefore, the reply message for a stat includes two extra fields: #succ values and list of infos. #succ values is the number of successful operations. list of infos contains the stat information for each file, so the length of this list is the same as #succ values. Figure 4c depicts the format of this kind of reply messages.
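The reply formats can be sketched in the same spirit; again, the types below are assumptions rather than FPFS's real encoding, and in the stat variant the stat records (list of infos) would follow the errno list in the byte stream.

    #include <stdint.h>

    struct fpfs_batch_reply {
        uint32_t op_type;      /* operation type                              */
        uint32_t num_errno;    /* #errno: operations actually performed       */
        int32_t  errnos[];     /* one (negative) errno value per operation    */
    };

    struct fpfs_batch_stat_reply {
        uint32_t op_type;
        uint32_t num_errno;    /* #errno                                      */
        uint32_t num_succ;     /* #succ values: successful operations         */
        int32_t  errnos[];     /* followed by num_succ stat records (infos)   */
    };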

Fig. 4. Reply message format for regular operations and batchops

Operations supported by our current implementation of batchops are: openv, closev, statv and unlinkv. Their signatures appear in Fig. 5. All of them, except for closev, follow the message format previously described. In the case of closev, instead of a directory name and list of file names, we send a list of open file descriptors. The reply of closev follows the same format as the rest, with the lists of successful values and errno values.
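Since Fig. 5 is not reproduced here, the declarations below only suggest plausible shapes for these calls; the argument lists (output arrays for file descriptors, stat buffers and per-entry errno values) are assumptions, not the signatures actually exported by libfpfs.

    #include <sys/stat.h>
    #include <sys/types.h>

    /* Open/create 'count' files of 'dirname'; descriptors and per-entry
     * errno values are returned through the output arrays. */
    int openv(const char *dirname, const char *names[], int count,
              int flags, mode_t mode, int semantics, int fds[], int errs[]);

    /* Get the status of 'count' files of 'dirname'. */
    int statv(const char *dirname, const char *names[], int count,
              struct stat bufs[], int errs[]);

    /* Delete 'count' files of 'dirname'. */
    int unlinkv(const char *dirname, const char *names[], int count,
                int semantics, int errs[]);

    /* Close previously opened descriptors (no directory name is needed). */
    int closev(const int fds[], int count, int errs[]);

With an interface of this shape, an ls -l-like tool could, for instance, read a directory once and then issue a single statv for all its entries instead of one stat call per file.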

Fig. 5. Signatures of batchops supported by our current implementation of FPFS

As we have said, the interconnect can become a bottleneck in many parallel file systems. By introducing batchops, we significantly reduce the number of messages transmitted between the client and the server, which, in turn, reduces the number of packets transmitted through the TCP/IP stack. For instance, a batchop can create 8192 files in a directory with only 2 messages (one request and one reply) instead of 16384 messages (8192 requests and 8192 replies); the number of network packets transmitted will be significantly reduced as well, although it will depend on the size of the batchop messages and the limits imposed by the TCP/IP stack. Therefore, with batchops, we reduce the network traffic, optimize the network bandwidth, and reduce the network overhead due to the processing of packets and messages. Indeed, the improvement achieved by batchops is, to a large extent, due to the reduction in network time. Moreover, thanks to batchops, servers receive more work in each batchop message, so they can operate more efficiently, making a better use of caches, disks, etc. in many cases.

Finally, it is important to note that batchops are possible in FPFS because of the way it implements directories. Since every directory corresponds to a directory object (or a few directory objects if it is distributed), and every directory object is stored in a single OSD+, it is easy and makes sense to bundle several related file operations into a single message. Batchops, however, provide little (or no) benefit in other file systems, such as those that distribute the namespace by hashing file names, since every file of a directory can potentially be stored in a different server (hence, each file operation will be issued in a separate message), and files stored in the same server are probably not related.

3.1 Batchops over huge directories

As explained in Sect. 2.5, FPFS handles huge directories by distributing them among a group of OSD+ devices. Therefore, batchops on hugedirs have to be handled differently than on regular directories. To exploit the hugedir distribution, clients perform a batchop on a hugedir by sending batch messages in parallel to every storing OSD+ composing the hugedir. Each of those messages contains the directory entries of the original batch message that are stored on the destination OSD+. Once a client receives all the servers' replies, it sorts them in the same order in which they were initially requested. Note that an application does not need to know whether a directory is distributed or not in order to issue a batchop on it. The FPFS library (see Sect. 2.2) used by the application takes care of the distribution, and transparently performs the requests in parallel and reorganizes the replies when a directory is distributed, as the sketch below illustrates.
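A minimal C sketch of the client-side bucketing step follows; only the splitting is made explicit, the parallel send and the reply merge are summarized in the comments, and hash_name() refers to the stand-in hash shown for Eq. 2.

    #include <stdint.h>

    /* String hash used by Eq. 2 (see the earlier sketch). */
    extern uint64_t hash_name(const char *filename);

    /* Assign each entry of a batchop to a storing OSD+ (an index into the
     * distribution list). The library then builds one batch message per
     * non-empty bucket, sends the messages in parallel, and restores the
     * original order of the results when merging the replies. */
    void fpfs_split_batch(const char *names[], int count, int osd_count,
                          int bucket_of_entry[])
    {
        for (int i = 0; i < count; i++)
            bucket_of_entry[i] = (int)(hash_name(names[i]) % (uint64_t)osd_count);
    }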

We have explained that semantics are local to servers, and this is especially true for the stop-on-failure semantics on hugedirs. For these directories, since requests are sent in parallel to different servers, there is no way (at least, not without losing parallelism) of stopping the processing of requests on the servers when an operation fails in one of them. Therefore, the semantics must necessarily be local if we want to improve the performance. This design decision also means that a client has to process the whole list of errnos of the batchop reply to verify the return value of each operation.

To clarify the use of batchops on hugedirs, let us look at the example in Fig. 6. An application performs an open batch request (openv) to open sixteen files in the directory /home/usr3 (step 1), which is distributed. The FPFS library in the client is already aware of the distribution of the directory and has cached its corresponding distribution list (0 as routing, and 2, 4, 8, 10 as storing OSD+s) (step 2). The library composes four open batchop messages by calculating the storing OSD+ of each file through the distribution list and the distribution function of hugedirs (see Eq. 2). Next, the client sends those four batch requests in parallel to the storing OSD+s (step 3). Once the servers perform the operations, they send the client the batchop reply with the list of return values (step 4). In this example, we assume that the creation of files f7 and f12 fails. We also assume stop-on-failure semantics, so OSD+ 2 does not create file f14 after the failed creation of f12, and OSD+ 8 does not create files f10 and f15 after the failure on f7. Finally, the FPFS library in the client sorts the replies in the initial order in which they were requested by using, again, the distribution list and the distribution function (step 5). For the sake of simplicity, we have used 0 as the return value of a successful operation (it is actually a file descriptor), a negative value (\(<\)0) for a failed operation, and N for an operation that has not been performed due to the semantics.

Fig. 6. Example of a client requesting a batch open (openv) on a hugedir using semantics stop-on-failure. The creation of files f7 and f12 fails, so files f10, f14 and f15 are not created. Reply reflects the result

Our implementation of batchops also considers the case when a regular directory becomes huge during the processing of a batch request, particularly when such a batch request creates hundreds or thousands of files. We manage this situation at the storing OSD+ of the regular directory by processing one operation of the batch request at a time. If the directory gets huge, the processing stops, the OSD+ distributes the hugedir among the storing OSD+s, and a reply containing the results of the already completed operations of the batch request and the distribution list of the now huge directory is sent back to the client that issued the batch request. This client can then continue issuing more batch requests, which will proceed in parallel as we have already described.
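The following sketch captures that server-side behavior under stated assumptions: apply_one(), entry_count() and distribute_dir() are hypothetical helpers of the OSD+ process, and the reply is simplified to the two facts the client needs, namely how many operations completed and whether the directory is now distributed.

    #include <stdbool.h>

    /* Hypothetical helpers provided elsewhere in the OSD+ process. */
    int  apply_one(const char *dir, const char *name);   /* e.g., create one entry */
    long entry_count(const char *dir);
    void distribute_dir(const char *dir);

    struct partial_reply {
        int  completed;         /* operations performed before stopping        */
        bool now_distributed;   /* directory became huge during the batch      */
        /* the errno list and the distribution list follow in the real reply   */
    };

    struct partial_reply handle_batch(const char *dir, const char *names[],
                                      int count, long threshold)
    {
        struct partial_reply r = { 0, false };
        for (int i = 0; i < count; i++) {
            apply_one(dir, names[i]);
            r.completed++;
            if (entry_count(dir) > threshold) {  /* directory just became huge */
                distribute_dir(dir);             /* share it out among OSD+s   */
                r.now_distributed = true;
                break;                           /* client re-issues the rest  */
            }
        }
        return r;
    }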

4 Experiments and methodology

To analyze the performance of a metadata cluster of FPFS supporting batchops, we have run different benchmarks, and compared FPFS’ performance with and without batch operations. This section describes the system under test, the benchmarks run to carry out the analysis, and the objectives that our experiments pursue.

4.1 System under test

The testbed system is a cluster made up of 12 compute nodes and 1 frontend node. Technical specifications of each compute node are summarized in Table 1. Test disks support the OSD+ devices. We call an OSD+ device on a hard drive an HDD-OSD+, and an OSD+ device on an SSD drive an SSD-OSD+.

Table 1 Cluster nodes’ technical specifications

As I/O scheduler, CFQ is set for HDDs, whereas Noop is set for SSDs. CFQ has been the default I/O scheduler in the Linux kernel since 2.6.23. Noop usually achieves the best performance for SSDs compared to the other available Linux schedulers [14, 28].

Since metadata performance depends on the backend file system, we use Ext4 and ReiserFS as backend file systems; formatting and mounting options are also important [3]. We format the Ext4 file systems with options that set the journal size, the bytes-per-inode ratio and the i-node size, and that enable hashed directories, extents and uninitialized structures; they follow the options used by Lustre when formatting its metadata server [47]. In the case of ReiserFS, we set the journal to 32749 blocks (of 4 kB), which is its maximum allowed size when not on a separate device. Mount options are quite similar for both file systems, and try to increase the metadata performance obtained by each one. For Ext4, we use noatime, nodiratime and data=writeback, while we use notail, noatime and nodiratime for ReiserFS. We have not used the discard option in Ext4 for issuing trim commands to the SSD-OSD+s, since ReiserFS does not support this option.

The version of FPFS evaluated in this paper only supports metadata operations, since we focus on improving that kind of operations.

4.2 Benchmarks

To evaluate the performance of batchops in metadata operations, we use the following benchmarks:

  • Create: each process creates a subset of empty files in either shared or non-shared directories. This benchmark basically generates a write-only metadata workload.

  • Stat: each process gets the status of a subset of files in shared or non-shared directories. This is a read-only metadata workload (remember that noatime and nodiratime mount options are used).

  • Unlink: each process deletes a subset of files in shared or non-shared directories. This is a read–write metadata workload.

Similar tests can be performed by means of well-known benchmarks such as mdtest [34] and even HPCS-IO [11, 35]. However, unlike those benchmarks, our tests support batchops of different sizes. We have not considered benchmarks that involve both data and metadata operations since, as aforementioned, we focus on metadata operations only.

4.3 Objectives

Through the experiments, we aim to analyze four different aspects of batchops:

  (a) Optimum number of operations per batch operation.

  (b) Throughput and scalability for a single shared directory.

  (c) Performance when several shared and non-shared hugedirs are accessed in parallel.

  (d) Performance when one shared and one non-shared hugedir are accessed concurrently.

5 Results

The experiments evaluate the performance and scalability of batchops in FPFS considering the objectives described in the previous section. We use HDD-OSD+s and SSD-OSD+s as storage devices, and Ext4 and ReiserFS as backend file systems.

Since it is usual to find many processes running and accessing the storage in an HPC system, we use several clients (up to 256) in our experiments.

We use FPFS in all the tests since, to the best of our knowledge, no other similar file system provides batch operations or equivalent mechanisms. Therefore, a comparison between FPFS and other parallel file systems regarding batchops has not been possible.

Results shown for every system configuration are the average of at least five runs of each benchmark. Confidence intervals are also shown as error bars, for a 95 % confidence level. We format the test disks before every run of the create test, and unmount/remount them between tests for the rest.

Before discussing the results, we should mention an issue that has arisen during the experiments. Theoretically, batchops provide some clear benefits: they reduce the number of network messages interchanged and the network overhead, and increase the number of operations per second sent to servers. Due to this, batchops can reduce the application time, which lessens the chance of a block being rewritten and, hence, decreases the number of writes to disk issued by the kernel flush daemon. However, while batchops are usually beneficial for SSD-OSD+s, there are some cases where they downgrade the performance when HDD-OSD+ devices are used.

The problem with HDD-OSD+ devices is that more factors influence their performance, and it is not clear how batchops affect the results as a whole in some cases. For instance, the way files are allocated on disk can affect the performance. When files are created with batchops, a set of i-nodes for the same client can be allocated together on disk. However, if files are created without batchops, i-nodes are more likely to be stored in an interleaved pattern. These two forms of allocation affect performance, mainly because of two factors: head seeks and disk cache.

We have performed some internal tests (not shown here) to see the behavior of the stat and unlink tests after creating files with and without batchops. Also, we have calculated the number of read and write operations for each of these configurations. In those tests, we have seen that:

  • In the case of stat (read-only workloads), having the files created in an interleaved pattern (no-batch) obtains better results due to the prefetching performed by different caches. This prefetching allows clients to help each other by bringing to cache i-nodes from other clients. Conversely, when files are created with batchops, a client only helps itself in the stat test, reading mainly its own i-nodes. The other clients have to read their i-nodes, stored in different disk areas, by themselves; this causes larger head seeks, and can also evict from the disk cache blocks that other clients could use in the near future.

  • In the case of unlink, both read and write factors affect the test. Here, batchops help some configurations, but significantly downgrade performance in others. Reducing the time of the test by sending more operations to the servers allows us to reduce the number of writes (as in the create test), but we also need to consider the use of caches for reads in this test (as in the stat case). Given all this, we cannot always determine to what extent each factor affects the results.

Hence, considering the aforementioned findings, we only provide results with HDD-OSD+s for a single shared hugedir (see Sect. 5.2). For the remaining benchmarks, we only show results with SSD-OSD+ devices, since they always improve on HDD-OSD+s' results, and because the behavior of batchops is more homogeneous with SSD-OSD+s. Moreover, results with SSD-OSD+s have an easier explanation given that there are fewer factors influencing the results (especially, there are no head seeks).

5.1 Size of batch operation

We start by measuring the optimum number of operations embedded per batch request. We perform a test where 256 clients concurrently create files in a single directory. When the directory is shared out among several OSD+s (i.e., it is distributed), the clients create \(N \times \) 400,000 files altogether, where N is the number of OSD+s. In our experiments, N is either 1, 2, 4 or 8, so the clients end up creating 400,000, 800,000, 1,600,000 or 3,200,000 files in total. Since the directory is uniformly distributed, every OSD+ receives around 400,000 files. When the directory is stored in a single OSD+ (i.e., there is no distribution), the 256 clients also create either 400,000, 800,000, 1,600,000 or 3,200,000 files altogether, making it easy to compare the results when the directory is distributed and when it is not.

Figures 7 and 8 show the throughput in operations/s when not distributing and when dynamically distributing the directory, respectively. The figures show the performance for different batchop sizes (see Table 2). These tests use SSD-OSD+ devices. Results for HDD-OSD+s are equivalent, although they are not shown here.

Fig. 7. Operations per second obtained by FPFS on SSD-OSD+s with different number of operations embedded in a batchop, when 256 clients create, stat or unlink files on one non-distributed shared directory. NoBa means that batching is not applied. Graphs on the left show results for Ext4 and those on the right for ReiserFS. Note that the range of the Y axis can change from one test to another

Fig. 8. Operations per second obtained by FPFS on SSD-OSD+s with different number of operations embedded in a batchop, when 256 clients create, stat or unlink files on one distributed shared directory. NoBa means that batching is not applied. Graphs on the left show results for Ext4 and those on the right for ReiserFS. Each SSD-OSD+ ends up storing 400,000 files. Note that the range of the Y axis can change from one test to another

Figure 7 shows that when the shared directory is not distributed, with 1000 operations per batchop we increase the performance between 34 and 73 % for Ext4 with respect to no-batch operations, and between 38 and 75 % for ReiserFS, depending on the test and number of OSD+s. We also see that we already achieve almost the maximum possible improvement with only 50 operations per batchop, for both Ext4 and ReiserFS, and for any test. Indeed comparing Ba-50 with Ba-1000, although the number of operations included in the batchop is increased by 20\(\times \), Ba-1000 only improves performance by up to 21.5 % and, on average, only 8.4 %. Note that in this test there is a single server receiving concurrent requests from 256 clients. Therefore, the server is saturated and more operations per batch cannot improve the performance further.

Regardless of the number of operations per batch, the performance downgrades when the size of the directory increases. This problem is more evident in the unlink test when the backend file system is Ext4. The problem is that Ext4 handles larger directories worse than ReiserFS and, although SSD-OSD+ devices help avoid this problem by removing seek latencies, it is still noticeable in this test.

Figure 8 depicts the results when dynamically distributing the directory. We use a dynamic distribution that shares out the directory when it exceeds 8,000 files. Once distributed, the directory object size becomes the same on each OSD+, resulting in a balanced workload. The larger the number of operations per batch, the better the throughput. In general, the largest performance is obtained at 500 or 1000 ops/batch, although, similar to the non-distributed case, Ba-1000 only improves the performance achieved by Ba-50 by 13.0 % on average.

Focusing on 1000 ops/batch, the largest improvements are obtained in the create and stat tests. With Ext4, performance is improved between 39 and 88 %, and with ReiserFS, between 46 and 72 %. With both file systems and only 8 SSD-OSD+s, and thanks to batchops, we can produce more than 200,000 creates/s and 350,000 stats/s.

In the unlink test, batchops also improve performance significantly. With Ext4, we gain between 25 and 55 %. With ReiserFS, we obtain improvements between 21 and 23 %. In the case of Ext4, thanks to batchops, we reach 200,000 unlinks/s with just 8 SSD-OSD+ devices. There is, however, an odd behavior of Ba-10 on Ext4. As mentioned before, although batchops always reduce the network time, side effects can appear that improve the performance even more, or downgrade it despite the reduction in network time, especially for small batchops. This seems to be the case for Ba-10, which reduces the amount of bytes written by 10 % with respect to NoBa when there are 8 OSD+s, but increases that amount by 8.6 % when using 4 OSD+s, thereby eliminating the improvement that the reduction in the network time could provide to the overall time. We have not found a satisfactory explanation for this behavior of Ba-10 yet.

Table 2 Batch-operation sizes evaluated

To summarize, given these results, particularly when the directory is distributed, embedding 500 or 1000 operations per batch request seems a good option. Larger requests, although possible, would provide little benefit, since the improvement achieved when going from 500 to 1000 is usually small already, and even negative in some cases. Indeed, Ba-1000 includes 2\(\times \) more operations per batchop than Ba-500, but it improves only by up to 6.1 % the performance provided by Ba-500.

5.2 Single shared directory

Now, we compare the performance and scalability of batch and regular operations on a single distributed shared huge directory. In this test, a hugedir is accessed by 256 clients at the same time to create, get the status of and delete files. In addition, we evaluate the effect of the directory size by creating \(F \times N\) files in the directory, where F is either 200,000, 400,000 or 800,000, and N is the number of OSD+s. For instance, when having 8 OSD+s, the directory has 1,600,000, 3,200,000 or 6,400,000 files, respectively, that are equally distributed among the 8 OSD+s. Unless otherwise indicated, each batch request includes 1000 file operations.

For Ext4 and ReiserFS, Figs. 9 and 11 depict FPFS performance in operations/s obtained with HDD-OSD+ and SSD-OSD+ devices, respectively. Figures 10 and 12 show the speedup achieved for the same devices. In the figures, results are labeled as "N fi, Ba" and "N fi, NoBa", where N corresponds to the final number of files in every directory object (or OSD+, since there is only one directory object per OSD+ in this test), and "Ba" and "NoBa" stand for batching and no batching, respectively.

5.2.1 HDD-OSD+

Results for HDD-OSD+s and a single shared hugedir are depicted in Figs. 9 and 10. As we anticipated at the beginning of the section, with HDD-OSD+s there are more factors involved in the results, and it is not always clear to what extent they affect the different configurations. Moreover, since the behavior and performance observed here are repeated in the other tests carried out with HDD-OSD+s, the conclusions drawn here can be extrapolated to a large extent.

Fig. 9. Operations per second obtained by FPFS with HDD-OSD+s when using one shared hugedir. Graphs on the left show results for Ext4 and those on the right for ReiserFS. Note that the range of the Y axis can change from one test to another

Fig. 10. Scalability obtained by FPFS with HDD-OSD+s when using one shared hugedir. Graphs on the left show results for Ext4 and those on the right for ReiserFS. Note that the range of the Y axis can change from one test to another

For the create tests, Fig. 9a shows that batchops always perform better than no-batch. Namely, with Ext4, batchops improve performance by over 50 % for 8 OSD+s, whereas with ReiserFS, the improvement of batchops is more than 40 %. The configurations with no-batch suffer from the network limitation. In the create test, four network messages are generated per file: two (request and reply) for an open or creat call, and another two for closing the returned file descriptor. This significantly increases the amount of network messages compared to the other tests and, therefore, batchops are more effective. For instance, when creating 200,000 files, FPFS interchanges 800,000 network messages when regular operations are used, whereas with batchops only 800 messages are interchanged, since each batch request includes 1000 operations. Moreover, this benchmark only issues write requests that go to cache, instead of directly accessing the disk. Therefore, batchops perform better than regular operations by sending more requests in each message.

However, for stat and unlink, batchops are not always beneficial, and the performance of these tests depends not only on the backend file system, but also on how files are created, as we have already explained at the beginning of Sect. 5.

For stat and Ext4, batchops improve performance compared to no-batch for small directory objects (200,000 files), and for 4 and 8 OSD+s for larger directory objects, where the reduction of network traffic and the higher number of operations per second are more noticeable. For ReiserFS, batchops also improve performance for directory objects with a small number of files, and for 400,000 files with 4 and 8 OSD+s, but not for 800,000 files. In general, as directory objects get larger, batchops performance decreases, mainly because of the increase in head seeks and the poor use of caches compared with no-batch. To understand this fact, we should take into account the way files are created and then read (see the beginning of Sect. 5). When we use no-batch, files are laid out in an interleaved pattern. Each disk block read by a client will probably help other clients, because it will probably contain some of their i-nodes. A side effect of this is that clients roughly proceed at the same pace, so disk heads usually move forwards and disk caches are more efficiently used too. When using batchops, each client only helps itself, so the other clients have to issue read requests that produce large head seeks forwards and, what is worse, backwards, incurring large latencies. This behavior is more noticeable as directory objects grow, and especially in ReiserFS, where batchops perform 60 % worse than regular requests.

In the unlink test, we have two different behaviors depending on the file system (see Fig. 9c). Ext4 benefits from batchops for 4 and 8 OSD+s in any case, while ReiserFS only benefits for 4 and 8 OSD+s when there are 200,000 and 400,000 files per OSD+. As before, performance downgrades as the size of the directory grows. Several factors intervene here, particularly the type of workload, which mixes reads (similar to those issued by stat) and writes. As we have just seen, both Ext4 and ReiserFS downgrade performance with batchops for some configurations in stat, and this problem also affects reads in this unlink test.

However, when it comes to writes, Ext4 benefits from batchops, since many disk blocks are entirely modified in a short time (e.g., blocks full of i-nodes of files of the same client), and are usually written to disk only once despite the frequent commits in Ext4. This reduces the duration of the test, which in turn also reduces the chance of a block being modified several times and, therefore, reduces (again) the amount of writes. Without batchops, a disk block (e.g., a block with i-nodes from different clients) can be modified at different moments, and written to disk several times. This increases the number of writes and head seeks, so the test takes longer.

ReiserFS also reduces the number of write requests, but its performance is significantly determined by its behavior for read requests, as seen in the stat test. Therefore, batchops achieve better results when directories are smaller (200,000 files per OSD+). ReiserFS uses a B+ tree to store the directory and, apparently, that tree produces a more random pattern for placing files on disk, which later produces a worse use of caches.

Fig. 11. Operations per second obtained by FPFS with SSD-OSD+s when using one shared hugedir. Graphs on the left show results for Ext4 and those on the right for ReiserFS. Note that the range of the Y axis can change from one test to another

Fig. 12. Scalability obtained by FPFS with SSD-OSD+s when using one shared hugedir. Graphs on the left show results for Ext4 and those on the right for ReiserFS. Note that the range of the Y axis can change from one test to another

Figure 10 shows that, for HDD-OSD+s, scalability is super-linear, and usually better with batchops than with regular requests. Directory size impacts performance; therefore, for non-distributed configurations, performance with batchops significantly downgrades. Scalability for ReiserFS is usually smaller than for Ext4, because ReiserFS is less sensitive to the directory size. For example, Ext4 achieves a scalability higher than 30 for unlink, while ReiserFS slightly exceeds 20. Also, in the create test, Ext4 achieves larger speedups than ReiserFS when the shared directory is large (that is, when there are 8 OSD+s).

5.2.2 SSD-OSD+

Results for SSD-OSD+s and a single shared hugedir are depicted in Figs. 11 and 12. Now, batchops perform better than regular operations for all the tests.

Batchops are especially helpful for the create test, because they reduce the network traffic. Batchops significantly increase the number of requests per second for each OSD+ by sending more requests to each server in each message, and by sending them in parallel to all the servers. Thanks to batchops, FPFS is always able to improve performance by at least 50 %, doubling the number of operations per second in some cases of the create test.

Ext4 takes more advantage of batchops with SSD-OSD+ devices than with HDD-OSD+s in the create test. For instance, with 400,000 files per OSD+, batchops increase the number of files created per second by 30 % for SSD-OSD+s and Ext4 with respect to the results obtained for HDD-OSD+s, while they only improve the results by 5 % when the backend file system is ReiserFS.

In the stat test, for both Ext4 and ReiserFS, batchops improve performance by at least 25 %. The improvement is smaller than in the create test because the reduction in network traffic is smaller too, since stat already produces half as much network traffic as create.

In the unlink test, the backend file system determines the results to a large extent, with Ext4 being the file system that best leverages batchops. Especially for large directories, Ext4 performs 60 % better with batchops than without them, while ReiserFS achieves a 23 % improvement. This is because batchops cause a better use of the different caches when Ext4 is the local file system. Batchops allow the serving threads in the storage nodes to carry out a request immediately after the previous one, without waiting for a new request from a client after serving each one. This especially helps Ext4, which reads and writes more blocks than ReiserFS. For 800,000 files, Ext4 exceeds the RAM capacity, and using batchops helps reduce the number of written blocks. By writing less, we also improve read performance, since there is less competition for the disk. ReiserFS, in contrast, does not exceed the RAM capacity in our tests. Batchops still provide some benefits, but they are less noticeable.

Therefore, with batchops, disk blocks in the buffer cache, fetched during the processing of a request, are likely to be used in the next request of the same thread before being evicted by requests of other threads. For ReiserFS, batchops provide a smaller benefit. Since ReiserFS produces a quite “random” access pattern from a cache’s point of view [22], the improvement that can be obtained from the “aggregate disk cache” is limited.

Figure 12 shows that batchops hardly affect scalability. For the create test, the most noticeable change is for 800,000 files per OSD+ and 8 OSD+s, where scalability is super-linear. In the stat test, however, batchops slightly reduce the scalability for ReiserFS, and for unlink it remains super-linear for both Ext4 and ReiserFS. With batchops, we significantly reduce the amount of network traffic, especially when the directory is not distributed. Therefore, when we distribute a directory with batchops, the network reduction is not as high as the one achieved without batchops.

These results diverge from the ones with HDD-OSD+s, where batchops significantly increased the scalability. While, with batchops and HDD-OSD+s, the directory size significantly affected several tests, with SSD drives, again, we remove all the head seeks that provoked this increment.

5.3 Multiple hugedirs

Table 3 Performance obtained by FPFS on SSD-OSD+ devices with Ext4 and ReiserFS, when 8 hugedirs are accessed concurrently and no-batch operations are used
Table 4 Performance obtained by FPFS on SSD-OSD+ devices with batchops, and Ext4 and ReiserFS, when 8 hugedirs are accessed concurrently and batch operations are deployed

Distribution is beneficial for a single hugedir accessed by hundreds or thousands of clients. However, results can be rather different when several hugedirs are concurrently accessed by a few clients. In this section, we analyze the performance of batchops in that scenario using SSD-OSD+ devices. The following tests use 8 directories (each containing 320,000 files) accessed by 1, 16 and 32 clients per directory. Note that 1 client per directory is an example of non-shared directories, and that with 32 clients per directory there are 256 clients altogether. Once again, each batchop request includes 1000 operations.

Tables 3 and 4 show, for each number of processes per directory, the absolute application times when hugedirs are never distributed in the first column. The column labeled Dyn gives relative application-time variations, in percentage, with respect to the absolute times, when hugedirs are distributed dynamically (i.e., when a directory exceeds 8000 files). The column labeled Alw also gives relative application-time variations, in percentage, with respect to the absolute times, but when any directory is always distributed (i.e., when the threshold is 0). Confidence intervals (not shown) are smaller than 10 % of the mean. A positive/negative percentage means an increase/decrease in time and, hence, a worse/better performance. While Table 3 shows results for SSD-OSD+s without batch operations, Table 4 shows results for SSD-OSD+s with batchops.

The first thing we can observe is that, when comparing absolute times in the never columns, batchops improve performance in general (for both Ext4 and ReiserFS), especially when there is one client per directory. When directories are distributed, results obtained by batchops are more variable, as already happens with regular operations, and they also depend on the number of OSD+s, the number of processes per directory and the backend file system. However, there are some noticeable differences now with respect to a system without batchops.

For Ext4 and the create test, distribution and batchops improve results with respect to never when there is 1 client per directory, but slightly downgrade them when the number of clients per directory grows. However, absolute times are now lower in any case. For the stat operations, results are comparable with those without batchops, except for 1 client per directory and 8 OSD+s, where the distribution with batchops increases the application time from 5 to 56 %. For the unlink test, distribution with batchops behaves much better than without batchops, and now there is only a small increase in the application time. Moreover, with 8 OSD+s, batchops are able to significantly reduce the application time. An exception appears for 1 process per directory and 2 OSD+s although, considering absolute times, batchops still reduce the application time considerably.

For ReiserFS and the create benchmark, the behavior is similar to that of Ext4. For the stat test and 32 clients per directory, results are comparable to those obtained without batchops. For 1 and 8 clients per directory and for 8 OSD+s (and, sometimes, 4 OSD+s), distribution increases times with respect to never more than when batchops are not used. For the unlink case, unlike what happens with regular operations, distribution and batchops reduce the application time with respect to never.

In summary, although the distribution of hugedirs can downgrade the performance in some cases, results also show that batchops can help to reduce the possible negative effects caused by such distribution. We believe this is because the threads attending requests in the servers can process more requests in a shorter time. This improves caches’ performance and reduces the overhead produced by disk contentions. The results obtained with hard drives (not included) confirm these findings.

5.4 Mixed directories

Fig. 13. Operations per second obtained by FPFS with SSD-OSD+s when a distributed hugedir and a non-distributed hugedir are concurrently accessed by 256 clients. Graphs on the left show results for Ext4 and those on the right for ReiserFS. Note that the range of the Y axis can change from one test to another

Figure 13 depicts the throughput in operations/s achieved by FPFS with SSD-OSD+ devices when two hugedirs, a distributed one and a non-distributed one, are accessed at the same time by 128 clients each. Results are labeled as "Dis-Ba", "NoDis-Ba", "Dis-NoBa", and "NoDis-NoBa", where "Dis" stands for distributed and "Ba" for batching. There are always 1,280,000 files per directory, evenly shared out among clients. Again, each batchop includes 1000 operations. Batchops always improve the performance of both directories in all cases, and, as with a single shared directory, the reduction in network traffic and a better use of the caches explain the improvements.

In the create test, batchops achieve an improvement of more than 30 % for both the non-distributed and distributed directory, and both Ext4 and ReiserFS, due to the reduction in network traffic.

Batchops obtain the best improvements in the stat test. For the non-distributed directory, batchops improve the throughput by at least 34 % with Ext4, and by 44 % with ReiserFS. In the case of the distributed directory, batchops achieve a maximum improvement of 36 % with Ext4, and with ReiserFS the improvement reaches 40 %. Since this is a read-only test, which reads related directory entries and i-nodes, batchops allow servers to make a better use of caches and prefetching, because they process many requests in a row.

Finally, results in the unlink test are similar to those in Sect. 5.2.2, where Ext4 performs better than ReiserFS as the number of OSD+s increases. For the distributed directory, Ext4 achieves a 40 % improvement with 8 OSD+s, while ReiserFS gets 16 %. For Ext4 and the non-distributed directory the improvement is around 30 %, and, in the case of ReiserFS, the improvement is around 25 %.

6 Related work

To the best of our knowledge, batchops have not been proposed in the parallel/distributed file systems field, although some network file systems support similar ideas. For instance, NFSv4 [30] reduces latency for multiple operations by bundling different RPC calls into a single request. Operations such as lookup, open, read and close, for example, can be sent over the wire at once, and the server can execute the entire compound call as a single entity. Version 2 of the Server Message Block protocol (SMB2) [32] is also able to send an arbitrary set of commands in a single request, thereby improving performance by reducing the number of network round-trips. The compounding ability in SMB2 is very flexible: commands packed in a single request can be unrelated (executed separately, potentially in parallel) or related (executed in sequence, with the output of one command available to the next), and responses can also be compounded or sent separately. Note that, for both network file systems, there is a single server. Our approach, however, allows batch operations in a distributed multi-server environment.
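
The contrast between compounding and batchops can be sketched as follows. The structures below are purely illustrative (they are not the actual NFSv4 or SMB2 wire formats, nor FPFS's message layout): a compound request chains heterogeneous operations, whereas a batchop repeats one operation over many entries of the same directory.

```c
/* Illustrative only: not the real NFSv4/SMB2 wire format nor FPFS's
 * actual message layout. */
#include <stddef.h>
#include <stdint.h>

enum op_code { OP_LOOKUP, OP_OPEN, OP_READ, OP_CLOSE, OP_STAT, OP_UNLINK };

struct compound_request {          /* NFSv4/SMB2-style compounding          */
    size_t        n_ops;           /* number of chained operations          */
    enum op_code *ops;             /* heterogeneous operation list          */
    void        **args;            /* per-operation arguments               */
};

struct batch_request {             /* FPFS-style batchop (sketch)           */
    enum op_code  op;              /* one operation applied to all entries  */
    uint64_t      dir_object;      /* directory object the entries belong to*/
    size_t        n_entries;       /* hundreds to thousands of names        */
    const char  **entries;         /* entry names within that directory     */
};
```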

Ideas similar to batchops have been used in many other areas. For instance, Linux kernel 3.14 [50] includes a feature, called automatic TCP corking, to help applications that issue small write()/sendmsg() calls on TCP sockets. Previous Linux versions already allowed the use of TCP corking, although kernel 3.14 is the first one to include automatic TCP corking. This feature delays the dispatch of messages in a socket to coalesce more bytes into the same packet, thereby lowering the total number of packets sent. Note, however, that it is not a batch operation proper. This technique complements batchops, although a study of the performance of both together is left for future work. Also in the networking area, IX [5] is an operating system that uses adaptive batching in every stage of its network stack to improve performance under congestion. IX batches network requests in the presence of network congestion and allows application threads to issue batched system calls. Tyche [21] is a network storage protocol over raw Ethernet that uses an adaptive batching mechanism to achieve high link utilization under high degrees of I/O concurrency and small I/O requests. Tyche proposes a dynamic technique that varies the degree of batching depending on the throughput achieved. In Tyche, a batch message is composed of several I/O requests, reads or writes, issued by the same or different application threads.
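
As a reference for how corking coalesces small writes at the socket level (independently of FPFS), the following sketch uses the long-standing TCP_CORK socket option explicitly; with automatic corking in Linux 3.14 and later, the kernel applies a similar delay without any application changes. The socket is assumed to be already connected.

```c
/* Sketch: coalescing several small writes into fewer TCP packets with
 * TCP_CORK. With automatic TCP corking (Linux >= 3.14), the kernel can
 * delay small writes on its own, without these setsockopt() calls. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <string.h>
#include <unistd.h>

static void send_corked(int fd, const char *hdr, const char *body)
{
    int on = 1, off = 0;

    /* Hold partial frames in the kernel until the cork is removed. */
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));

    write(fd, hdr,  strlen(hdr));   /* small write #1 */
    write(fd, body, strlen(body));  /* small write #2 */

    /* Uncork: flush the coalesced data as full-sized packets. */
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}
```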

Similarly, but in the grid computing area, Chervenak et al. [10] use what they call bulk operations in the implementation of a Replica Location Service (RLS). RLS provides a mechanism for registering the existence of replicas and discovering them within a grid environment. It stores catalogs that map logical names to target names, and clients send queries to the servers to discover replicas associated with a logical name. Among the operations supported, there are bulk operations to add/delete entries and/or attributes to the catalogs, and to perform query operations on them, with 1000 requests per bulk operation. Their experiments show a significant performance improvement for a single client. However, as the number of clients increases, the performance advantage of bulk queries decreases. We observe a similar behavior in our experimental results, although our improvement with batchops over regular operations is much higher than theirs, and batchops still provide noticeable benefits with a large number of clients.

OpenStack Swift also includes two bulk operations in its Object Storage API: delete [38] and archive extraction [37]. Bulk delete can remove up to 10,000 objects or containers (configurable) in one request. Archive extraction expands a tar file into a Swift account in a single request; only regular files are uploaded, while empty directories, symlinks, etc. are not.

Another area where reducing the number of requests is especially useful is the Internet. The next major version of HTTP (HTTP/2 [6]) will use bulk operations to accelerate communications. Currently, services like Google or Facebook also try to reduce the number of HTTP requests by batching operations together. Google [27] uses batch requests in Google Base [24], Google Spreadsheet [23], Google Calendar [25] and the Google Cloud Storage API [26]. Specifically, the Google Cloud Storage API provides batch requests to bundle API calls together and reduce the number of HTTP connections clients have to make. In a similar vein, Facebook provides its Ads API [15] and Graphics API [16] with batch requests to send several requests of the same type in a single HTTP request. Depending on the type of operation, the maximum number of requests per batch operation varies.

7 Conclusions

In distributed or parallel file systems, workloads that perform the same operation on multiple files, such as the migration of a directory, the creation of a set of files in a directory, or the removal of all the files in a directory, usually incur large amounts of network traffic. To deal with these workloads more efficiently, we present the design and implementation of operations that embed hundreds or thousands of operations of the same type into a single message. These operations are possible in FPFS because its namespace distribution is based on directories, which usually contain related files. With these operations, which we call batchops, we significantly reduce the number of network messages and, therefore, network delays and round-trips. We also manage to reduce the overall network congestion, making better use of the available I/O and processing resources.

We add the management of batchops to FPFS by including specific operations to create (openv and closev), get the status of (statv) and unlink (unlinkv) files in a batch fashion. For each operation, we modify the message format to include a list of entries within the same directory. Our batch operations include semantics to specify the behavior when one of the operations in a batchop fails. The implementation also supports huge directories transparently; clients do not need to differentiate between distributed and non-distributed directories when issuing batchops.
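
As an illustration of how an application might use these calls, the following sketch shows a hypothetical statv-style interface. The paper only names the functions, so the exact signature, the per-entry error reporting and the failure-policy flag shown here are our own assumptions; the loop over stat() merely emulates the semantics on a local directory, whereas the real batchop packs all names of the same directory into a single message per OSD+ involved.

```c
/* Hypothetical sketch of a statv-style batch call; the real FPFS
 * prototypes may differ. */
#include <sys/stat.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

enum batch_policy {
    BATCH_ABORT_ON_ERROR,    /* stop at the first failing entry (assumed)  */
    BATCH_CONTINUE_ON_ERROR  /* try every entry, report an errno per entry */
};

/* Assumed prototype: stats 'count' names inside 'dirpath', returns the
 * number of entries attempted and fills errs[i] with 0 or an errno. */
static int statv(const char *dirpath, const char *const names[], size_t count,
                 struct stat bufs[], int errs[], enum batch_policy policy)
{
    char path[4096];
    size_t i;

    for (i = 0; i < count; i++) {
        snprintf(path, sizeof(path), "%s/%s", dirpath, names[i]);
        errs[i] = (stat(path, &bufs[i]) == 0) ? 0 : errno;
        if (errs[i] != 0 && policy == BATCH_ABORT_ON_ERROR)
            return (int)(i + 1);
    }
    return (int)count;
}

int main(void)
{
    /* One logical batchop over three entries of the same directory
     * ("/tmp" is used here only so the demo runs locally). */
    const char *names[] = { "file-000001", "file-000002", "file-000003" };
    struct stat bufs[3];
    int errs[3];

    int done = statv("/tmp", names, 3, bufs, errs, BATCH_CONTINUE_ON_ERROR);
    printf("attempted %d entries; first errno: %d\n", done, errs[0]);
    return 0;
}
```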

The experiments show that batchops help us reduce the network overhead and increase the number of operations/s in OSD+s, improving FPFS performance. Specifically, in tests that make a more intensive use of the network, such as the creation of a single shared directory, performance improves by at least 50 %, reaching 100 % in some cases. In the case of stat, the improvement is always around 25 %. Finally, for the unlink test, which issues both read and write requests, the backend file systems determine the results to a large extent: Ext4 is the one that better leverages batchops, with an improvement of 60 %, while ReiserFS obtains 23 % when using this kind of operations.

Thanks to batchops, FPFS can create, stat and delete around 200,000, 300,000 and 200,000 files per second, respectively, with just 8 SSD-OSD+ devices and a regular Gigabit network.

Finally, our experiments also show that, while batchops are usually beneficial with SSD-OSD+s, there are some cases where they degrade performance when HDD-OSD+ devices are used. The problem is that batchops affect the way files are allocated on disk. For HDD-OSD+s, this different layout increases the I/O time in some cases due to more head seeks and less efficient disk caches.

Although some common file operations can already take advantage of batchops (e.g., ls -l and rm -rf), as future work, we plan to identify specific HPC applications and scenarios that can benefit from our proposal.