
1 Introduction

Recent advancements in computer vision using Convolutional Neural Networks (CNNs) have emphasized their ability to classify images and detect objects within images [16]. However, these tasks require significant computational resources and time to complete [25]. Fortunately, Graphics Processing Units (GPUs) are more capable of handling these advanced computations than Central Processing Units (CPUs) [9, 27]. For example, models trained in this work require 20–30 min to complete on a GPU but take 10–14 h on a CPU. GPUs were originally designed for graphics rendering. However, as researchers showed more interest in using GPUs for computations associated with machine learning, NVIDIA (the most commonly used GPU manufacturer for machine learning [20]) created the CUDA Library [12], thus enabling the use of GPUs for diverse machine learning tasks. Unfortunately, NVIDIA's GPUs and CUDA Library introduce nondeterminism, reflected in two or more identically trained models sometimes producing different results. This GPU-related nondeterminism is distinct from the nondeterminism due to randomness embedded in the model structure by features such as Stochastic Data Augmentation and Stochastic Weight Initialization. The existence and impact of nondeterminism related to both the randomness embedded in the model structure and the GPU have gained increasing attention [20], and deserve continued assessment and research aimed at reducing them.

1.1 Motivation

This work is inspired and motivated by previous research [22] using a Mask R-CNN [28] to analyze metallic powder particles and detect deformations on their surfaces. Since these deformations, known as satellites, impact the usability of metallic powders, accurate detection is very important. In the previous work, using a Mask R-CNN led to accurate detection, even on a diverse dataset composed of multiple powder types imaged at varying magnification settings. However, upon closer analysis of the results with respect to determinism, it was discovered that, in some cases, two or more identically trained models could produce significantly different results. Figure 1, depicting the outputs of two identically trained models, labeled Model A and Model B, highlights these differing results due to nondeterminism. These models were specifically selected to illustrate potential variation in outputs. In Fig. 1, the section outlined in green highlights a small satellite detected in Fig. 1a more than tripling in size and losing its discernible shape in Fig. 1b. Similarly, in the section outlined in red, two particles correctly identified as having no satellites in Fig. 1a are misidentified as satellites in Fig. 1b. Motivated by the nondeterminism causing these variations, this manuscript aims to quantify its impact and provide viable options to reduce or remove it to ensure replicability of experimental results.

1.2 Terminology

To avoid confusion due to various existing definitions, for the purpose of this work, determinism and nondeterminism are defined as follows: (1) An algorithm is said to be “determinable” if its current state uniquely determines its next state. Simply put, an algorithm at any state should produce exactly one output. (2) An algorithm is said to be “nondeterminable” if, at a given state, multiple potential outputs are possible given the same input [14]. In the context of this paper, if all models trained within a given environment are identical, that training environment is determinable.

1.3 Contributions

The ability to replicate results is fundamental to research [10]. Thus, being able to eliminate, or at least reduce, nondeterminism in a Mask R-CNN is imperative. The contributions of this work are threefold:

  1. Identifying and evaluating the causes and extent of nondeterminism in Mask R-CNN models with embedded randomness trained on an NVIDIA GPU.

  2. Evaluating the extent of nondeterminism in Mask R-CNN models with no embedded randomness trained on an NVIDIA GPU.

  3. Offering a simple method, requiring only eight additional lines of code, to achieve purely deterministic results through a combination of using a CPU and specific training configurations.

Fig. 1.

Example of variation of performance in identically trained Mask R-CNN models as a result of nondeterminism (Color figure online)

2 Nondeterminism Introduced in Training

To measure nondeterminism in model training caused by GPUs, all other sources of nondeterminism must first be eliminated. Through rigorous examination of literature, documentation, and user manuals of the varying tools and packages [12, 21, 28], the following have been identified as potential sources of nondeterminism embedded in the model: Random Number Generators (used by Python Random Library, PyTorch, NumPy, and Detectron2), Detectron2 Augmentation Settings, and the PyTorch implementation of CUDA Algorithms. Figure 2 illustrates the general sources of nondeterminism that may be present in a Mask R-CNN, as well as the components of each source. The following subsections give some background information on each of these sources.

Fig. 2.

General sources of nondeterminism that may be found in Mask R-CNNs (*not present in this work, but may be present in other implementations)

2.1 Random Number Generators

The training of a CNN employs randomness in large-scale computations to reduce training time and prevent bottlenecks [5, 29]. Each instance of embedded randomness is enabled by a Pseudo-Random Number Generator (PRNG) that generates sequences of numbers designed to mimic randomness. Mersenne Twister (MT) [19] is one of the PRNG algorithms most frequently used by tools such as the Python Random Library [26]. MT simulates randomness by using the system time to select the starting index, or seed, in the sequence of numbers when a PRNG is created [26]. Without a set seed, each PRNG starts at a unique index, leading to different outputs and introducing nondeterminism in training.
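This seeding behavior can be illustrated with Python's `random` module, which uses MT; the following is a minimal sketch, not tied to any training code:

```python
import random

# Two PRNGs created with the same seed start at the same index in the
# sequence and therefore agree on every draw...
a = random.Random(42)
b = random.Random(42)
assert [a.random() for _ in range(3)] == [b.random() for _ in range(3)]

# ...while two PRNGs left unseeded are initialized from system state,
# almost surely start at different indices, and so disagree.
c = random.Random()
d = random.Random()
print(c.random() != d.random())  # True with overwhelming probability
```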

2.2 Model Structure

The model structure is configured by Detectron2 [28], which uses a fairly common training technique called Stochastic Data Augmentation to randomly mirror images prior to training [23]. The stochastic nature of this selection increases the nondeterminism in training. Augmentation is the only source of nondeterminism caused by the model structure in this work; however, other implementations may include additional sources.
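A stochastic mirroring step can be sketched as follows; this is a hypothetical stand-in operating on a nested-list image, not Detectron2's own augmentation code:

```python
import random

def random_flip(image, rng, prob=0.5):
    """Mirror a 2-D image left-right with probability `prob`.

    `rng` is the PRNG driving the stochastic decision; seeding it
    makes the augmentation reproducible across runs.
    """
    if rng.random() < prob:
        return [row[::-1] for row in image]
    return image

img = [[1, 2, 3],
       [4, 5, 6]]
# With a fixed seed, the flip decision is the same on every run:
print(random_flip(img, random.Random(0)))
```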

2.3 CUDA Algorithms and PyTorch

The PyTorch implementation of the CUDA Library [21], by default, contains two settings that increase nondeterminism in training. First, CUDA uses Benchmark Testing to select optimal algorithms for the given environment. However, as indicated in the documentation [21], this testing is “susceptible to noise and has the potential to select different algorithms, even on the same hardware.” Second, by default, the library chooses nondeterminable algorithms for computing convolutions instead of their determinable counterparts. These nondeterminable algorithms are selected because they simplify computations by estimating randomly selected values instead of computing exact values for each layer [6]. Both configurations increase the nondeterminism present in model training.
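Both settings can be overridden in a few lines; below is a minimal sketch using the public PyTorch flags (the last call, available from PyTorch 1.8 onward, additionally raises an error when an operation has no determinable implementation):

```python
import torch

# Disable benchmark testing so cuDNN does not profile and (possibly)
# select different algorithms from run to run on the same hardware.
torch.backends.cudnn.benchmark = False

# Request determinable cuDNN convolution algorithms instead of the
# nondeterminable defaults.
torch.backends.cudnn.deterministic = True

# Optionally enforce determinable algorithms throughout PyTorch.
torch.use_deterministic_algorithms(True)
```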

3 Nondeterminism Introduced by Hardware

3.1 Floating-Point Operations

Many computer systems use floating-point numbers for arithmetic operations; however, these operations have a finite precision that cannot be maintained with exceptionally large or small numbers. In this work, values were stored using the IEEE 754 Single Precision standard [18]. Unfortunately, due to the finite precision of floating-point numbers, some calculations are approximated, causing rounding errors and rendering floating-point operations non-associative [15]. Equations 1 and 2 provide an example in which this non-associativity impacts the final result. In the intermediate sum in Equation 1, the 1 is rounded off when summed with 10\(^{100}\), causing it to be lost in approximation. When computing the difference after rounding, 10\(^{100}\) − 10\(^{100}\) returns 0. By contrast, if 10\(^{100}\) − 10\(^{100}\) is performed first, as in Equation 2, the intermediate result is 0, and summing it with 1 returns the correct value of 1. In summary, due to the non-associativity of floating-point operations, the order in which operations are executed impacts the outputs. This becomes increasingly relevant when parallel computing is implemented, as further elaborated in Sects. 3.2 and 3.3.

\((1 + 10^{100}) - 10^{100} = 0\)    (1)

\(1 + (10^{100} - 10^{100}) = 1\)    (2)
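This rounding behavior can be reproduced directly; the sketch below uses Python's IEEE 754 double-precision floats, where the same absorption occurs:

```python
# Floating-point addition is not associative: 1.0 is absorbed when
# added to 1e100, so the grouping of operations changes the result.
lost = (1.0 + 1e100) - 1e100   # the 1.0 is rounded away first
kept = 1.0 + (1e100 - 1e100)   # the large terms cancel exactly first

print(lost)  # 0.0
print(kept)  # 1.0
```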

3.2 Atomic Operations

Shared memory is commonly implemented in parallel computing [3]. However, when multiple operations access the same location in memory at nearly the same time, data can be "lost" due to overlapping operations, depending on when read and write methods are called [11]. Atomic Operations resolve this by performing a read and write call as a single atomic action, preventing other operations from accessing or editing that location in memory until it completes. Atomic operations are designed to ensure memory consistency but do not guarantee a consistent completion order [11]. Effects of this are noted in the CUDA Library documentation [12], which states, in reference to a list of algorithms used in convolutions, that "the following routines do not guarantee reproducibility across runs, even on the same architecture, because they use atomic operations."
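The distinction between memory consistency and completion-order consistency can be sketched in plain Python threads; this is an illustration of the concept with a lock standing in for a hardware atomic, not CUDA code:

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    """Increment the shared counter n times atomically."""
    global counter
    for _ in range(n):
        with lock:          # read + write become one atomic action
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Memory is consistent: no increment is lost...
print(counter)  # 40000
# ...but, as with GPU atomics, the order in which the four threads
# interleaved their increments is still not guaranteed.
```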

3.3 Parallel Structure

With the introduction of the CUDA Library, taking advantage of the benefits of parallel computing with GPUs became easier [8] and more frequent [13]. Despite these benefits, there are inherent drawbacks to most multi-core or multi-threaded approaches. In parallel computing, large computations are broken into smaller ones and delegated to parallel cores/threads. Each sub-task has a variable completion time, an effect amplified by the use of atomic operations. When considering the variable completion times of the various tasks and the non-associativity of floating-point operations, it is not surprising that GPUs introduce nondeterminism. Figure 3 illustrates how a slight variation in the completion order of sub-tasks can lead to nondeterminable results due to floating-point operations. Figures 3a and 3b depict the process sum() adding sub-functions (labeled F1 to F5) together, in which each sub-function is dispatched to its own core/thread to be individually computed. The outputs (labeled O1 to O5) are collected in order of completion and summed. However, since completion order is not guaranteed, these outputs can be collected in different orders, resulting in differing outputs despite identical inputs and hardware, because of floating-point non-associativity [15].
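This sum() scenario can be emulated by summing the same set of sub-task outputs in every possible completion order; the values below are hypothetical and chosen to make the floating-point effect visible:

```python
import itertools

# Hypothetical sub-task outputs, summed in every possible
# completion order.
outputs = [1e100, 1.0, -1e100]

totals = set()
for order in itertools.permutations(outputs):
    total = 0.0
    for o in order:        # accumulate "in order of completion"
        total += o
    totals.add(total)

# Identical inputs, different completion orders, different totals.
print(totals)
```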

Fig. 3.

Tracing the impact of variable completion times in parallel structures using floating-point operations

4 Experimental Setup

This work used the Detectron2 implementation of the Mask R-CNN with PyTorch (v1.8.1) [21] and CUDA (v10.2) [12]. Initial weights were pulled from the Detectron2 Model Zoo Library to remove any variation in weight initialization. The dataset used here is the same dataset used in [22], consisting of images of metallic powder particles collected from a Scanning Electron Microscope. It contains 1,384 satellite annotations across six powders and five magnifications and was separated using an 80:20 ratio between training and validation datasets.

4.1 System Architecture

To establish a benchmark for the variation in performance caused by embedded randomness in the model structure and by the GPU, 120 models were trained using an NVIDIA V100 GPU. Of these models, 60 were left non-configured and 60 were configured, as shown in the source code [2], such that all embedded randomness within the model structure was disabled. This ensured that any nondeterminism present after configuring models was induced solely by the GPU. This experiment was then replicated using CPUs. However, due to the large difference in training time between GPUs and CPUs, a 5-Fold Cross Validation [4] was used instead of training 60 models per configuration. Based on results from the ten models trained on the CPU, training 120 models on a CPU would have taken between 50 and 70 days of computational time, instead of 48 h on a GPU. The 5-Fold Cross Validation was used only to evaluate whether results were determinable over multiple iterations; due to the small number of data points, it was not used to evaluate the extent of nondeterminism.
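The fold assignment of a 5-Fold Cross Validation can be sketched as follows; `k_fold_indices` is a hypothetical helper for illustration only, not the project's code:

```python
def k_fold_indices(n_samples, k=5):
    """Split sample indices into k contiguous folds; each fold serves
    once as the validation set while the rest form the training set."""
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(list(range(start, end)))
        start = end
    return folds

folds = k_fold_indices(10, k=5)
for i, val_fold in enumerate(folds):
    train = [j for f in folds if f is not val_fold for j in f]
    # iteration i trains on `train` and validates on `val_fold`

print(folds)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```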

4.2 Measuring Performance

Identical to previous work [22], performance was measured by computing precision and recall, as defined in Equations 3 and 4. For every image in the validation set, each pixel was classified as a True Positive (TP), False Positive (FP), True Negative (TN), or False Negative (FN) depending on its true value and the predicted output. Once these scores were computed for each image, they were averaged across all images in the validation set to obtain a final score for each model. Nondeterminism was evaluated by analyzing the average, standard deviation, and spread of each performance metric collected from models trained with identical configuration settings. If the training configurations and hardware are determinable, precision and recall will be identical across all trained models.

\(\text{Precision} = \frac{TP}{TP + FP}\)    (3)

\(\text{Recall} = \frac{TP}{TP + FN}\)    (4)
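The pixelwise scoring described above can be sketched with a small helper; `pixel_precision_recall` is a hypothetical illustration over flat 0/1 label sequences, not the evaluation code from [22]:

```python
def pixel_precision_recall(pred, truth):
    """Pixelwise precision and recall for one image.

    `pred` and `truth` are flat sequences of 0/1 pixel labels
    (1 = satellite pixel).
    """
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 3 predicted positives, 2 correct, 1 true positive missed.
p, r = pixel_precision_recall([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(p, r)  # 0.6666666666666666 0.6666666666666666
```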

4.3 Model Training Process

Previous work [22] discovered that in most cases, training beyond 10,000 iterations had little impact on performance. As a result, in an effort to prevent underfitting or overfitting, all models were trained to 10,000 iterations. Additionally, to prevent introducing any bias, hyperparameters were left at their default values. All calculations were completed in batch jobs dispatched to private nodes on Bridges2 [7], a High-Performance Computer (HPC) operated by Pittsburgh Supercomputing Center and funded by the National Science Foundation. Each node contained two NVIDIA Tesla V100 GPUs and two Intel Xeon Platinum 8168 CPUs.

4.4 Configuring Settings

To compare models with embedded randomness enabled and disabled, specific configurations had to be set. Table 1, depicting the configuration settings, shows the value of each configuration for models with and without embedded randomness. Configuring a PRNG's seed only changes the starting index and has no further impact on the randomness [19]. As a result, so long as the seed remains constant, its specific value is arbitrary. Evidence for this is found in the Detectron2 Source Code, stating the seed needs to be "any positive integer," and in the NVIDIA Determinism Repository, stating "123, or whatever you choose" [1, 28]. In light of this, all seeds were arbitrarily set to "42." After reviewing the CUDA Toolkit Documentation [12], the PyTorch Documentation [21], the Detectron2 Source Code [28], and the NVIDIA Determinism Repository [1], the only documented option for achieving reproducible results that was not implemented was seeding the PyTorch DataLoader. This was not configured because Detectron2 implements its own custom DataLoader class, and the PyTorch version was not used.
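One plausible rendering of these settings is sketched below (the exact lines live in the repository [2] and may differ); `cfg` is assumed to be a Detectron2 config object, and `INPUT.RANDOM_FLIP = "none"` disables the stochastic augmentation:

```python
import random
import numpy as np
import torch

SEED = 42  # any fixed positive integer; only constancy matters

random.seed(SEED)                          # Python Random Library PRNG
np.random.seed(SEED)                       # NumPy PRNG
torch.manual_seed(SEED)                    # PyTorch CPU and CUDA PRNGs
cfg.SEED = SEED                            # Detectron2 seed
cfg.INPUT.RANDOM_FLIP = "none"             # no stochastic augmentation
torch.backends.cudnn.benchmark = False     # no benchmark testing
torch.backends.cudnn.deterministic = True  # determinable convolutions
torch.use_deterministic_algorithms(True)   # determinable algorithms only
```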

Table 1. Configuration values for non-configured and fully configured models

5 Experimental Results

5.1 Data Collected from Models Trained on GPU

After training 120 models on a GPU (60 non-configured and 60 fully configured), there was clear evidence of nondeterminism regardless of configuration settings. Table 2 shows all performance metrics gathered from models trained on a GPU for comparison, but attention will be drawn specifically to the standard deviations of precision and recall values (bolded and marked * and **, respectively). As can be seen in Table 2, configuring the embedded randomness in the model decreased the standard deviation of precision values by 1% (marked *) and of recall by 0.1% (marked **). Despite the 1% reduction in the variation of precision, only 25% of the nondeterminism is eliminated, leaving a remaining 3.1% standard deviation caused by the GPU. Figure 4 shows the distributions of precision values for non-configured and fully configured models. As can be seen, the distribution in Fig. 4a, corresponding to non-configured models, has a larger spread of data points than that in Fig. 4b, corresponding to fully configured models.

Table 2. Performance metrics for non-configured and fully configured models
Fig. 4.

Comparative results of the distribution of precision values collected from non-configured models (a) and fully configured models (b) trained on a GPU

Table 3. Performance metrics for non-configured models trained on CPUs

5.2 Data Collected from Models Trained on CPU

Since, as previously discussed, only five models per configuration were trained on a CPU instead of 60 due to the very large expected training time, the presence or absence of nondeterminism can be observed but not quantified. Models trained on a CPU with all embedded randomness disabled produced perfectly determinable results. These results were identical up to the 16th decimal place (only measured to 16 decimal places), with precision and recall scores of approximately 76.2% and 56.2%, respectively. Among these models, there was a minimum training time of 606.15 min, a maximum of 842.05 min, and an average of 712.93 min. The variation in training times had no impact on the accuracy of the model. In contrast to the identical precision and recall scores of every CPU-trained model with no embedded randomness, when embedded randomness was enabled, there was not a single duplicated value. As shown in Table 3, which compares the precision and recall scores of non-configured models trained on a CPU, each model produced quite different results, showing that nondeterminism is present. Since nondeterminism is present in CPU training when embedded randomness is enabled, nondeterminism can, in part, be attributed to the embedded randomness in the model.

6 Discussion of Results

6.1 Impact of Embedded Randomness on Model Precision

As previously discussed, randomness is deliberately embedded in machine learning models to improve their generalizability and robustness [17]. Eliminating the embedded randomness within the model brings an associated reduction in the model's ability to generalize to samples with more variation than those within the training set. In context, by decreasing the randomness embedded in the model structure during training, the model's ability to handle formations of satellites not included in the training set may decrease. This could explain why the average precision and recall values were lower for the fully configured models, and why fewer fully configured models achieved precision scores above 75% (4 models) compared to non-configured models (17 models). In summary, by disabling embedded randomness, the model may be less capable of handling new data and, as a result, less generalizable.

6.2 Increase in Training Time After Configuring Randomness

Even though the reduction in performance variation was about 25% after disabling the randomness embedded in the model structure, the training time increased by nearly 50%. Non-configured models took on average 19.5 min to train on a GPU, which rose to 28.8 min after configuring the embedded randomness. This increase was theorized to be the result of forcing CUDA algorithms to be determinable instead of their nondeterminable counterparts. To test this, 40 models were trained with all embedded randomness disabled except that nondeterminable algorithms remained allowed. With these parameters, the models had precision and recall scores nearly identical to those of fully configured models, with averages of 71.966% and 60.207%, respectively, and standard deviations of 3.455% and 2.354%. However, the average training time decreased to 19.1 min with a standard deviation of 0.130 min, much closer to that of the non-configured models. As a result, since forcing determinable algorithms has a minimal impact on the variation but increases the training time by approximately 50%, it is suggested to allow nondeterminable algorithms when response time is a priority.

6.3 Impact of Seed Sensitivity

Disabling embedded randomness within the model structure had little adverse impact on performance. Between non-configured and fully configured models on the GPU, precision and recall were reduced on average by 2% and 0.6%, respectively. Since each seed produces a different output sequence than any other and slightly impacts performance, the model is seed-sensitive [24]. In this case, the seed was arbitrarily set to "42"; however, other seed values may produce different results. Thus, if hyperparameter tuning is performed with a configured seed, users may consider testing multiple seed values to identify which works best for the given dataset and parameters.

6.4 Conclusion

The methods and procedures highlighted in this manuscript aim to inform the selection of parameters and hardware for training a Mask R-CNN model with respect to nondeterminism and training time. In cases where determinable results are a priority, model training can be performed on a CPU with the embedded randomness in the model structure configured. This guarantees fully determinable results and requires only an additional eight lines of code. These configurations can be found in the training files of the repository associated with this manuscript [2]. Unfortunately, by running computations on a CPU instead of a GPU, the training time increases from 20–30 min to 10–14 h. As a result, a CPU should only be used in cases where computational resources are not a concern and replicability is more important than speed and efficiency. If determinable results are not the first priority, training on a GPU is, in most cases, the better choice. However, alongside the reduction in training time achieved by using a GPU (at least 20 times faster), the nondeterminism present during model training will increase. Here, the standard deviation of this variation in non-configured models was approximately 4.2% for precision and 2.7% for recall. Using the methods established above, this variation can be reduced to approximately 3.1% and 2.6% for precision and recall, respectively, while still performing computations on a GPU. Each scenario will have different priorities, but this work can be used as a guide for configuring a training environment with respect to nondeterminism and training time.