1 Introduction

Recent years have witnessed growing research on and application of data augmentation in various domains, such as image classification [1, 2], face recognition [3, 4], moving object detection [5, 6] and text mining [7, 8]. While machine learning techniques excel in various tasks, they often struggle with variations in data distribution [9]. Furthermore, data acquisition and labeling are largely manual processes, which limits the amount of available training data. When trained solely on raw images, many deep neural networks tend to memorize the data, which hinders their ability to learn domain-invariant features and ultimately leads to poor generalization.

To address the above issues, a range of data augmentation techniques have been proposed. For visual information, traditional data augmentations [1, 10,11,12] apply various affine transformations, such as horizontal translation, scaling, and squeezing, to enhance the accuracy and robustness of models. In contrast to these global transformations, some patch-level methods [13,14,15] primarily aim to remove patches or add noise to them. However, these operations perform only conservative processing, leaving much of the internal information unexplored. Furthermore, patch-level visual research increasingly treats images as collections of patches. Many visual models are trained directly on patches as input and have demonstrated strong performance [16, 17], highlighting the feasibility of using patches for data representation. Additionally, employing puzzle-like techniques that split and reassemble patches for unsupervised feature learning has shown promising capabilities in knowledge learning and transfer [18]. Therefore, investigating augmentation methods specifically targeting patches holds great potential.

Recently, a line of research known as mixup [19] has emerged. These methods [20,21,22,23,24,25] focus on linear or nonlinear mixing of multiple images and labels, allowing the model to better fit multicategory information from different images. Anchor data augmentation [26] and C-mixup [27] have also been proposed to address regression problems. Nevertheless, these derived works introduce additional computational effort and, due to their strong coupling with labels, cannot be generalized to unlabeled scenarios such as self-supervised learning [28]. Mixup provides a novel approach to linearly combining data, yet existing research has not applied it within a single image to enrich its internal information.

In this work, we propose an innovative data augmentation approach called PatchMix. Using patches as the fundamental building blocks, PatchMix develops a blending strategy inspired by mixup to reorganize the image data. As depicted in Fig. 1, PatchMix significantly expands the diversity of the data representation within the domain while preserving its fundamental characteristics. Compared with existing methods, PatchMix substantially enriches relative positional information, patch combinations, and levels of clarity through cropping, combining and blurring. Label-free and plug-and-play, it can be seamlessly integrated into existing works and even extended to other domains such as self-supervised learning. To enhance the robustness of the trained model, we extend PatchMix to PatchMix-R, which alleviates the limitations of PatchShuffle [29] by introducing reasonable perturbations to adjacent pixels.

Fig. 1
figure 1

A visual comparison of the patch-level methods. Left: Input and traditional patch-based methods. Middle: Our proposed method PatchMix. The notation 2 \(\times \) 2 represents dividing an image into 2 \(\times \) 2 patches, and each patch undergoes PatchMix. Auxiliary lines are added to the samples in the second column. Right: PatchShuffle and our proposed method PatchMix-R. The notation 8 \(\times \) 8 means that each non-overlapping patch contains 8 \(\times \) 8 pixels, and PatchMix-R is performed on each pixel within the patch. In the left column, only half of the patches are processed, while all patches in the images on the right are processed

Our main contributions are threefold:

  • We design PatchMix, a lightweight patch-level data augmentation for images that can be ensembled with other augmentation methods of varying complexity; visualizations showcase the advantages of training with it.

  • We apply PatchMix within patches and propose PatchMix-R. By moderately perturbing adjacent pixels within each patch, it visibly enhances the robustness of models to noisy samples.

  • In the experiments, PatchMix outperforms state-of-the-art methods on CIFAR-10/100 and Tiny-ImageNet. Combined with PatchMix-R, it also improves the robustness of classifiers to noise attacks more than other methods.

2 Related work

2.1 Traditional data augmentation

Traditional data augmentation typically targets individual images, performing basic geometric transformations and color conversions [30]. Based on the content of the augmented images, the traditional approaches can be categorized into image-level and patch-level techniques.

Image-level Random flipping and random cropping, as the most common and effective image-level data augmentations, empirically improve the generalization performance of neural networks on clean data. Moreover, transformations such as sharpness and brightness adjustment and Gaussian noise are also utilized for image augmentation in various works [1, 10, 11]. Subsequently, attention has shifted toward combining different augmentation methods, leading to the proposal of automatic data augmentations [2], such as AutoAugment [31], RandAugment [32], and TrivialAugment [12].

Patch-level Patch-level data augmentation focuses more on the local information of the image. Cutout [13] randomly masks a portion of the image to enhance accuracy. To achieve a balance between accuracy and robustness, Patch Gaussian [14] adds Gaussian noise to a specific patch. Random erasing [15], which requires no parameter learning, generates varying levels of occlusion in training images and is easily deployed.

Traditional data augmentation, although simple, does not considerably alter the representation of the data domain. The patch-level methods target only one or several patches, conservatively preserving the rough structure of the image, which somewhat limits the diversity of its feature representation.

2.2 Mixup

Linear mixing Mixup [19] trains a neural network on linear interpolations of pairs of training images and their corresponding labels, with a random mixing ratio. This regularization-like method has been experimentally shown to improve the accuracy of the model. Manifold Mixup [20] improves upon Mixup by introducing interpolation at the hidden layers to preserve the manifold structure between input samples. To encourage fair and accurate decision boundaries for all subgroups, Subgroup Mixup [21] develops a pairwise mixup scheme to augment training data.

Nonlinear mixing Instead of processing the entire image, CutMix [22] randomly crops the image and fills the masked patch with a patch from another image, mixing the labels according to the proportion of patches. PuzzleMix [23] and SaliencyMix [24] incorporate a saliency signal, rendering the selection and mixing of patches non-random; the mixing proportion of the labels is determined by the ratio of salient regions. To reduce the loss of saliency inference, AutoMix [25] adaptively generates mixed samples based on mixing ratios and feature maps in an end-to-end manner.

Although many derivative works have improved the reliability and richness of sample representations, the additional computational costs are usually not negligible. Simultaneously, the mixing operation based on multiple samples poses a challenging problem for label calculation. It is worth noting that existing mixup methods are all based on multiple images; no mixup technique is yet available for a single image.

3 Proposed method

In this section, we present the general procedure of the proposed method, PatchMix, and examine the mechanisms for controlling the degree of mixing between patches. Moreover, we extend PatchMix to enhance the robustness of the network by building upon the idea of PatchShuffle [29].

Fig. 2
figure 2

An illustration of PatchMix on one image \((H \times W)\) divided into four non-overlapping patches \( \left( \left( H/2 \right) \times \left( W/2 \right) \right) \). The notation \(\odot \) indicates that the pixel values of each patch are multiplied by the corresponding weight values in the right column. The shuffled patches are independent of the original patches. Note that some patches may skip directly to the final image without mixing, as illustrated by the bottom-most patch

3.1 PatchMix

At present, data augmentation techniques predominantly concentrate on operations that involve merging or stitching multiple images [33, 34]. These methods frequently disregard the diversity of feature representations within individual images, depending instead on basic preprocessing steps such as horizontal flipping or random cropping.

In the context of a jigsaw puzzle, an interesting observation is that a picture, once disassembled, can be reassembled in an orderly manner thanks to its key features. Inspired by this concept, PatchMix initially segments an image into distinct patches and applies random shuffling. This approach overcomes the limitations associated with the relative positional constraints of patches, thereby creating opportunities for diverse information expression. However, PatchMix does not merely substitute the original patch; instead, it utilizes the linear method of mixup to blend patches, mitigating the structural disruptions caused by the shuffling. In this section, we provide a detailed explanation of PatchMix and how it works.

3.1.1 Formulations

Consider a matrix \(\varvec{X}\) of dimensions \(H \times W\). Divide \(\varvec{X}\) into non-overlapping patches of \(h \times w\) elements, represented as

$$\begin{aligned} \varvec{X} = \begin{pmatrix} P_{1} & P_{2} & \cdots & P_{\frac{W}{w}} \\ P_{\frac{W}{w}+1} & P_{\frac{W}{w}+2} & \cdots & P_{2\frac{W}{w}} \\ \vdots & \vdots & \ddots & \vdots \\ P_{\left(\frac{H}{h}-1\right)\frac{W}{w}+1} & P_{\left(\frac{H}{h}-1\right)\frac{W}{w}+2} & \cdots & P_{\frac{H}{h}\frac{W}{w}} \end{pmatrix}, \end{aligned}$$
(1)

where \(P_{i}\) represents the i-th patch after splitting. A random binary switch r determines whether the patch \(P_{i}\) undergoes the PatchMix transformation. Let the random variable r follow a Bernoulli distribution, \(r \sim Bernoulli(\epsilon )\), such that \(r = 1\) with probability \(\epsilon \) and \(r = 0\) with probability \(1 - \epsilon \). The resulting patch \(\tilde{P_{i}}\) can be expressed as

$$\begin{aligned} \tilde{P_{i}} = (1-r) P_{i} +r T(P_{i}), \end{aligned}$$
(2)

where \(T(\cdot )\) denotes the PatchMix transformation and is formulated as

$$\begin{aligned} T(P_{i}) = \lambda P_{i}+(1-\lambda ) P_{\mathrm{index}[i]}, \end{aligned}$$
(3)

where \(\mathrm{index}\) is a random permutation of \(\left[ 1, (H/h)\times (W/w) \right] \), and the mixing ratio \(\lambda \) is randomly sampled from a beta distribution.

3.1.2 PatchMix on images

In the process of visual perception, fragments of an object can supply valuable and sufficient information for classification, without the need to consider the entire object or rely on absolute positional relationships [35]. Similarly, in many computer vision tasks, input images can be treated as matrices and subjected to the PatchMix transformation. As depicted in Fig. 2, the image is divided into \(P_{1}\), \(P_{2}\), \(P_{3}\), \(P_{4}\) in accordance with Eq. (1). \(P_{4}\) retains the original patch values, while \(P_{1}\), \(P_{2}\), \(P_{3}\) are mixed based on Eq. (3). After mixing, all patches are reassembled into the image according to the original sequence. The specific processing details are provided in Algorithm 1, and a minimal code sketch is given after it.

Algorithm 1
figure a

PatchMix Procedure
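
For illustration, the following is a minimal PyTorch sketch of the procedure in Eqs. (1)–(3); the function name, tensor layout, and the choice of sampling a single \(\lambda \) per image are simplifying assumptions of ours rather than a reference implementation.

```python
import torch

def patchmix(x, grid=(2, 2), p=0.5, alpha=0.2, beta=0.2):
    """Minimal PatchMix sketch (Eqs. 1-3) for one image tensor x of shape
    (C, H, W); H and W must be divisible by the grid dimensions."""
    c, h_img, w_img = x.shape
    gh, gw = grid
    ph, pw = h_img // gh, w_img // gw          # patch size h x w (Eq. 1)

    # Split the image into non-overlapping patches: (gh*gw, C, ph, pw)
    patches = (x.view(c, gh, ph, gw, pw)
                .permute(1, 3, 0, 2, 4)
                .reshape(gh * gw, c, ph, pw))

    n = gh * gw
    index = torch.randperm(n)                  # shuffled patch order (Eq. 3)
    lam = torch.distributions.Beta(alpha, beta).sample()  # mixing ratio
    mixed = lam * patches + (1 - lam) * patches[index]    # T(P_i), Eq. 3

    # Bernoulli switch r decides per patch whether mixing is applied (Eq. 2)
    r = (torch.rand(n) < p).float().view(n, 1, 1, 1)
    patches = (1 - r) * patches + r * mixed

    # Reassemble the patches into their original spatial layout
    return (patches.view(gh, gw, c, ph, pw)
                   .permute(2, 0, 3, 1, 4)
                   .reshape(c, h_img, w_img))
```

For a 32 \(\times \) 32 CIFAR image, calling patchmix(torch.rand(3, 32, 32)) yields one 2 \(\times \) 2 PatchMix view; the label is untouched, which is what makes the method label-free.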

This paper also elucidates the benefits of our method in adjusting the classification boundary. As shown in Fig. 3, when the training data are limited, the classification boundary is prone to overfitting during training. In such cases, test data are typically not well distinguished. Traditional data augmentation methods (random flipping and random cropping) can to some extent expand the input space of the data, thereby reducing the occurrence of overfitting. PatchMix further expands the representation of image data by manipulating the information through cropping, combining, blurring, etc. The generated data are typically more complex and diverse. Thus, the trained model is able to better fit the features of different classes of data, thereby adjusting the classification boundary for improved discrimination.

Fig. 3
figure 3

Classification boundary adjustment by applying data augmentation. Left: Classification boundary learned only from input data. Middle: Classification boundary adjusted by adding augmented data with traditional methods, random flipping (RF) and random cropping (RC). Right: Classification boundary adjusted by adding augmented data with RF, RC and PatchMix. Partial augmentation processes are visualized at the bottom

Fig. 4
figure 4

The mixing effects of PatchMix under different beta distributions. In each subfigure, the left displays the curve shape of the probability density function under different \(\alpha \) and \(\beta \), while the right shows the process of mixing two patches after determining the mixing ratio. a \(\alpha \) = 0.2, \(\beta \) = 0.2: the mixed patch leans toward either the original patch or the shuffled patch; b \(\alpha \) = 2.0, \(\beta \) = 2.0: the mixed patch leans toward a fusion of the original patch and the shuffled patch; c \(\alpha \) = 1.0, \(\beta \) = 0.2: the mixed patch leans toward the original patch, signifying minimal perturbation; d \(\alpha \) = 0.2, \(\beta \) = 1.0: the mixed patch leans toward the shuffled patch, indicating the addition of significant perturbation

3.1.3 Mixing control

In PatchMix, the blending of patches is controlled by associated hyperparameters, including the patch-mixed probability p, the size of the divided patch \(h \times w\) and the parameters of the beta distribution \(\alpha ,\beta \). Modifying these parameters enables control over the mixing ratio, size, and degree, thus helping to mitigate the problem of excessive mixing. While PatchMix serves to enhance the richness of image information, it frequently increases the training complexity of the network. Therefore, we believe that for complex datasets, such as those sensitive to spatial information, it is often necessary to reduce the mixing ratio or degree to keep the augmentation within an appropriate range. This paper conducts experimental comparisons on different datasets, and the considerations for selecting hyperparameters are discussed in the ablation study. In the following, we primarily investigate the impact of the beta distribution with varying parameter values on the mixture.

To ensure a wide-ranging set of training samples, it is crucial to strike a balance between blending information from different patches and preserving the integrity of certain patches. The beta distribution, a continuous probability distribution ranging from 0 to 1, serves as a useful tool for this purpose. As illustrated in Fig. 4, the probability density function of the beta distribution is governed by the parameters \(\alpha ,\beta \), which control the mixing proportion between different patches. To generate diverse training samples, an effective strategy is to increase the perturbation of the original patch. In contrast, if preserving the overall structure is of paramount importance, assigning a larger weight to the original patch may be more appropriate. Consequently, the optimal values for \(\alpha ,\beta \) can be chosen based on the specific augmentation task at hand.
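
As a quick numerical illustration (under the Eq. (3) convention that \(\lambda \) weights the original patch), the following snippet samples \(\lambda \) for the four Beta shapes of Fig. 4; the summary statistics and thresholds are purely illustrative:

```python
import torch

# Under Eq. (3), lambda near 1 keeps the original patch; lambda near 0
# favors the shuffled patch. "Extreme" counts samples in either tail.
for a, b in [(0.2, 0.2), (2.0, 2.0), (1.0, 0.2), (0.2, 1.0)]:
    lam = torch.distributions.Beta(a, b).sample((100000,))
    extreme = ((lam > 0.9) | (lam < 0.1)).float().mean()
    print(f"Beta({a}, {b}): mean lambda = {lam.mean():.2f}, "
          f"P(extreme) = {extreme:.2f}")
```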

3.2 PatchMix-R

In relation to data augmentation, robustness against perturbations is of particular importance [36, 37]. PatchShuffle [29] has been demonstrated to be an effective method for enhancing robustness by swapping the positions of adjacent pixels. Inspired by this idea, we apply PatchMix within patches, replacing the original random swapping with a more principled mixing method.

3.2.1 Formulations

PatchMix-R, similar to Eqs. (1)–(2), also carries out the patch-splitting operation initially. However, unlike Eq. (3), it mixes nearby pixels within each patch. To accomplish this, each pixel in the patch is sequentially numbered from 1 to \(h \times w\). Let \({p}_{{j}}^{i}\) represent the j-th pixel in patch \({P}_{i}\). PatchMix-R can be expressed as follows,

$$\begin{aligned} \tilde{p}_{j}^{i} = \lambda {p}_{{j}}^{i}+(1-\lambda ) {p}_{\mathrm{index}[j]}^{i}, \end{aligned}$$
(4)

where \(\mathrm{index}\) is a random permutation of \(\left[ 1, h \times w \right] \), and the mixing ratio \(\lambda \) is also sampled from a beta distribution.
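
A corresponding sketch of Eq. (4) is given below; sharing one pixel permutation and one \(\lambda \) across all patches of an image is a simplification of ours, not a requirement of the method:

```python
import torch

def patchmix_r(x, patch=(4, 4), p=0.5, alpha=0.2, beta=0.2):
    """Minimal PatchMix-R sketch (Eq. 4): within every non-overlapping
    patch, blend each pixel with a randomly chosen pixel of the same patch.
    x: float tensor of shape (C, H, W); H, W divisible by the patch size."""
    c, h_img, w_img = x.shape
    ph, pw = patch
    gh, gw = h_img // ph, w_img // pw

    # (gh*gw, C, ph*pw): one flattened row of pixels per patch
    pix = (x.view(c, gh, ph, gw, pw)
            .permute(1, 3, 0, 2, 4)
            .reshape(gh * gw, c, ph * pw))

    index = torch.randperm(ph * pw)          # shuffled pixel order in a patch
    lam = torch.distributions.Beta(alpha, beta).sample()
    mixed = lam * pix + (1 - lam) * pix[:, :, index]

    # Bernoulli switch per patch, analogous to Eq. (2)
    r = (torch.rand(gh * gw) < p).float().view(-1, 1, 1)
    pix = (1 - r) * pix + r * mixed

    return (pix.view(gh, gw, c, ph, pw)
               .permute(2, 0, 3, 1, 4)
               .reshape(c, h_img, w_img))
```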

3.2.2 PatchMix-R on images

Assuming the image and patch sizes are \(224 \times 224\) pixels and \(4 \times 4\) pixels, respectively, the original image is divided into \(56 \times 56\) patches. For each patch, PatchMix-R shuffles and mixes its 16 pixels according to Eq. (4). Figure 5 illustrates the effects of PatchMix-R and PatchShuffle for different patch sizes. In comparison with PatchShuffle, PatchMix-R enhances the image through mixup, resulting in a more balanced and reasonable pixel transition. In terms of preserving the overall structure, as well as handling details and textures, PatchMix-R perturbs the image in a manner that better conforms to the original pixel distribution. The experiments demonstrate that PatchMix-R effectively improves robustness.

Fig. 5
figure 5

Examples of PatchShuffle (PS) and PatchMix-R (PM-R). The patch sizes are 2 \(\times \) 2 pixels (left) and 4 \(\times \) 4 pixels (right). Image selection was based on overall contrast (I, III), detail (IV, VI) and texture (II, V). Images in the bottom row are zoomed-in regions

4 Experiment

4.1 Implementation details

Datasets Several open-source image classification datasets are used in our experiments, including CIFAR-10/100 [38] and Tiny-ImageNet [39]. The CIFAR-10 dataset contains 60,000 images with a resolution of 32 \(\times \) 32, spread over 10 classes such as airplane, automobile and bird. The CIFAR-100 dataset expands the number of classes to 100, with each class having 600 images. Tiny-ImageNet, another popular dataset, consists of 200 classes, each with 500 training images, 50 validation images and 50 testing images. The resolution of each sample is 64 \(\times \) 64. Compared to the extensive and challenging ImageNet dataset, Tiny-ImageNet serves as a smaller and more manageable version that is often utilized for benchmarking and evaluating new machine learning methods.

Architectures and settings Four architecture families are adopted on the above datasets: PreActResNet [40] (PreActResNet-18, PreActResNet-34, PreActResNet-50), WideResNet [41] (WideResNet-16-8, WideResNet-28-10), DenseNet [42] (DenseNet-100-BC) and MobileNet [43] (MobileNetV2). We follow the overall training protocol in [23], except that we train PreActResNet and MobileNet for 300 epochs, and WideResNet and DenseNet for 200 epochs. On CIFAR-10/100, initial data augmentations involve random flipping and random cropping with 4-pixel padding at 32 \(\times \) 32 resolution. The training settings include an SGD optimizer with a weight decay of 0.0001, momentum of 0.9, and batch size of 100. The initial learning rate is 0.2, decaying by a factor of 0.1 at epochs 100 and 200 for PreActResNet and MobileNet, and at epochs 120 and 170 for WideResNet and DenseNet. On Tiny-ImageNet, basic augmentations encompass random flipping and random cropping at 64 \(\times \) 64 resolution, and we use similar training settings to those for CIFAR.
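
For concreteness, the optimizer and learning-rate schedule described above can be written as the following sketch; the placeholder model stands in for PreActResNet-18, which we do not reproduce here:

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)   # placeholder for PreActResNet-18
optimizer = torch.optim.SGD(model.parameters(), lr=0.2,
                            momentum=0.9, weight_decay=1e-4)
# Decay by 0.1 at epochs 100 and 200 (PreActResNet/MobileNet schedule);
# use milestones=[120, 170] for WideResNet/DenseNet instead.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100, 200], gamma=0.1)
```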

Mix-related hyperparameters In PatchMix, the patch size h and w are set to half of H and W. Empirically, the patch-mixed probability p is set to 0.5 to preserve the original information of patches. For the beta distribution, both \(\alpha \) and \(\beta \) are set to 0.2. In PatchMix-R, we primarily adopt the settings from PatchShuffle [29], where only 5% of the training data are randomly augmented, and the patch size is set to 4 \(\times \) 4 pixels. Similar to PatchMix, the patch-mixed probability p is set to 0.5 and the mixing values are randomly selected from the beta distribution with \(\alpha = 0.2\), \(\beta = 0.2\).
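
Putting these defaults together with the sketches from Sect. 3 (assuming the hypothetical patchmix and patchmix_r functions defined there are in scope), one training-time augmentation step might look like:

```python
import torch

x = torch.rand(3, 32, 32)                  # one CIFAR-scale training image
# PatchMix: h = H/2, w = W/2 (a 2x2 grid), p = 0.5, Beta(0.2, 0.2)
x = patchmix(x, grid=(2, 2), p=0.5, alpha=0.2, beta=0.2)
# PatchMix-R: 4x4-pixel patches, applied to roughly 5% of the training data
if torch.rand(()).item() < 0.05:
    x = patchmix_r(x, patch=(4, 4), p=0.5, alpha=0.2, beta=0.2)
```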

4.2 Experimental results and analysis

4.2.1 Comparison with image-level methods

Random flipping (RF) and random cropping (RC). A comparison of our method with RF and RC is presented in Table 1. When applied individually, RC outperforms the other two methods with an error rate of 5.88%. Combining PatchMix with RF and RC reduces the error rates by 2.72% and 1.43%, respectively. Therefore, PatchMix can serve as a supplement to existing regularization techniques. The ensemble of the three methods yields an error rate of 3.66%, a 6.91% improvement over the baseline without any augmentation. In subsequent experiments, we adopt RF and RC as initial data transformations.

Table 1 Test errors (%) of PatchMix, random flipping and random cropping on CIFAR-10 with PreActResNet-18

Automatic augmentation techniques. TrivialAugment [12], an automatic augmentation technique, integrates multiple data augmentation methods such as rotation, scaling, color adjustment, and noise addition. Table 2 compares our method with TrivialAugment. When applied alone, PatchMix outperforms TrivialAugment, reducing the error by 0.31%. Evidently, PatchMix is more effective in enhancing accuracy than the basic augmentation techniques and their ensembles in TrivialAugment. Combining PatchMix and TrivialAugment achieves a 3.35% error rate, a 1.48% improvement over the baseline.

Table 2 Test errors (%) of PatchMix and TrivialAugment on CIFAR-10 with PreActResNet-18

4.2.2 Comparison with patch-level methods

Traditional patch-level methods [13,14,15] primarily focus on simple noise removal or addition to patches. In order to comprehensively compare classification performance and robustness, this paper further introduces PatchShuffle and PatchMix-R. PatchMix+PatchMix-R refers to performing PatchMix on the image first, followed by PatchMix-R. As detailed in Table 3, PatchMix+PatchMix-R demonstrates superior performance across all datasets and models, reducing the test error by 0.91–3.05% compared to the vanilla method. Additionally, the individual performance of PatchMix surpasses that of the other traditional patch-level methods.

Table 3 Test errors (%) of the patch-level methods on various models and datasets composed of clean data.

4.2.3 Ensemble with mixup-based methods

Mixup-based methods combine or merge features from multiple images. This paper conducts an experimental study on the ensemble of PatchMix and various multi-image methods. As shown in Table 4, even when applied to a single image, PatchMix (\(\alpha \) = 0.2, \(\beta \) = 0.2) surpasses many multi-image augmentation methods, such as Mixup, Manifold Mixup, and CutMix. Additionally, when applying PatchMix (\(\alpha \) = 1.0, \(\beta \) = 0.2) as the initial augmentation step before mixup-based methods, the classification performance of the trained models improves by 0.55–3.21%. Referring to Fig. 4, we infer that PatchMix, when used alone or combined with simple data augmentation techniques, benefits from the richness and diversity of patch mixing (see Fig. 4a). In contrast, when ensembled with complex methods, keeping the patches closer to the original image and applying moderate perturbations (see Fig. 4c) enables the synergistic advantages of the different augmentation methods to be fully realized. Therefore, PatchMix can be ensembled with various data augmentation methods by simply controlling the beta distribution, thereby enhancing the performance of the model.

Table 4 Test errors (%) of the mixup-based methods combined with the PatchMix in Tiny-ImageNet classification with PreActResNet-18

4.2.4 Performance on large-scale images

This paper also evaluates the performance of our method on large-scale images. In this experiment, we primarily follow the training protocol outlined in [44] and select the VGG-19 [10] and WideResNet-101-2 models, pre-trained on the ImageNet dataset. Three distinct categories of large-scale datasets are chosen: Caltech-101 [45], which is relatively similar to the source dataset; Describable Textures [46], which differs significantly from the source dataset; and the commonly encountered fine-grained datasets, Stanford Cars [47] and Oxford 102 Flower [48]. Given that the images in these datasets have large and varying sizes, we resize them as necessary. During training, we also record the classification errors at different epochs. The experimental results are presented in Table 5. Across the different types of classification datasets, PatchMix performs best on large-scale datasets and helps the pre-trained model quickly adapt to the new data domain.

Table 5 Test errors (%) of PatchMix and other augmentation methods on large-scale datasets after different epochs of fine-tuning

4.2.5 Performance on self-supervised learning and transfer learning scenarios

Self-supervised learning aims to learn useful representations from scalable unlabeled data without relying on human annotation. The Siamese network [28, 49,50,51] is one promising family among many self-supervised learning approaches and outperforms supervised counterparts across numerous visual benchmarks. SimSiam [28], a typical Siamese network, aims to learn similar feature representations for different views of the same image, enabling effective transfer to various downstream tasks. Data augmentation techniques in SimSiam include random cropping, flipping, color jittering, etc., which generate the inputs to the encoder. In this paper, we introduce PatchMix into the data augmentation process of SimSiam to further investigate its advantages in self-supervised learning and transfer scenarios.
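
To illustrate the plug-and-play claim, PatchMix can be appended as the final, label-free step of a SimSiam-style view pipeline. The transform parameters below are common CIFAR-scale choices rather than the exact recipe used in our experiments, and patchmix refers to the sketch in Sect. 3.1:

```python
from torchvision import transforms

view_augment = transforms.Compose([
    transforms.RandomResizedCrop(32, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
    # PatchMix inserted as the last, label-free step of view generation
    transforms.Lambda(lambda x: patchmix(x, grid=(2, 2), p=0.5)),
])
# SimSiam consumes two independent stochastic views of the same image:
# view1, view2 = view_augment(img), view_augment(img)
```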

Table 6 Test errors (%) of KNN classifier in the self-supervised stage and test errors (%) of the linear classifier in the downstream classification task when using the pre-trained model.

In the self-supervised learning phase, the SimSiam network is trained on the CIFAR dataset using PreActResNet-18. We use a KNN [52] (k = 1) classifier as a monitor of the training progress. By comparing the predictions of the KNN classifier with the ground-truth labels, the performance of the model can be evaluated. In the downstream classification task, we employ the pre-trained model with frozen weights to train a supervised linear classifier on the corresponding CIFAR dataset. The classification performance is quantified by the accuracy of the linear classifier. Importantly, data augmentation is exclusively employed during the self-supervised learning phase. The specific results are shown in Table 6. Evidently, PatchMix empowers the self-supervised model with better representation capabilities, resulting in superior performance when transferred to downstream tasks.
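
A hedged sketch of such a k = 1 monitor is given below; it scores each test sample by its nearest training feature under cosine similarity, and the encoder interface and feature-bank handling are our own assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_monitor(encoder, bank_feats, bank_labels, test_loader, device="cpu"):
    """Assign each test sample the label of its nearest training feature
    (k = 1) under cosine similarity, and report the resulting accuracy."""
    encoder.eval()
    bank = F.normalize(bank_feats, dim=1)            # (N, D) feature bank
    correct, total = 0, 0
    for x, y in test_loader:
        feats = F.normalize(encoder(x.to(device)), dim=1)
        nearest = (feats @ bank.t()).argmax(dim=1)   # nearest-neighbor index
        correct += (bank_labels[nearest].cpu() == y).sum().item()
        total += y.numel()
    return correct / total
```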

4.3 Robustness against corruption

4.3.1 Performance on the CIFAR-C dataset

CIFAR-10-C and CIFAR-100-C [53] are two prevalent datasets for evaluating the robustness of computer vision models. Both datasets consist of the original CIFAR test images corrupted by 15 distinct distortion types, each injected at five intensity levels. Comparing different patch-level methods across various datasets and models, this paper presents the experimental results in Table 7. Our proposed PatchMix-R outperforms previous approaches in terms of robustness against unseen corruptions, reducing the test error by 8.79–13.92%. As for PatchMix, it also exhibits better robustness than the other methods. Notably, the robustness of PatchMix-R decreases slightly, by 0.19–1.59%, after incorporating PatchMix. One plausible interpretation is that the amalgamation of information from varying patches by PatchMix partially interferes with the intended perturbations of PatchMix-R. Therefore, how to effectively integrate these two facets will be an intriguing research topic in future work. Meanwhile, to demonstrate intuitively the advantages of our approach under adversarial perturbation, this paper also presents the errors for different corruptions and methods on CIFAR-10-C. In Table 8, our method exhibits the best or second-best performance on the various types of perturbation. Specifically, PatchMix-R performs best on noise, weather, and digital perturbations.

Table 7 Test errors (%) of patch-level methods on various models and datasets composed of corrupted data.
Table 8 Clean error (%), mCE (%), and corruption error (%) of different corruptions and methods on CIFAR-10-C.
Fig. 6
figure 6

The CIFAR-ORS dataset consists of three types of algorithmically generated corruptions: occlusions, rotations and scale variations. Each type of corruption has five levels of severity, resulting in 15 distinct corruptions

4.3.2 Performance on proposed CIFAR-ORS dataset

To specifically address common challenges such as occlusions, rotations, and scale variations, this paper proposes a new perturbation dataset called CIFAR-ORS. Following the design methodology in [53], we algorithmically incorporate the above three types of perturbation into the test sets of CIFAR-10 and CIFAR-100. Each perturbation is represented at five severity levels, as shown in Fig. 6. After applying the 15 transformations to each image, we obtain the CIFAR-ORS dataset, comprising CIFAR-10-ORS and CIFAR-100-ORS.
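
The three corruption generators can be sketched as follows; the severity-to-parameter mappings shown are illustrative assumptions, not the exact values used to build CIFAR-ORS, and square inputs are assumed:

```python
import torch
import torchvision.transforms.functional as TF

def occlude(x, severity):
    """Zero out a random square whose side grows with severity (1-5)."""
    _, h, w = x.shape
    s = max(1, int(h * 0.1 * severity))
    top = torch.randint(0, h - s + 1, (1,)).item()
    left = torch.randint(0, w - s + 1, (1,)).item()
    x = x.clone()
    x[:, top:top + s, left:left + s] = 0.0
    return x

def rotate(x, severity):
    """Rotate by an angle that grows with severity."""
    return TF.rotate(x, angle=6.0 * severity)

def rescale(x, severity):
    """Shrink the content with severity, then pad back to the input size."""
    _, h, w = x.shape
    new_h = max(1, int(h * (1.0 - 0.1 * severity)))
    small = TF.resize(x, [new_h, new_h], antialias=True)
    pad = h - new_h
    return TF.pad(small, [pad // 2, pad // 2, pad - pad // 2, pad - pad // 2])
```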

The performance of different patch-level methods on CIFAR-ORS is compared in Table 9. PatchMix achieves optimal or near-optimal performance when dealing with patch-level perturbations, such as occlusion or scale. In addition, PatchMix-R excels in scenarios involving rotation. Based on the previous experiments, we infer that PatchMix-R focuses more on enhancing the positional relationships between adjacent pixels, which can effectively alleviate interference at the level of details. On the other hand, PatchMix enhances interactions between patches, making it notably efficacious against patch-level perturbations.

Table 9 Test errors (%) of patch-level methods on the CIFAR-ORS dataset with various models.

4.4 Class activation map (CAM) analysis

Class activation map (CAM) [54] identifies the regions of an input image on which the model concentrates its attention to recognize an object. In the experiments, this paper computes CAMs for a vanilla WideResNet-28-10 model equipped with various patch-level and mixup-based data augmentation methods on the small-scale dataset CIFAR-10. Figure 7 demonstrates that most existing state-of-the-art (SOTA) techniques, such as Cutout and PuzzleMix, often concentrate on specific representative parts of the content, such as the head of a bird, the wheels of a car, or the legs of a horse. The proposed PatchMix directs the model's attention toward the target object with higher precision than the other methods, which indicates that PatchMix enables the network to learn comprehensive information about the classes, rather than merely memorizing key features.
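
For reference, a generic CAM computation following [54] can be sketched as below; the features and fc attribute names are assumptions about the backbone structure, not part of the method:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def class_activation_map(model, x, class_idx):
    """CAM for one image x of shape (1, C_in, H, W): weight the final conv
    feature maps by the classifier weights of class_idx, then upsample."""
    feats = model.features(x)                    # (1, C, h, w) conv maps
    weights = model.fc.weight[class_idx]         # (C,) classifier weights
    cam = torch.einsum("c,chw->hw", weights, feats[0])
    cam = F.relu(cam)                            # keep positive evidence only
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return F.interpolate(cam[None, None], size=x.shape[-2:],
                         mode="bilinear", align_corners=False)[0, 0]
```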

Fig. 7
figure 7

CAM visualizations on images from CIFAR-10. The proposed data augmentation method guides the model to focus precisely on the target object

Fig. 8
figure 8

CAM visualizations on large-scale images. The proposed data augmentation method has multiple advantages in assisting model fine-tuning. Pre-trained (the second row): CAMs from the model pre-trained on ImageNet. Fine-tuned (the third and fourth rows): CAMs from the pre-trained model after fine-tuning

Images from CIFAR-10 typically contain only one object. To explore the advantages of PatchMix on large-scale datasets that may contain multiple objects, this paper also visualizes the CAMs computed from the fine-tuned model of Table 5. We primarily utilize the Caltech-101 dataset and the pre-trained WideResNet-101-2 model. Figure 8 illustrates a similar effect when searching for a specific object in a scene with multiple objects. In Fig. 8a, PatchMix captures comprehensive features of the target object, including the head of the anchor, the bodies of the deer and the grand piano. In Fig. 8b, PatchMix effectively recognizes multiple target instances, e.g., scattered ibises and ants, and overlapping elephants. In Fig. 8c, PatchMix enables the model to accurately identify the target chain and pole, and the most representative feature regions of the target are attended to, e.g., the significant areas for recognizing the dolphin.

4.5 Ablation study

When implementing PatchMix on CNN training, the evaluation of hyperparameters becomes crucial. To demonstrate the impact of these hyperparameters on the model performance, experiments are performed on the CIFAR-10/100 and Tiny-ImageNet datasets using the PreActResNet-18 network under varying hyperparameter settings.

4.5.1 The effect of patch size

In this section, we verify the effect of the patch size, which determines the range of mixing. Following [29, 55] for image partitioning, non-overlapping square patches are adopted, as opposed to irregular shapes or overlapping sampling. This approach is considered the simplest yet most effective way to validate the feasibility of PatchMix. As illustrated in Table 10, PatchMix exhibits optimal performance when the patch height (h) and width (w) are set to half of the image height (H) and width (W). However, when h and w are reduced to 1/4 and 1/8 of H and W, the test error gradually increases and the classification performance of PatchMix weakens. These ablation results underscore the importance of the patch size.

Table 10 Test errors (%) with different patch sizes on CIFAR-10/100 and Tiny-ImageNet
Table 11 Test errors (%) with different patch-mixed probabilities on CIFAR-10/100 and Tiny-ImageNet
Table 12 Test errors (%) with four typical types of beta distributions controlled by different \(\alpha \) and \(\beta \) on CIFAR-10/100 and Tiny-ImageNet.

Analogous to the jigsaw puzzle, if the image is subdivided into a greater number of smaller fragments, it will become increasingly challenging to establish correspondences, which leads to the loss of vital information. Consequently, PatchMix adopts a straightforward strategy of setting the patch size to half of the image size.

4.5.2 The effect of patch-mixed probability

The patch-mixed probability p determines the mixing ratio of internal information within an image. The higher p is, the richer and more complex the information represented in the augmented image. In this section, we further investigate the impact of p on the model's performance, as presented in Table 11. It can be observed that performance improves in the presence of patch mixing and is generally optimal for p in the range 0.4–0.7. When the value of p is too low, the advantages of PatchMix may not be fully realized. Conversely, if all patches undergo mixing operations, the complexity of the combinations may become excessively high. Therefore, in our experiments, p is chosen to be 0.5.

Table 13 Test errors (%) with different \(\alpha \) and \(\beta \) of beta distribution on CIFAR-10/100 and Tiny-ImageNet.
Table 14 Epoch duration (s) with the incorporation of PatchMix for different methods in CIFAR10 with PreActResNet18

4.5.3 The effect of beta distribution

In Mixup [19], both \(\alpha \) and \(\beta \) of the beta distribution are set to 0.2, yielding optimal experimental outcomes. However, the specific reasons behind this choice remain largely unexplained. Hence, this section delves into this aspect in the context of PatchMix.

This study initially compares the four typical beta distributions with distinct shapes shown in Fig. 4. For each distribution, three pairs of values are taken to represent different levels of intensity. From Table 12, the best outcome is achieved when setting \(\alpha \) and \(\beta \) to the same value within the range 0–1, which is considered the optimal mixing range for image classification. Simply put, the beta distribution in Fig. 4c tends to retain the features of the original patches, providing stability, while the beta distribution in Fig. 4d increases regional diversity. The aim of PatchMix is to balance these tendencies; hence the parameter range demonstrated in Fig. 4a.

To further probe this particular distribution, different \(\alpha \) and \(\beta \) within the range 0–1 are investigated. From Table 13, it can be concluded that the classification performance does not differ significantly under this distribution, and the optimal results are obtained with both \(\alpha \) and \(\beta \) set to around 0.2. Therefore, the parameters of the beta distribution in our experiments are uniformly chosen as 0.2, without exploring more elaborate values.

4.6 Discussion

Extensions and variations PatchMix offers several avenues for further exploration and research. It is worth investigating the impact of using patches with different or irregular sizes, as this may lead to more comprehensive representations. Similar to the concept of Manifold Mixup [20], exploring the integration of PatchMix into intermediate layers of the model would be valuable. Furthermore, previous studies have demonstrated the effectiveness of replacing patches between pairs of images in improving network performance on visual tasks [22, 56]. Building upon this foundation, the idea of mixing patches could potentially be introduced to further enhance feature extraction capabilities.

In weakly supervised object detection and segmentation tasks, many CAM-based pseudo-label generation methods suffer from focusing only on partially discriminative foreground regions [57, 58]. PatchMix provides a promising solution by accurately attending to the holistic characteristics of the classes and holds great potential for application. In addition to CNNs, current research has also begun exploring image augmentation techniques suitable for the vision transformer (ViT) architecture [59]. Investigating PatchMix, which is based on rich relative positional relationships, is worth considering in terms of its potential impact on encoding positional information.

Computational overhead The implementation of PatchMix itself is not inherently complex and is not constrained by the dataset or network architecture. This section analyzes the computational overhead of PatchMix to evaluate its scalability. The experiments are conducted on a server equipped with an Intel Xeon Silver 4216 CPU running at 2.10 GHz. Each set of control experiments is performed exclusively on the same NVIDIA Tesla T4 graphics card. We measure and compare the average epoch durations of various methods during training, and the findings are presented in Table 14. PatchMix exhibits remarkable efficiency compared to other data augmentation methods, with a mere 0.54 s of computational overhead per epoch. Additionally, the integration of PatchMix with other methods imposes minimal overhead, requiring a mere 2% additional time investment. Consequently, in the realm of expansive datasets and complex networks, PatchMix emerges as a versatile plug-and-play data augmentation solution, offering the advantage of controllable computational overhead.

5 Conclusion

In this paper, we propose two data augmentation techniques, PatchMix and PatchMix-R, with the goal of improving the generalization and robustness of classification models. PatchMix introduces mixup into traditional patch-level augmentation, generating a large amount of new training data through a random but controllable process. The method integrates seamlessly with other data augmentation methods of varying complexity under different beta distributions. CAMs visualize the advantages of PatchMix when applied to both small-scale and large-scale images. PatchMix-R is an extension of PatchMix that mixes pixels within each patch instead of across patches. This modification significantly enhances the ability of neural networks to resist adversarial perturbations. Extensive experiments on widely used classification datasets and networks verify the feasibility and effectiveness of our methods.