1 Introduction

Image matting is a challenging task in computer vision that aims to separate the foreground from a natural image by predicting the transparency of each pixel. It has been applied in biometric recognition, including finger-vein recognition [1], gait recognition [2, 3], and face verification [4], since it finely delineates target contours and thereby facilitates these recognition tasks.

The image \(\textbf{I}\) can be represented as a convex combination of the foreground \(\textbf{F}\) and the background \(\textbf{B}\).

$$\begin{aligned} \textbf{I}_i=\alpha _i\textbf{F}_i+(1-\alpha _i)\textbf{B}_i \qquad \alpha _i \in \left[ 0,1 \right] \end{aligned}$$
(1)

where \(\alpha _i\), \(\textbf{F}_i\), and \(\textbf{B}_i\) respectively denote the transparency, foreground color, and background color at position i in the image. This is a highly underdetermined problem: in Eq. (1), each pixel has three unknowns (\(\alpha _i\), \(\textbf{F}_i\), and \(\textbf{B}_i\)) but only one observation (\(\textbf{I}_i\)). The trimap is introduced to provide additional constraints. It consists of three parts: the known foreground region, where the alpha value is known to be 1; the known background region, where the alpha value is 0; and an unknown region, where the alpha value needs to be determined. Existing deep learning-based matting methods have greatly surpassed traditional methods in the quality of the predicted alpha mattes, and deep learning-based matting has consequently attracted rapidly growing attention.
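As a concrete illustration of Eq. (1) and the trimap convention, the following is a minimal sketch in PyTorch (the paper's stated framework); the tensor shapes and the marker value for the unknown region are our assumptions, not taken from the authors' code.

import torch

def composite(alpha: torch.Tensor, fg: torch.Tensor, bg: torch.Tensor) -> torch.Tensor:
    """Blend foreground and background with a per-pixel alpha matte (Eq. 1).

    alpha: (B, 1, H, W), transparency in [0, 1]
    fg, bg: (B, 3, H, W), foreground and background colors
    """
    return alpha * fg + (1.0 - alpha) * bg

# Trimap convention: alpha is known to be 1 in the foreground region and 0 in
# the background region; an intermediate marker (often 0.5 in practice, an
# assumption here) labels the unknown region whose alpha must be estimated.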

The loss function is a fundamental component of deep learning, as it measures the difference between a model's predictions and the true labels, providing the optimization objective that allows the model to gradually improve its accuracy. The alpha prediction loss is computed as the average absolute difference between the predicted and ground-truth alpha mattes. The composition loss, introduced by [5], uses the ground-truth foreground and background colors to supervise the network at the pixel level. Gradient loss [6] was proposed to sharpen the predicted alpha matte and reduce excessive smoothness. The Laplacian pyramid loss [7], a multi-scale technique, measures the disparities between the predicted and ground-truth alpha mattes in both local and global regions. Together, these matting loss functions provide supervision at the pixel level as well as of the gradient and detail variations of the alpha channel, which improves the accuracy and quality of the matting results. However, they only consider the differences between the alpha matte predicted by the network and the ground-truth alpha matte; consequently, the network may not effectively learn the valuable information inherent in the ground-truth across different feature layers.

In general, increasing the depth of a neural network improves its representation ability to some extent. To train deep networks better, it is common to add auxiliary supervision to certain layers. Some methods [8, 9] supervise the multi-scale features produced by the decoder at different scales. However, directly supervising these features with ground-truth alpha mattes forces the small-scale decoder outputs to strictly approximate the ground-truth, which may result in overfitting; Fig. 1 provides an example. When a matting method is applied to scenarios different from its training images, the small-scale decoder predictions may be inaccurate, and any such prediction error degrades the quality of the final alpha matte.
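For reference, below are hedged PyTorch sketches of the three simpler losses named above (alpha prediction, composition, and gradient loss). Exact formulations vary between papers, so these follow common practice rather than any specific implementation; the Laplacian pyramid loss is omitted for brevity.

import torch

def alpha_prediction_loss(pred: torch.Tensor, gt: torch.Tensor, mask: torch.Tensor):
    # Average absolute difference between predicted and ground-truth alpha,
    # restricted to the trimap's unknown region (mask is 1 there, 0 elsewhere).
    return (mask * (pred - gt).abs()).sum() / mask.sum().clamp(min=1)

def composition_loss(pred, fg, bg, image, mask):
    # Re-composite with the predicted alpha (Eq. 1) and compare to the input image.
    comp = pred * fg + (1.0 - pred) * bg
    return (mask * (comp - image).abs()).sum() / (3.0 * mask.sum().clamp(min=1))

def gradient_loss(pred, gt, mask):
    # L1 distance between finite-difference gradients of prediction and target.
    def grad_xy(a):
        return a[..., :, 1:] - a[..., :, :-1], a[..., 1:, :] - a[..., :-1, :]
    pgx, pgy = grad_xy(pred)
    ggx, ggy = grad_xy(gt)
    return ((mask[..., :, 1:] * (pgx - ggx).abs()).sum()
            + (mask[..., 1:, :] * (pgy - ggy).abs()).sum()) / mask.sum().clamp(min=1)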

Fig. 1. From left to right: the input image, the trimap, the ground-truth, and the alpha mattes predicted by MatteFormer and by our method. Serious errors appear in MatteFormer's prediction of the intermediate details of the alpha matte; these errors result from inaccurate alpha estimates on low-resolution features.

We introduce a loss function called Alpha Local Difference Loss (ALDL), which leverages local differences within the ground-truth to supervise features at multiple resolution scales. Unlike gradient loss, which only describes the gradient of the central pixel in the x and y directions without explicitly capturing the variations between the central pixel and its local neighbors, ALDL captures the differences between each pixel and its surrounding pixels in the ground-truth and uses these differences as constraints on the image features. Furthermore, instead of applying strict supervision to early decoder outputs [8, 9], ALDL is a loose supervision that leads the matting network to learn the relationships between features rather than to strictly adhere to specific numerical values.

This work’s main contributions can be summarized as follows:

  1. We propose a loss function called Alpha Local Difference Loss, designed specifically for matting networks, which supervises local feature relationships. It can be easily integrated into existing networks with virtually no extra parameters.

  2. Through experiments on multiple networks and datasets, we demonstrate that Alpha Local Difference Loss improves the generalization capability of matting networks, yielding enhanced object details in the predicted mattes.

2 Methodology

In this section, we illustrate how to define the difference between each point and its local neighboring points based on the local information of the ground-truth alpha. The local difference is embedded into the image features, and the Alpha Local Difference Loss is proposed to constrain the network in learning this difference. Furthermore, an analysis is conducted to determine which features in the neural network should be supervised.

Fig. 2. The process of calculating ALDL.

2.1 Local Similarity of Alpha Labels and Features

Consistent with the assumption of closed-form matting [10], we assume that pixels within a local region share the same foreground color \(\textit{F}\) and background color \(\textit{B}\). According to Eq. (1), we can then obtain the pixel value difference \(\varDelta I\) between two points \(\textbf{x}\) and \(\textbf{y}\) within a local region. Similarly, using the ground-truth alpha, we can obtain the alpha value difference \(\varDelta \alpha \) between points \(\textbf{x}\) and \(\textbf{y}\).

$$\begin{aligned} I_x-I_y=\alpha _x F+(1-\alpha _x)B-\alpha _yF-(1-\alpha _y)B=(\alpha _x-\alpha _y)(F-B) \end{aligned}$$
(2)
$$\begin{aligned} \varDelta I= \varDelta \alpha (F-B) \end{aligned}$$
(3)
$$\begin{aligned} \varDelta f= \varDelta \alpha (f_F-f_B) \end{aligned}$$
(4)

It can be observed that within a local region of the image there is a linear relationship between the color difference \(\varDelta I\) and \(\textit{F}-\textit{B}\): because \(\textit{F}\) and \(\textit{B}\) are invariant within the local region, \(\textit{F}-\textit{B}\) is a fixed vector. By analogy, we can consider the feature difference \(\varDelta f\) between two positions in a local region to be linear in \(\varDelta \alpha \), with the fixed vector \(f_F-f_B\) of foreground and background features as the coefficient, which yields Eq. (4). The features should be constrained to satisfy this relationship as closely as possible; it embodies the intrinsic meaning of matting, and we believe it helps the network learn the composition in Eq. (1).

2.2 The Design of Loss Function

For a position i, let \(\partial \left\{ i \right\} \) denote the set of points within the \(M_1 \times M_2\) region R, where \(M_1\) and \(M_2\) respectively denote the height and width of R, and pixel i is located at the center of R. The set of ground-truth alpha values around position i is \(\partial \left\{ \alpha _i \right\} =\left\{ \alpha _{i1}, \alpha _{i2}, \alpha _{i3},...,\alpha _{iM_1\times M_2} \right\} \), where each \(\alpha _{ij}\) is a scalar. We compute the differences between \(\alpha _{i}\) and each element of \(\partial \left\{ \alpha _i \right\} \).

$$\begin{aligned} dif\left( \alpha _i,\alpha _{ij} \right) =\alpha _i-\alpha _{ij} \end{aligned}$$
(5)
$$\begin{aligned} sim_\alpha \left( \alpha _i,\alpha _{ij} \right) =1-\left| dif\left( \alpha _i,\alpha _{ij} \right) \right| \end{aligned}$$
(6)
$$\begin{aligned} sim_f\left( f_i,f_{ij} \right) =\varphi (\cos (norm(f_i),norm(f_{ij}))) \end{aligned}$$
(7)
$$\begin{aligned} loss=\sum _{i}^{} \sum _{j}^{} \left| sim_\alpha \left( \alpha _i,\alpha _{ij} \right) - sim_f \left( f_i,f_{ij} \right) \right| \end{aligned}$$
(8)

\(dif\left( \alpha _i,\alpha _{ij} \right) \) represents the difference between the alpha value of the central pixel i and the alpha value at another position j within the region R. To facilitate computation, we map the values into the range 0 to 1 using the \(sim_\alpha \) function: the smaller the difference between \(\alpha _i\) and \(\alpha _{ij}\), the closer \(sim_\alpha \) is to 1. Given a feature map \(X\in R^{H/r\times W/r \times C} \), where r is the downsampling factor, for any point at location i in X we define \(\partial \left\{ f_i \right\} =\left\{ f_{i1}, f_{i2}, f_{i3},...,f_{iM_1\times M_2} \right\} \), where each \(f_{ij}\in R^{1\times 1 \times C} \). To align the resolution of alpha with the feature, the ground-truth alpha is downsampled to obtain \(\partial \left\{ \alpha _i^r\right\} \); each element of \(\partial \left\{ \alpha _i^r\right\} \) corresponds to an element of \(\partial \left\{ f_i\right\} \) by spatial position. Note that our goal is to match the vector difference \(\varDelta f\) against the scalar difference \(\varDelta \alpha \), so a similarity between two features is computed to convert the vector into a scalar. The feature similarity is defined in Eq. (7), where \(norm(f_i)\) denotes the normalization of the vector \(f_i\), \(\cos \) denotes cosine similarity, and \(\varphi \) is a mapping function. The aim is to keep the differences of alpha values and the differences of features between each point and its neighboring points consistent. Hence, the Alpha Local Difference Loss is defined in Eq. (8).
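The following is a minimal PyTorch sketch of Eqs. (5)-(8) for a single feature map, assuming a 3x3 region R and taking \(\varphi \) to be the affine map \((\cos + 1)/2\) into [0, 1]; since \(\varphi \) is not specified above, that choice, along with all tensor names, is our assumption.

import torch
import torch.nn.functional as F

def aldl(feat, alpha_gt, unknown_mask, k=3):
    """feat: (B, C, h, w) decoder feature; alpha_gt: (B, 1, h, w) ground-truth
    alpha downsampled to the feature resolution; unknown_mask: (B, 1, h, w),
    1 where the trimap is unknown at the region center."""
    B, C, h, w = feat.shape
    pad = k // 2
    # Gather the k*k neighborhood of every position (the set in Sect. 2.2).
    a_nb = F.unfold(alpha_gt, k, padding=pad).view(B, k * k, h, w)
    sim_a = 1.0 - (alpha_gt - a_nb).abs()                       # Eqs. (5)-(6)

    f = F.normalize(feat, dim=1)                                # norm(f_i)
    f_nb = F.unfold(f, k, padding=pad).view(B, C, k * k, h, w)
    cos = (f.unsqueeze(2) * f_nb).sum(dim=1)                    # cosine, (B, k*k, h, w)
    sim_f = 0.5 * (cos + 1.0)                                   # assumed phi: [-1, 1] -> [0, 1]

    # Eq. (8): penalize mismatch between alpha similarity and feature similarity,
    # only where the center pixel lies in the trimap's unknown region.
    diff = (sim_a - sim_f).abs() * unknown_mask
    return diff.sum() / (k * k * unknown_mask.sum().clamp(min=1))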

2.3 The Supervisory Position of ALDL

[11] indicates that different layers in a convolutional neural network tend to learn features at different levels: shallow layers learn low-level features such as color and edges, while the last few layers learn task-relevant semantic features. Forcing the features of shallow layers to capture task-related knowledge would interfere with the network's original feature extraction process. Therefore, we only supervise the features output by the decoder. Additionally, our supervision signal is derived from the ground-truth alpha in local regions, so it can be regarded as targeting lower-level semantics; Alpha Local Difference Loss should therefore not be used to supervise very low-resolution features that carry high-level semantics. As shown in Fig. 2, taking MatteFormer [9] as an example, its decoder outputs features at resolutions of 1/32, 1/16, 1/8, 1/4, and 1/2. Supervision is applied only to the decoder features at resolutions of 1/8, 1/4, and 1/2; the full-resolution feature is not supervised in order to reduce computational cost. A sketch of this multi-scale wiring follows.
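This sketch reuses the aldl function from Sect. 2.2; the use of average pooling to downsample the ground-truth alpha and max pooling for the unknown mask is our assumption, as the text only says the alpha is downsampled.

import torch.nn.functional as F

def multiscale_aldl(dec_feats, alpha_gt, unknown_mask, strides=(8, 4, 2)):
    """dec_feats: a dict mapping stride r to the (B, C, H/r, W/r) decoder feature."""
    total = 0.0
    for r in strides:
        # Downsample ground-truth alpha and the unknown mask to the feature
        # resolution before comparing local similarities at that scale.
        a_r = F.avg_pool2d(alpha_gt, kernel_size=r)
        m_r = (F.max_pool2d(unknown_mask, kernel_size=r) > 0).float()
        total = total + aldl(dec_feats[r], a_r, m_r)
    return total / len(strides)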

Fig. 3. Y-axis: the SAD error on AIM-500. X-axis: the correlation coefficient between the difference of alpha and the difference of features.

Table 1. The effectiveness of implementing ALDL

3 Experiments

To validate the effectiveness of the proposed Alpha Local Difference Loss, we perform extensive experiments on multiple matting baselines and benchmark datasets, and assess performance in real-world scenarios to verify generalization capability.

3.1 Datasets and Implementation Details

We train models on the Adobe Image Matting [5] dataset and report performance on the real-world AIM-500 [8], AM-2K [12], and P3M [13] datasets. AIM-500 contains 100 portrait images, 200 animal images, 34 images with transparent objects, 75 plant images, 45 furniture images, 36 toy images, and 10 fruit images. The AM-2K test set comprises 200 animal images across 20 distinct categories. P3M-500-NP contains 500 diverse portrait images that vary in foreground, hair, body contour, posture, and other aspects. These datasets contain many human portrait contours and closely resemble the data used for gait recognition and other biometric recognition tasks. Our implementation is based on PyTorch. No architectural changes are required; we only modify the loss function. The height \(M_1\) and width \(M_2\) of the local region R are both set to 3, and the center of R must lie in the unknown region of the trimap. Since different matting models use distinct loss functions, to clearly illustrate the efficacy of ALDL we directly add ALDL to each model's existing loss. In line with [14], four widely adopted metrics are employed to assess the quality of the predicted alpha matte: the sum of absolute differences (SAD), mean squared error (MSE), gradient error (Grad), and connectivity error (Conn). Four matting baselines are evaluated: GCA Matting [15], MatteFormer [9], VitMatte [16], and AEMatter [17]. GCA uses a guided contextual attention module to propagate opacity information based on low-level features; MatteFormer introduces prior tokens for the propagation of global information; VitMatte builds a robust matting method on ViT [18].
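For reference, here is a hedged sketch of the two simplest metrics, SAD and MSE, computed over the trimap's unknown region; reporting SAD divided by 1000 is a common convention in the matting literature but is our assumption here, and Grad and Conn are omitted because their definitions are more involved.

import torch

def sad(pred: torch.Tensor, gt: torch.Tensor, unknown_mask: torch.Tensor):
    # Sum of absolute differences over the unknown region; the division by
    # 1000 follows a common reporting convention (an assumption).
    return (unknown_mask * (pred - gt).abs()).sum() / 1000.0

def mse(pred: torch.Tensor, gt: torch.Tensor, unknown_mask: torch.Tensor):
    # Mean squared error averaged over the unknown-region pixels.
    return (unknown_mask * (pred - gt) ** 2).sum() / unknown_mask.sum().clamp(min=1)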

3.2 Proof of the Local Similarity Hypothesis Between Alpha and Feature

To validate the hypothesis that local similarity between features and alpha improves image matting, during the inference stage we extract the feature outputs of the intermediate layers. Based on Eqs. (6) and (7), the correlation coefficient between \(sim_\alpha \) and \(sim_f\) is calculated over the points in the unknown region of the trimap. As Fig. 3 shows, the higher the correlation coefficient, the better the matting performance of the method. This indicates that when the features satisfy the local differences defined by the ground-truth alpha, the quality of the matting improves.
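A sketch of this analysis, assuming sim_a and sim_f have already been computed as in the ALDL sketch of Sect. 2.2 (all names are illustrative):

import torch

def sim_correlation(sim_a, sim_f, unknown_mask):
    """Pearson correlation between alpha similarity and feature similarity
    over unknown-region points. sim_a, sim_f: (B, k*k, h, w);
    unknown_mask: (B, 1, h, w)."""
    m = unknown_mask.bool().expand_as(sim_a)
    x = sim_a[m].float()
    y = sim_f[m].float()
    # Center both similarity sets, then take the normalized dot product.
    x = x - x.mean()
    y = y - y.mean()
    return (x * y).sum() / (x.norm() * y.norm()).clamp(min=1e-8)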

3.3 Generalization

ALDL was applied to three different baselines and compared with their counterparts without ALDL, as shown in Table 1. For MatteFormer and VitMatte, ALDL improves generalization on all three datasets, suggesting that constraining the relationships between local features helps the network better understand the matting task. The combination of GCA with ALDL demonstrates its generalization ability particularly on the P3M dataset. GCA incorporates a shallow guidance module to learn feature relationships, but evaluating the quality of those learned relationships is difficult; ALDL, in contrast, explicitly constrains local feature relationships using the ground-truth alpha, which aligns with the objective of GCA's shallow guidance module. Consequently, adding ALDL to GCA yields moderate improvements on the AIM-500 and AM-2K datasets. GCA with ALDL also performs consistently well on the Grad metric, indicating that ALDL excels at capturing intricate details and accurately delineating contours, which is advantageous for downstream tasks built on matting.

Table 2. Ablation experiment of ALDL

3.4 Ablation Study of Deep Supervision

An ablation experiment was conducted using MatteFormer, since its decoder's intermediate outputs are already supervised with the ground-truth; the difference is that ALDL supervises the local differential relationships between features, whereas MatteFormer directly supervises alpha values at the feature level. As shown in Table 2, MatteFormer marked with R1 or R2 denotes removing the structure that originally outputs alpha values from the decoder and instead directly supervising the feature level with ALDL; GCA marked with R1 or R2 denotes applying ALDL to the intermediate-layer features of its decoder. The results demonstrate that applying ALDL to features, a relatively weak constraint, yields better performance than direct supervision with alpha values. Additionally, since ALDL derives local information from the ground-truth, which essentially belongs to low-level features, it is better suited to shallow features than to deep features.

4 Conclusion

This study focuses on the loss functions of deep image matting methods. We analyzed the shortcomings of the loss functions used by existing matting models and proposed the Alpha Local Difference Loss, which starts from the ground-truth alpha matte and the composition formula of image matting to supervise image features. Extensive experiments on several test datasets with state-of-the-art deep image matting methods verify the effectiveness of the proposed ALDL and demonstrate that it improves the generalization ability of deep image matting methods.