
1 Introduction

Knowledge Graphs are a significant component of enterprise data infrastructure and a core element of upper-layer applications [1]. In January 2022, Alibaba released AliOpenKG, the first open Knowledge Graph for digital commerce. Creating relationships between e-commerce products is a crucial step in building the Knowledge Graph for digital commerce. However, because merchants publish product information in personalized ways, product data are inadequately standardized and structured, and different product categories have their own unique and important attributes, making it challenging to align fine-grained similar products.

Existing techniques [2] for fine-grained product alignment are mostly based on representation learning, owing to the large number of products on e-commerce platforms. Specifically, an item representation is obtained by encoding the item's unstructured and structured information, and identical items are then found via vector retrieval. However, the identical product mining task at hand treats product alignment as a binary classification over product pairs, for which vector-based retrieval is unnecessarily complicated. We therefore employ contrastive learning over sample pairs. We first create separate textual and visual representations of a product via representation learning, then concatenate the two to form the final product representation. Finally, we utilize CoSENT to gradually refine the product representation. For text representation learning, since conventional methods cannot highlight local continuous token information, we propose the K-Gram Exponential Decay scheme, inspired by N-grams, to capture and aggregate information from surrounding continuous tokens and thereby refine the text representation. Additionally, inspired by Circle Loss [3] and Curricular Loss [4], we further improve CoSENT into Circle-CoSENT and Curricular-CoSENT to promote contrastive learning between sample pairs.

In summary, this paper makes the following contributions:

  • We propose the K-Gram Exponential Decay scheme to refine text representation.

  • We apply CoSENT for contrastive learning of sample pairs and further improve it to create Circle-CoSENT and Curricular-CoSENT.

  • We adopt model ensemble for multimodal representation learning. It contains two sub-models for image representation and two for text representation.

2 Related Works

2.1 Product Matching

Product matching is generally based on representation learning. Tracz et al. [5] proposed a category-hard batch construction strategy and applied the triplet loss for product matching. Li et al. [6] utilized product titles and attributes to match products across platforms. Li et al. [7] proposed the Path-based Deep Network, which combines diversity and personalization to enhance matching performance. Peeters et al. [8] applied supervised contrastive learning to product matching.

2.2 Multimodal Representation Learning

Existing methods for multimodal information fusion generally rely on simple operations (e.g., concatenation, weighted summation) or attention-based methods; we use concatenation for fusion. Bi et al. [9] proposed encoding three types of news textual information (title, topic category, and entities) separately and obtaining the news embedding via an attention mechanism. Yu et al. [10] applied a cross-modal attention mechanism to obtain image-aware text representations and text-aware image representations, which are then concatenated for multimodal interaction.

3 Methodology

3.1 Text Representation Module

In our text module, we choose RoBERTa [11] as the encoder and feed the final-layer embeddings into the following two sub-modules.

Conventional Method. First, we take the last-layer hidden states output by RoBERTa and apply average pooling. Then, we use dropout to enhance robustness. Finally, the text representation is obtained through an MLP layer.
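As a concrete illustration, the following minimal PyTorch sketch implements this conventional branch (mean pooling, dropout, MLP). The checkpoint name, dropout rate, and class name are illustrative assumptions; the 128-dimensional output matches the RoBERTa-Base member of the ensemble in Sect. 3.4.

```python
import torch.nn as nn
from transformers import AutoModel

class ConventionalTextHead(nn.Module):
    """Mean-pool RoBERTa's last hidden states, apply dropout, then an MLP."""

    def __init__(self, encoder_name="hfl/chinese-roberta-wwm-ext", out_dim=128, dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # checkpoint name is an assumption
        self.dropout = nn.Dropout(dropout)
        self.mlp = nn.Linear(self.encoder.config.hidden_size, out_dim)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        # Average pooling over valid (non-padding) tokens only.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.mlp(self.dropout(pooled))
```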

Fig. 1. Text representation module with K-Gram Exponential Decay and CNN

K-Gram Exponential Decay. Text representation can be viewed at the level of token representations, word representations, phrase representations, and so on. The conventional method only highlights the global token information captured by the attention mechanism, not the local continuous token information that is crucial for forming word and phrase representations. Observing that related words generally occur consecutively, we draw inspiration from N-grams and propose K-Gram Exponential Decay, a sliding-window-like mechanism that captures phrase expressions spanning K consecutive tokens. We apply the K-Gram Exponential Decay scheme to the token embeddings output by RoBERTa to obtain multi-channel embeddings, which are then enriched with hidden information extracted by a CNN [12]. Finally, the resulting embeddings are concatenated with the embedding from the average pooling module and passed through an MLP layer to obtain the final text representation of the product. The overall framework of the module is shown in Fig. 1. The K-Gram Exponential Decay scheme is described below.

Fig. 2. K-Gram Exponential Decay

The K-Gram Exponential Decay scheme refines word embeddings by aggregating information from the surrounding token embeddings. To reduce running time, we parallelize the computation with a circular shift operation. In addition, inspired by the discount factor of Markov decision processes in reinforcement learning, we exponentially decay the weights of the token embeddings within the window, so that each token's contribution reflects its relative position. The exponential decay weight \(\alpha \) is a hyper-parameter, which we fix to 0.8 in our experiments. Since token embedding fusion is directional, we consider the forward K-Gram, the backward K-Gram, and their combination as options for downstream processing. The three forms are illustrated in Fig. 2, and the formulas are as follows:

Forward K-Gram:

$$\begin{aligned} w^k_i = \sum _{j=0}^{k} e_{i-j} \times \alpha ^j \end{aligned}$$
(1)

Backward K-Gram:

$$\begin{aligned} w^k_i = \sum _{j=0}^{k} e_{i+j} \times \alpha ^j \end{aligned}$$
(2)

Bidirectional K-Gram:

$$\begin{aligned} w^k_i = \sum _{j=0}^{k} e_{i-j} \times \alpha ^j + \sum _{j=1}^{k} e_{i+j} \times \alpha ^j \end{aligned}$$
(3)

where \(w^k_i\) denotes the word embedding obtained after K-Gram Exponential Decay, e denotes the token embedding output by the text encoder, and j is the relative distance; the greater the relative distance, the lower the weight.
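A minimal PyTorch sketch of the scheme, assuming the circular shift is realized with torch.roll; the window size k is an assumption (not fixed here), and α defaults to 0.8 as stated above.

```python
import torch

def k_gram_exponential_decay(token_emb, k=3, alpha=0.8, mode="bidirectional"):
    """K-Gram Exponential Decay over token embeddings (Eqs. 1-3).

    token_emb: (batch, seq_len, dim) tensor output by the text encoder.
    Circular shifts aggregate the k surrounding tokens in parallel.
    """
    forward = torch.zeros_like(token_emb)
    backward = torch.zeros_like(token_emb)
    for j in range(k + 1):
        weight = alpha ** j
        # Rolling by +j places e_{i-j} at position i; rolling by -j places e_{i+j} there.
        forward = forward + weight * torch.roll(token_emb, shifts=j, dims=1)
        backward = backward + weight * torch.roll(token_emb, shifts=-j, dims=1)
    if mode == "forward":
        return forward
    if mode == "backward":
        return backward
    # Bidirectional form (Eq. 3): forward part plus the strictly-right neighbours,
    # so the centre token e_i is counted only once.
    return forward + (backward - token_emb)
```

The outputs of the three modes can, for example, be stacked as the channels of the multi-channel embedding fed to the CNN described above.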

3.2 Image Representation Module

We employ Swin-Transformer [13] as the image encoder, given that Transformer architectures are now widely used in computer vision. An increasing body of research contends that Swin-Transformer, which inherits the hierarchical receptive fields of CNNs, may be an ideal replacement for them. Specifically, Swin-Transformer is divided into four stages; each stage shrinks the input feature map and enlarges the receptive field. Each stage consists of a Patch Merging module and Swin-Transformer Blocks. Patch Merging downsamples the image, similar to the pooling layer in a CNN. A Swin-Transformer Block consists of Window Multi-Head Self-Attention, Shifted-Window Multi-Head Self-Attention, Layer Norm, an MLP, and residual connections.

The image is passed through the Swin-Transformer and an MLP to obtain the final image embedding, which is then used to compute product similarity.
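A sketch of this image branch using the timm library; the checkpoint name and the 256-dimensional output (one of the two ensemble members in Sect. 3.4) are illustrative.

```python
import timm
import torch.nn as nn

class ImageHead(nn.Module):
    """Swin-Transformer backbone followed by an MLP projection."""

    def __init__(self, backbone="swin_large_patch4_window7_224", out_dim=256):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of classification logits.
        self.backbone = timm.create_model(backbone, pretrained=True, num_classes=0)
        self.mlp = nn.Linear(self.backbone.num_features, out_dim)

    def forward(self, images):
        return self.mlp(self.backbone(images))
```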

3.3 Contrastive Learning Objective

Since identical and different products always occur in pairs in this product mining task, we apply Cosine Sentence (CoSENT) to explicitly distinguish items. Inspired by Circle Loss, we add weights and margins to CoSENT to emphasize difficult pairs and push them apart. Inspired by Curricular Loss, we gradually increase the weight of difficult sample pairs during training, so that the model progressively focuses on them. CoSENT, Circle-CoSENT, and Curricular-CoSENT are introduced in turn below.

CoSENT. The essence of CoSENT is contrastive learning based on sample pairs; its loss function is as follows:

$$\begin{aligned} \mathcal L = \log \left( 1+\sum _{(i,j)\in \varOmega _{pos}, (u,v)\in \varOmega _{neg} } e^{\lambda \bigl (cos(e_u, e_v)-cos(e_i, e_j) \bigr )}\right) \end{aligned}$$
(4)

where \(\varOmega _{pos}\) and \(\varOmega _{neg}\) are the sets of positive and negative sample pairs, respectively, \(e_u\) is the representation of product u, and \(\lambda \) is a hyper-parameter set to 20 in our experiments.

The optimization goal of CoSENT is to increase the cosine similarity of positive sample pairs while decreasing that of negative sample pairs. By subtracting the similarity of a positive pair from that of a negative pair, it enlarges the gap between positive and negative pairs. A benefit of CoSENT is that the threshold for deciding whether a pair is positive or negative does not need to be predetermined.
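A minimal PyTorch sketch of Eq. (4) for a batch of labeled product pairs; the logsumexp over all (negative, positive) combinations, together with an extra zero term, realizes the log(1 + Σ exp(·)) form.

```python
import torch
import torch.nn.functional as F

def cosent_loss(emb_a, emb_b, labels, lam=20.0):
    """CoSENT loss (Eq. 4). labels: 1 for identical pairs, 0 for different pairs."""
    sims = lam * F.cosine_similarity(emb_a, emb_b, dim=-1)      # (batch,)
    # diff[i, u] = lam * (cos of negative pair u - cos of positive pair i)
    diff = sims[None, :] - sims[:, None]
    # Keep only combinations where row i is a positive pair and column u a negative pair.
    mask = (labels[:, None] > labels[None, :]).float()
    diff = diff - (1 - mask) * 1e12
    zero = torch.zeros(1, device=diff.device)
    return torch.logsumexp(torch.cat([zero, diff.flatten()]), dim=0)
```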

Circle-CoSENT. We add weights and margins to CoSENT; the loss function of Circle-CoSENT is as follows:

$$\begin{aligned} \mathcal L = \log \left( 1+\sum _{(i,j)\in \varOmega _{pos}, (u,v)\in \varOmega _{neg} } e^{\lambda \Bigl (\omega _{neg} \bigl ( cos(e_u, e_v) + m_{neg} \bigr ) - \omega _{pos} \bigl (cos(e_i, e_j) - m_{pos} \bigr ) \Bigr )}\right) \end{aligned}$$
(5)
$$\begin{aligned} \omega _{neg} = \frac{cos(e_u, e_v) + 1}{2}, \omega _{pos} = 1 - \frac{cos(e_i, e_j) + 1}{2} \end{aligned}$$
(6)

where \(\omega _{pos}, \omega _{neg}, m_{pos}, m_{neg}\) are the positive-pair weight, negative-pair weight, positive-pair margin, and negative-pair margin, respectively. \(\omega _{neg}\) and \(\omega _{pos}\) reflect pair difficulty: a negative pair is more difficult the closer its similarity is to 1, and a positive pair is more difficult the closer its similarity is to 0, and Eq. (6) assigns such pairs larger weights. Furthermore, we require a positive sample pair to be predicted correctly even after \(m_{pos}\) is subtracted, and a negative sample pair to be predicted correctly even after \(m_{neg}\) is added, further separating positive from negative pairs.
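Building on the CoSENT sketch above, Circle-CoSENT only changes the per-pair scores before the pairwise differences are taken; the margin values below are illustrative, as they are not reported here.

```python
import torch
import torch.nn.functional as F

def circle_cosent_loss(emb_a, emb_b, labels, lam=20.0, m_pos=0.1, m_neg=0.1):
    """Circle-CoSENT (Eqs. 5-6): weighted, margin-shifted pair scores."""
    cos = F.cosine_similarity(emb_a, emb_b, dim=-1)
    w_neg = (cos + 1) / 2            # larger for harder negatives (similarity near 1)
    w_pos = 1 - (cos + 1) / 2        # larger for harder positives (similarity far from 1)
    neg_score = w_neg * (cos + m_neg)    # used when the pair is negative
    pos_score = w_pos * (cos - m_pos)    # used when the pair is positive
    diff = lam * (neg_score[None, :] - pos_score[:, None])
    mask = (labels[:, None] > labels[None, :]).float()
    diff = diff - (1 - mask) * 1e12
    zero = torch.zeros(1, device=diff.device)
    return torch.logsumexp(torch.cat([zero, diff.flatten()]), dim=0)
```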

Curricular-CoSENT. We compel CoSENT to master the easy sample pairs before moving on to the challenging ones; the loss function of Curricular-CoSENT is as follows:

$$\begin{aligned} \mathcal L = \log \left( 1+\sum _{(i,j)\in \varOmega _{pos}, (u,v)\in \varOmega _{neg} } e^{\lambda \Bigl (f \bigl (cos(e_u, e_v) \bigr ) - f \bigl (cos(e_i, e_j) \bigr ) \Bigr )}\right) \end{aligned}$$
(7)
$$\begin{aligned} f \bigl (cos(\cdot ,\cdot ) \bigr ) = \begin{cases} cos(\cdot ,\cdot ), & \text {if } (\cdot ,\cdot ) \text { is an easy sample pair} \\ cos(\cdot ,\cdot ) \bigl (t+cos(\cdot ,\cdot ) \bigr ), & \text {if } (\cdot ,\cdot ) \text { is a hard sample pair} \end{cases} \end{aligned}$$
(8)

where t grows gradually from 0 to 1 during training. Negative sample pairs whose similarity exceeds a particular threshold and positive sample pairs whose similarity falls below a particular threshold are treated as hard sample pairs.
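Following the same pairing scheme, a sketch of Curricular-CoSENT; the hard-pair threshold is an assumption, and t is expected to be scheduled from 0 to 1 by the training loop.

```python
import torch
import torch.nn.functional as F

def curricular_cosent_loss(emb_a, emb_b, labels, t, lam=20.0, threshold=0.5):
    """Curricular-CoSENT (Eqs. 7-8). t grows from 0 to 1 during training."""
    cos = F.cosine_similarity(emb_a, emb_b, dim=-1)
    is_pos = labels.bool()
    # Hard positives lie below the threshold; hard negatives lie above it.
    hard = torch.where(is_pos, cos < threshold, cos > threshold)
    f_cos = torch.where(hard, cos * (t + cos), cos)              # Eq. (8)
    diff = lam * (f_cos[None, :] - f_cos[:, None])
    mask = (labels[:, None] > labels[None, :]).float()
    diff = diff - (1 - mask) * 1e12
    zero = torch.zeros(1, device=diff.device)
    return torch.logsumexp(torch.cat([zero, diff.flatten()]), dim=0)
```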

Fig. 3. Model ensemble

3.4 Model Ensemble

We ensemble four models: RoBERTa-Base producing a text representation in \(\mathbb R^{128}\), RoBERTa-Large producing a text representation in \(\mathbb R^{128}\), a Swin-Transformer producing an image representation in \(\mathbb R^{256}\), and a Swin-Transformer producing an image representation in \(\mathbb R^{512}\). The two text representations are concatenated, the two image representations are concatenated, and the resulting text representation is then concatenated with the resulting image representation. Note that normalization is required before each concatenation. The model ensemble is shown in Fig. 3.
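A sketch of the concatenation with per-step L2 normalization; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def ensemble_representation(text_base, text_large, img_256, img_512):
    """Fuse the four sub-model outputs (128 + 128 and 256 + 512 dimensions)."""
    text = torch.cat([F.normalize(text_base, dim=-1), F.normalize(text_large, dim=-1)], dim=-1)
    image = torch.cat([F.normalize(img_256, dim=-1), F.normalize(img_512, dim=-1)], dim=-1)
    # Normalize again before the final text-image concatenation.
    return torch.cat([F.normalize(text, dim=-1), F.normalize(image, dim=-1)], dim=-1)
```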

4 Experiments

4.1 Dataset

We conduct experiments on the CCKS2022 identical product mining competition dataset. The training set contains 71,452 product records and 57,741 labeled product pairs, the validation set contains 16,876 product records and 20,707 unlabeled product pairs, and the test set contains 17,132 product records and 15,909 unlabeled product pairs. Each product record contains ten features: id, industry_name, cate_name, cate_id, cate_name_path, cate_id_path, image_name, title, item_pvs, and sku_pvs.

Furthermore, we split the training set into local-train, local-valid, and local-test sets in an 8:1:1 ratio.

4.2 Data Pre-processing

We primarily preprocess the item_pvs feature of the product records. First, we remove redundant and overlength values. Then, we copy the brand, item number, and model number out of item_pvs as new features. Finally, we remove some symbols to shorten the text.
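Purely as an illustration, the sketch below assumes a hypothetical "key:value;key:value" serialization of item_pvs and hypothetical Chinese attribute keys for brand, item number, and model number; the actual format, key names, length limit, and symbol set depend on the dataset.

```python
import re

def preprocess_item_pvs(item_pvs, max_value_len=30):
    """Illustrative cleaning of item_pvs (format and key names are assumptions)."""
    features, seen = {}, set()
    for pair in item_pvs.split(";"):
        if ":" not in pair:
            continue
        key, value = (s.strip() for s in pair.split(":", 1))
        if value in seen or len(value) > max_value_len:   # drop redundant / overlength values
            continue
        seen.add(value)
        features[key] = value
    # Copy brand, item number, and model number out as separate features (hypothetical keys).
    copied = {k: features[k] for k in ("品牌", "货号", "型号") if k in features}
    # Remove some symbols to shorten the text.
    cleaned = re.sub(r"[【】()（）\[\]{}<>#*]", "", ";".join(f"{k}:{v}" for k, v in features.items()))
    return cleaned, copied
```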

4.3 Experimental Setup

We choose RoBERTa-Base, RoBERTa-Large, and Swin-Transformer-Large as our pre-trained models. For the text module, we set the learning rate to 2e-5, the batch size to 128, the number of epochs to 30, the maximum sequence length to 256, and the similarity threshold to 0.8. For the image module, the learning rate is 1e-5, the batch size is 32, the number of epochs is 50, and the threshold is 0.76.

In addition, while training the model, we only unfreeze the last three layers of RoBERTa and the last two layers of Swin-Transformer.
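A sketch of the partial unfreezing for the text encoder, assuming the HuggingFace RoBERTa layout (encoder.layer); the analogous operation applies to the last two Swin-Transformer stages.

```python
def unfreeze_last_layers(roberta, num_layers=3):
    """Freeze RoBERTa except its last `num_layers` transformer layers."""
    for param in roberta.parameters():
        param.requires_grad = False
    for layer in roberta.encoder.layer[-num_layers:]:
        for param in layer.parameters():
            param.requires_grad = True
```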

4.4 Post-processing

We apply additional rule-based threshold adjustments when concatenating the final text and image representations. For example, if the brand and model number of two products differ and the image similarity is low, we may raise their decision threshold; if two products share the same brand and have a high image or text similarity, we lower the threshold.
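A hedged sketch of such rules; the adjustment size and the high-similarity cut-off are illustrative values, not those used in the competition system.

```python
def adjust_threshold(base_threshold, same_brand, same_model, image_sim, text_sim,
                     delta=0.05, high_sim=0.85):
    """Rule-based threshold adjustment for a candidate product pair."""
    threshold = base_threshold
    if not same_brand and not same_model and image_sim < base_threshold:
        # Differing brand/model and low image similarity: demand more evidence.
        threshold += delta
    elif same_brand and max(image_sim, text_sim) > high_sim:
        # Same brand and a highly similar modality: relax the threshold.
        threshold -= delta
    return threshold
```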

4.5 Experimental Results

The main results on the local-valid data are shown in Table 1. As the table shows, Circle-CoSENT, Curricular-CoSENT, and K-Gram Exponential Decay (KGED) each bring some improvement on the local-valid data. The image module performs best on the local-valid data, with an F1 score of 90.21%.

Table 1. The main results on local-valid data.

The results of our methodology on the validation data are shown in Table 2. We find that Circle-CoSENT degrades somewhat on the validation data, which we conjecture is due to the difference between the two datasets; therefore, we only use the ordinary CoSENT in subsequent experiments. The text model ensemble and image model ensemble denote the ensemble of RoBERTa-Base and RoBERTa-Large and the ensemble of the two Swin-Transformer-Large models, respectively. The text model ensemble (KGED) outperforms the text model ensemble, which demonstrates the effectiveness and robustness of KGED. Additionally, the full model ensemble improves results significantly, indicating a high level of complementarity between the two modalities and further demonstrating the effectiveness of our concatenation-based multimodal fusion in this task. Given the reduced effectiveness of the model ensemble (KGED) with post-processing, we infer that the current image model ensemble complements the text model ensemble better than it complements the text model ensemble (KGED). The combination of model ensemble and post-processing achieves the best overall result, with an F1 score of 90.57%.

Table 2. The main results on valid data.

5 Conclusion

In this paper, we apply multimodal representation learning to product matching. For the text representation module, we propose the K-Gram Exponential Decay scheme with a CNN to refine text representation. To enlarge the distance between matched and unmatched product pairs, we also utilize sample pair-based contrastive learning. Finally, we combine the two text representation modules with the two image representation modules to lower the variance of the individual models and enhance the product representation. The experimental results show that our model performs strongly, with an F1 score of 90.57% on the validation set and 90.77% on the test set, ranking first in the CCKS2022 identical product mining competition. Our long-term goal is to improve the K-Gram Exponential Decay in the text representation module so that, combined with the image representation module, it can better represent products.