1 Introduction

Learning universal sentence representations is a fundamental problem in natural language processing and has been extensively studied [12, 16]. Building on contrastive learning [1, 20], SimCSE [5] provides two simple but strong sentence representation models. (1) Unsupervised SimCSE extracts multi-view features [17] through dropout [15]: a sentence paired with itself forms an anchor-positive pair, while anchor-negative pairs are formed with the other sentences in the batch. The InfoNCE [1] loss pulls positives closer and pushes negatives apart while the model parameters are optimized. Thanks to this dropout-based multi-view learning, unsupervised SimCSE outperforms other unsupervised and even supervised models. (2) Supervised SimCSE further improves performance by using NLI labels as data augmentation: entailment sentence pairs serve as anchor-positive pairs and contradiction sentence pairs as hard-negative pairs. Results show that unsupervised SimCSE exceeds the previous SOTA model IS-BERT [20] by 7.9%, while supervised SimCSE exceeds unsupervised SimCSE by 6.9% and is 4.6% higher than the previous supervised SOTA models SBERT [14] and BERT-whitening [16].

Fig. 1. Sentence-level representation and visual analysis of anisotropy in multiple domains under SBERT, unsupervised SimCSE and CCDC models

Extending supervised SimCSE to multi-domain sentence representation scenarios [24] requires solving two problems. One is hard-negative mining for out-of-domain but semantically similar samples. The other is how to generate pseudo-NLI data from popular Chinese sentence-pair corpora, such as the classification datasets PAWS-X [25] and BQ [2] or the regression dataset STS-B [13].

To solve these two problems, this paper proposes a Chinese-centric Cross Domain Contrastive learning framework (CCDC) that adds two features: (a) domain-augmented Contrastive Learning and (b) a Pseudo-NLI Data Generator. Domain-augmented Contrastive Learning uses out-of-domain but semantically similar sentences as hard negatives, improving cross-domain performance. The Pseudo-NLI Data Generators, which create \(\langle anchor, positive, negative \rangle \) triplets from classification/regression sentence-pair datasets, include a business rule-based Hard NLI Generator and a neural classifier-based Soft NLI Generator.

To better understand the superiority of CCDC, the embedding spaces of three models are mapped: the original BERT base model, the unsupervised SimCSE model, and the CCDC model. The visualization shows that the anisotropy problem is alleviated by both unsupervised SimCSE and CCDC, while the CCDC model additionally exhibits a domain-clustering tendency [22]. Additional singular value experiments are visualized, showing that the domain-enhanced contrastive learning objective “flattens” the singular value distribution of the sentence embedding space, thereby improving uniformity.

2 Related Work

2.1 Contrastive Learning

Contrastive learning aims to learn effective representations by pulling semantically close neighbors together and pushing apart non-neighbors [7]. It assumes a set of paired examples \(D = \{(x_i, x_i^+)\}^m_{i=1}\), where \(x_i\) and \(x_i^+\) are semantically related. Following the contrastive framework in [3], the training objective is a cross-entropy loss with in-batch negatives: let \(h_i\) and \(h_i^+\) denote the representations of \(x_i\) and \(x_i^+\); for a mini-batch with N pairs, the training objective for \((x_i, x_i^+)\) is:

$$\begin{aligned} l_i=-log\frac{e^{sim(h_i, h_i^+)/\tau }}{\sum _{j=1}^Ne^{sim(h_i,h_j^+)/\tau }}, \end{aligned}$$
(1)

where \(\tau \) is a temperature hyperparameter and \(sim(h_1, h_2)\) is the cosine similarity \(\frac{h_1^Th_2}{||h_1||\cdot ||h_2||}\).
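
As a minimal PyTorch sketch (not the authors' implementation), the in-batch objective of Eq. (1) can be written as a cross-entropy over a cosine-similarity matrix; `h` and `h_pos` are assumed to be already-encoded anchor and positive embeddings:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(h, h_pos, tau=0.05):
    """In-batch InfoNCE loss of Eq. (1).

    h, h_pos: (N, d) tensors of anchor and positive sentence embeddings.
    """
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    # sim[i, j] = cosine similarity between anchor i and positive j, scaled by 1/tau
    sim = h @ h_pos.t() / tau
    # the j == i column is the positive; all other columns act as in-batch negatives
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(sim, labels)
```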

2.2 Unsupervised SimCSE

The idea of unsupervised SimCSE is extremely simple: it uses a minimal form of data augmentation in which positive pairs are \((x_i, x_i)\) instead of the traditional \((x_i, x_i^+)\). Unsupervised SimCSE takes exactly the same sentence as the positive pair, and the two embeddings differ only in their dropout masks.

$$\begin{aligned} l_i=-log\frac{e^{sim(h_i^{z_i}, h_i^{z_i'})/\tau }}{\sum _{j=1}^N e^{sim(h_i^{z_i}, h_j^{z_j'})/\tau }}, \end{aligned}$$
(2)

where \(h_i^{z_i}, h_i^{z_i'}\) are representations of the same sentence \(x_i\) under different dropout masks \(z_i, z_i'\).
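
In practice this amounts to encoding the same batch twice with dropout left active, so the two views differ only in their dropout masks. A hedged sketch, assuming a Hugging Face BERT encoder with [CLS] pooling (which may differ from the paper's exact setup); the two outputs can be fed into the InfoNCE loss sketched above:

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")
encoder.train()  # keep dropout active so the two passes differ

def two_views(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    # two forward passes of the *same* batch -> different dropout masks z, z'
    h1 = encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling
    h2 = encoder(**batch).last_hidden_state[:, 0]
    return h1, h2  # e.g. info_nce_loss(h1, h2)
```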

Fig. 2. CCDC framework

2.3 Supervised SimCSE

For supervised SimCSE, a simple hard-negative mining strategy is added that extends \((x_i, x_i^+ )\) to \((x_i, x_i^+, x_i^-)\). Prior work [6] has demonstrated that supervised Natural Language Inference (NLI) datasets [21] are effective for learning sentence embeddings, by predicting the relationship between two sentences as one of three categories: entailment, neutral, or contradiction. Contradiction pairs are taken as hard-negative samples and give the model strong negative signals.

$$\begin{aligned} l_i = -log\frac{e^{sim(h_i, h_i^+)/\tau }}{\sum ^N_{j=1}\left( e^{sim(h_i,h_j^+)/\tau }+e^{sim(h_i,h_j^-)/\tau }\right) }, \end{aligned}$$
(3)

where \((h_i, h_i^+)\) are positive pairs labelled as entailment, while \((h_i, h_i^-)\) are hard-negative pairs labelled as contradiction.
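
Compared with the unsupervised loss, Eq. (3) only adds one extra block of logits for the hard negatives. A hedged PyTorch sketch along the lines of the InfoNCE snippet above (an illustration, not the authors' code):

```python
import torch
import torch.nn.functional as F

def supervised_simcse_loss(h, h_pos, h_neg, tau=0.05):
    """Eq. (3): in-batch negatives plus explicit hard negatives.

    h, h_pos, h_neg: (N, d) embeddings of anchors, entailment positives,
    and contradiction hard negatives.
    """
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    h_neg = F.normalize(h_neg, dim=-1)
    # logits: (N, 2N) -- columns 0..N-1 are all positives in the batch,
    # columns N..2N-1 are all hard negatives in the batch
    logits = torch.cat([h @ h_pos.t(), h @ h_neg.t()], dim=1) / tau
    # the target for anchor i is its own positive, i.e. column i
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(logits, labels)
```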

2.4 Sentence Contrastive Learning with PLMs

The recent success of contrastive learning depends heavily on pre-trained language models (PLMs) such as BERT [18], RoBERTa [11], and ALBERT [9]. Unsupervised/supervised SimCSE [5], PairSupCon [26], IS-BERT [20], BERT-Position [19], and BERT-whitening [16] all build on such pre-trained models to improve training efficiency and results.

Table 1. CCDC samples

3 CCDC Framework

3.1 Cross-Domain Sentences as Hard-Negative Samples

To enhance sentence representations under multi-domain contrastive learning [23, 24], the CCDC framework is designed on top of the supervised SimCSE framework as follows. The pseudo-NLI data format mirrors the SimCSE NLI format: it uses (DT:sen0, DT:sen1, DT:sen2), analogous to the (anchor, positive, negative) triplet, where DT is short for domain tag. Anchor-negative pairs include in-batch, in-domain, and out-of-domain negative examples, highlighted in yellow, orange, and red in Fig. 2. The in-domain and out-of-domain negative examples can be regarded as hard-negative samples, as can be seen in Table 1.
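
To make the data format concrete, the sketch below shows a domain-tagged triplet record and one way to sample an out-of-domain hard negative; the field names and the `corpus_by_domain` mapping are illustrative assumptions, not part of the paper:

```python
import random
from dataclasses import dataclass

@dataclass
class DomainTriplet:
    domain: str    # domain tag DT, e.g. "ATEC", "BQ", "LCQMC", "PAWS-X", "STS-B"
    anchor: str    # DT:sen0
    positive: str  # DT:sen1
    negative: str  # DT:sen2 (in-domain hard negative)

def sample_out_domain_negative(triplet, corpus_by_domain):
    """Pick a sentence from a *different* domain to serve as an out-of-domain negative."""
    other = random.choice([d for d in corpus_by_domain if d != triplet.domain])
    return random.choice(corpus_by_domain[other])
```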

3.2 Hard NLI Data Builder

To construct pseudo-NLI data, \((x, x^+, x^-)\) triplets need to be generated from the three traditional sentence semantic tasks: classification, regression, and NLI. The Hard NLI Data Builder, based on domain rules, is implemented as follows. If \((x_i, x_j)\) is a positive sample of semantic similarity (classification) or its similarity is greater than the average (regression), a sentence \(x_k\) is randomly selected as a negative sample to form the NLI triplet \((x_i, x_j, x_k)\); if \((x_i, x_j)\) is a negative sample or its similarity is less than or equal to the average, the anchor is repeated to form the NLI triplet \((x_i, x_i, x_j)\). The Hard NLI Data Builder process is shown in Fig. 3.

As can be seen, classification/regression data is classified into entailment (positive) and contradiction (negative) categories. The created NLI data can be used to train supervised SimCSE.
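
A minimal sketch of this rule (the threshold convention and field layout are assumptions; the paper's builder may differ in details):

```python
import random

def hard_nli_builder(pairs, scores, threshold):
    """Build (anchor, positive, negative) triplets from scored sentence pairs.

    pairs:     list of (x_i, x_j) sentence pairs
    scores:    binary labels (classification) or similarity scores (regression)
    threshold: 0.5 for binary labels, or the corpus mean for regression scores
    """
    all_sentences = [s for pair in pairs for s in pair]
    triplets = []
    for (x_i, x_j), score in zip(pairs, scores):
        if score > threshold:
            # similar pair -> randomly pick some sentence as the negative
            x_k = random.choice(all_sentences)
            triplets.append((x_i, x_j, x_k))
        else:
            # dissimilar pair -> repeat the anchor as its own positive
            triplets.append((x_i, x_i, x_j))
    return triplets
```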

3.3 Soft NLI Data Builder

In addition to the rule-based Hard NLI Data Builder, a Soft NLI Data Builder can be built on a neural classification model. The entailment/contradiction classifier is structured like a Siamese network, in which two BERT-plus-pooling towers share weights and output sentence embeddings as features. A softmax binary classifier, a simple MLP, operates on the feature triplet \(( f(x), f(x'), |f(x)-f(x')|)\), as can be seen in Fig. 3.

Fig. 3. Diagrams of the Hard NLI data builder (left) and the Soft NLI data builder (right)
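
A hedged PyTorch sketch of such a classifier: a shared-weight encoder produces the two sentence embeddings, and a small linear head scores the concatenated feature triplet. The mean-pooling choice and hidden size are assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

class EntailmentClassifier(nn.Module):
    """Siamese entailment/contradiction classifier for the Soft NLI Data Builder."""

    def __init__(self, encoder, hidden_dim=768):
        super().__init__()
        self.encoder = encoder                    # shared-weight BERT tower
        self.head = nn.Linear(3 * hidden_dim, 2)  # entailment vs. contradiction logits

    def embed(self, inputs):
        out = self.encoder(input_ids=inputs["input_ids"],
                           attention_mask=inputs["attention_mask"])
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        # mean pooling over non-padding tokens
        return (out.last_hidden_state * mask).sum(1) / mask.sum(1)

    def forward(self, inputs_a, inputs_b):
        f_a, f_b = self.embed(inputs_a), self.embed(inputs_b)
        features = torch.cat([f_a, f_b, (f_a - f_b).abs()], dim=-1)
        return self.head(features)  # softmax is applied inside the cross-entropy loss
```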

4 Experiment

4.1 Data Preparation

The multi-domain semantic representation of sentences can be viewed as a different-view augmentation of general semantic representation. Following [16], to verify the multi-view enhancement effect of the CCDC framework, the most widely used Chinese sentence-pair datasets are adopted: the Ant Financial Artificial Competition (ATEC) [4], BQ [2], LCQMC [10], PAWS-X from Google [25], and STS-B [13]. The details of these datasets are shown in Table 2.

Table 2. Chinese sentence pair datasets in 5 domains

4.2 Training Details

SBERT [14] is chosen as the training framework, and tiny, small, base, and large pre-trained models are selected for comparison. The training device is an NVIDIA V100 with 32 GB of GPU memory. The batch size is 128 for the tiny/small/base PLMs and 96 for the large PLM due to insufficient GPU memory; the temperature is 0.05, and an Adam [8] optimizer is used. The learning rate is set to 5e-5 for the tiny/small/base models and 1e-5 for the large models, and warm-up steps account for 10% of the total training steps. As in supervised SimCSE, the models are trained for 3 epochs.
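
Since SBERT is the training framework, the setup can be approximated with the sentence-transformers library. The sketch below mirrors the reported hyperparameters (batch size 128, learning rate 5e-5, 3 epochs, 10% warm-up, and scale \(1/\tau = 20\) for \(\tau = 0.05\)), but it approximates Eq. (3) with MultipleNegativesRankingLoss and is not the authors' exact training script; the backbone name and maximum sequence length are assumptions:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# backbone: BERT + mean pooling, as in the SBERT framework (max length is an assumption)
word_emb = models.Transformer("bert-base-chinese", max_seq_length=64)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# placeholder; real (anchor, positive, hard negative) triplets come from the
# Hard/Soft NLI Data Builders of Sect. 3
triplets = [("anchor sentence", "positive sentence", "hard negative sentence")]
train_examples = [InputExample(texts=[a, p, n]) for a, p, n in triplets]
loader = DataLoader(train_examples, shuffle=True, batch_size=128)

# in-batch negatives + explicit hard negatives; scale = 1/tau = 20
loss = losses.MultipleNegativesRankingLoss(model, scale=20)

num_steps = len(loader) * 3
model.fit(
    train_objectives=[(loader, loss)],
    epochs=3,
    warmup_steps=int(0.1 * num_steps),
    optimizer_params={"lr": 5e-5},
)
```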

4.3 CCDC with One-Domain Training and In-Domain/Out-Domain Testing

As in [16], BERT-base is chosen as the baseline and unsupervised SimCSE as the strong baseline.

ATEC is used as the training/testing data for the in-domain experiments. The ATEC Spearman coefficient reached 46%, a performance improvement of 31% over the baseline and 13% over the strong baseline. The other four in-domain experiments also achieved performance improvements of 25–56% over the baseline and 2–21% over the strong baseline, as can be seen in Table 3.

In the domain-enhanced confusion matrix, all entries are non-negative and most are positive, showing that in-domain CCDC training can also improve cross-domain performance, as can be seen in Table 4.

Table 3. CCDC with one-domain training and in-domain testing
Table 4. CCDC results with one-domain training and out-domain testing
Table 5. CCDC results with all-domain training and the Hard/Soft NLI data builder

4.4 CCDC with the Hard/Soft NLI Data Builder

With the Hard NLI Data Builder, a multi-domain CCDC model is trained. The ATEC, BQ, LCQMC, and PAWS-X domains achieved the best performance of 46%, 66%, 74%, and 33% respectively, with BQ notably surpassing the in-domain SOTA. Only STS-B falls short by 4%, possibly because its training data volume is insufficient. The average performance reached 57%.

To eliminate the impact of data imbalance, a multi-domain model is also trained with the Soft NLI Data Builder [24]. This CCDC model achieved the best performance in all domains, with PAWS-X even outperforming the in-domain SOTA by 2%. The average performance across all domains reached 58%, an improvement of 41% over the baseline and 12% over the strong baseline, as shown in Table 5.

5 Analysis

Based on the analysis of language representations in [5], the traditional pre-trained model baseline has two problems. (1) Due to the anisotropy of the embedding space, the representations eventually collapse into a narrow cone. (2) The eigenvalues decay rapidly, leading to a huge gap between the leading components and the trailing ones.

Empirical Visual Analysis of Anisotropy. Visual analysis was performed on the baseline, strong baseline, and CCDC models. 5,000 data entries were extracted from each test set, giving 50,000 sentences in total, and the sentence vectors of the three models are visualized. As shown in Fig. 1, the original PLM exhibits an obvious narrow-cone phenomenon; unsupervised SimCSE distributes the embeddings evenly in all directions but shows no multi-domain structure. CCDC avoids the narrow-cone phenomenon while also showing multi-domain characteristics: it gives the most appropriate representation of each domain and a degree of differentiation between domains.

Singular Value Decay. In the singular value analysis (Fig. 4), the large gap between the head singular value and the remaining singular values of the traditional PLM is clearly narrowed by unsupervised SimCSE, while the CCDC model supports sentence representations of multiple domains and still maintains a uniform singular value distribution. In terms of the homogeneity of the singular value distribution, CCDC is comparable to unsupervised SimCSE.

Fig. 4. Singular value analysis on BERT, SimCSE, and CCDC (one-domain/all-domain)
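
The singular value analysis itself is straightforward to reproduce: stack the sentence embeddings into a matrix and inspect its normalized singular value spectrum. A short sketch, assuming `embeddings` is an (n_sentences, dim) NumPy array of encoder outputs:

```python
import numpy as np

def singular_spectrum(embeddings):
    """Normalized singular values of the centered sentence embedding matrix."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    return s / s.max()  # a flatter curve indicates a more isotropic space
```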

6 Conclusion

This paper proposes CCDC, a multi-domain enhanced contrastive learning sentence embedding framework, which uses a pseudo-NLI data generator to obtain a multi-domain sentence representation model that significantly outperforms the baseline model (BERT) and the strong baseline model (unsupervised SimCSE). Deeper analysis shows that the CCDC framework addresses the anisotropy and eigenvalue attenuation problems well.

Future research will focus on knowledge mining in contrastive learning. As shown in the preceding analysis, CCDC performance remains below 50% on PAWS-X. Hard-negative mining or multi-view mining based on knowledge graphs or terminology knowledge [24] should help on hard cases such as “Scotland to England” versus “England to Scotland” in PAWS-X.