
1 Introduction

Recent years have witnessed remarkable progress in eXtreme Multi-label Text Classification (XMTC), with a variety of approaches presented in the literature and applied in real-world scenarios, such as dynamic search advertising [21] and query recommendation [8].

Different from classical multi-label problems, in XMTC only a few labels are head labels with sufficient positive training data; due to the large label dimensionality, most labels are tail labels with few positive training instances [18, 19, 23]. This data sparsity issue leads to insufficient feature learning for tail labels and hurts prediction performance on the overwhelming number of tail labels.

To address this problem, most existing XMTC methods [3, 9, 15, 18, 20, 21, 23] take advantage of label clusters obtained at an early stage to balance performance on tail labels and head labels. The main motivation is that the semantics of head labels is easy to recognize in the semantic space thanks to sufficient training data, while the semantics of tail labels is vague; the precise semantics of tail labels can be learned from head labels that appear in the same cluster. However, existing label clusters are all pre-defined global category patterns built on fixed label features. The static and coarse-grained semantic scope provided by such label clusters is not always consistent with dynamic real-world semantic scenarios, where the content of different texts has different semantic granularity. Previous models build a label hierarchy for a single field, so when a user interested in topics that overlap that field and others enters a query into a search engine, the pre-built label clusters only return keywords for a single domain. We therefore consider developing a dynamic semantic scope, in the form of fine-grained teacher knowledge, to improve the accuracy of tail label predictions and alleviate the data sparsity issue. We introduce text relevance to increase the exposure of tail labels and implement a dynamic label cluster structure to personalise relevant label subsets. In detail, for a given instance, we use the relevant labels of its neighbouring texts to link more rare labels. We assume that if a text is related to a label, then the text is also related to that label's parent label. With the help of hierarchical label information, teacher knowledge is modeled to provide a dynamic and fine-grained semantic scope that enriches text semantics.

In summary, we propose TReaderXML, a novel framework for XMTC that contains a dual cooperative network based on a multi-head self-attention mechanism to embed both guidance knowledge and text into a shared semantic space for feature interaction, effectively improving the effect of teacher knowledge. The remainder of the paper is organized as follows. In Sect. 2, we review recent related work. Section 3 introduces TReaderXML. In Sect. 4, experimental results on three XMTC benchmark datasets are presented. Section 5 concludes this work.

2 Related Work

Many methods have been proposed to address the data sparsity issue of XMTC. They can be categorized into two types: 1) flat-based label clusters [3, 23]; 2) tree-based label clusters [9, 15, 17, 18, 20, 21]. Tree-based label clusters further divide into loss-function-based and structure-based methods.

Among flat-based label cluster methods, SLEEC [3] uses text features for clustering: a new text is projected into the corresponding clusters, and its labels are obtained by K-Nearest Neighbor search to alleviate the data sparsity issue. Building on SLEEC, AnnexML [23] clusters label features with graph embedding to improve the quality of the clusters. Among tree-based label cluster methods, the loss-function-based FastXML [21] learns an ensemble of trees that clusters the label space by optimizing a normalized Discounted Cumulative Gain (nDCG) loss, and PfastreXML [9] replaces the nDCG loss in FastXML with its propensity-scored variant, which assigns higher rewards to tail label predictions. For structure-based methods, Parabel [15] generates a label tree by recursively clustering labels into two balanced groups to address the data sparsity issue. However, Parabel's clustering depth is large, which leads to error cascade problems and hurts tail label predictions. Bonsai [18] uses shallow and diverse probabilistic label trees (PLTs) by removing the balance constraint in Parabel's tree construction, which improves tail label predictions. This tree-structure-based label cluster optimization is also applied in AttentionXML [17], which optimizes the structure of PLTs to obtain shallow and wide clusters and improves tail label predictions.

These label cluster methods provide a static and coarse-grained semantic scope for every text, which is not always consistent with dynamic real-world semantic scenarios and reduces the precision of the prior knowledge.

3 Methodology

3.1 Notation

Given a training set \(\{(x_{i}, y_{i})\}_{i=1}^{N}\), \(x_i\) is a text input sequence and \(y_i\in \{0,1\}^{L}\) is the label of \(x_i\), represented as an L-dimensional multi-hot vector. Each dimension of \(y_i\) corresponds to a label, with \(y_{ij}=1\) when the j-th label \(L_{y_{ij}}\) is associated with \(x_i\). In this paper, we introduce teacher knowledge into the training set as \(\{(x_{i}, y_{i}, y_{i}^{\prime })\}_{i=1}^{N}\), where \(y_{i}^{\prime }\) represents the teacher knowledge corresponding to the text \(x_i\).

3.2 TReaderXML

TReaderXML adopts a dynamic and fine-grained semantic scope from teacher knowledge for each individual text to optimize the text's prior category semantic range. Before teacher knowledge helps read text semantics, we need a powerful feature extractor to obtain high-dimensional features of the semantic scope from teacher knowledge and of the text respectively, and to embed both into a shared semantic space. The high-dimensional semantics of the scope, carrying prior knowledge, then helps read the high-dimensional semantics of the text. Based on these motivations, we design four layers: 1) Encoding, 2) Reading, 3) Interaction and 4) Predicting. A dual cooperative network contains the Reading and Interaction layers, and Fig. 1 shows the framework of TReaderXML.

Encoding. In this part, we design a representation structure to obtain the fine-grained semantic scope extended by teacher knowledge, yielding the matrices \(E_{y_i'}\) and \(E_{x_i}\). Given a training text \({x_{i}}\), its vectorization is computed as follows:

$$\begin{aligned} V_{x_{i}}=\frac{\sum _{c=1}^{\text {Len}(x_{i})} {\text {Encode}}(x_{ic})}{\text {Len}(x_{i})}. \end{aligned}$$
(1)
Fig. 1. An overview of TReaderXML.

We traverse each \({x_{z}}\) in the training and validation sets to find the nearest neighbours of \({x_{i}}\) using cosine similarity:

$$\begin{aligned} score_{cos}(V_{x_{i}},V_{x_{z}})=\frac{V_{x_{i}} \cdot V_{x_{z}}}{\parallel V_{x_{i}}\parallel \parallel V_{x_{z}}\parallel }, \end{aligned}$$
(2)

and return the top k nearest neighbours \(x_{i}^{nearest}\). To get the semantic scope, Gargiulo [7] uses all ancestor labels of the text labels \(y_i\) in a label tree to introduce hierarchical label information, effectively exploiting the structural relations among labels. However, this leads to error cascade problems [18] due to the deep hierarchical structure. Moreover, the semantics of deep hierarchical labels is often abstract, which reduces the precision of the prior knowledge. Inspired by these observations, we only use the parent labels of the labels \(y_i\) in the hierarchical label tree to introduce hierarchical label information. With the advantage of low error and high precision of this hierarchical label information, teacher knowledge is modeled to provide a dynamic and fine-grained semantic scope that helps read text semantics. As shown in Algorithm 1 and in the sketch below, we first find the most relevant labels of \(x_{i}^{nearest}\) and their non-empty parent labels, and then put them into the label subset \(SET^{nearest}\). Each label's description can be generated with the tricks widely used in Parabel [15]. To keep the input sequence consistent with the semantic scope of teacher knowledge, we also initialize an embedding for the input sequence \(x_i\): \(E_{x_i}=\text {Encode}(x_i)\).
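To make the Encoding step concrete, the following minimal sketch illustrates Eqs. (1)-(2) and the spirit of Algorithm 1 under simplifying assumptions: `word_vecs`, `labels_of` and `parent_of` are hypothetical stand-ins for the word encoder, the training-label lookup and the label hierarchy, and are not part of the original implementation.

```python
# Minimal sketch of teacher-knowledge construction (Eqs. (1)-(2), Algorithm 1).
# `word_vecs`, `labels_of`, `parent_of` are hypothetical stand-ins.
import numpy as np

def vectorize(text_tokens, word_vecs):
    """Eq. (1): average the encodings of all tokens in the text."""
    return np.mean([word_vecs[t] for t in text_tokens], axis=0)

def cosine(u, v):
    """Eq. (2): cosine similarity between two text vectors."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def teacher_knowledge(x_i, corpus, word_vecs, labels_of, parent_of, k=5):
    """Collect the label subset SET^nearest for one text x_i."""
    v_i = vectorize(x_i, word_vecs)
    # rank all other texts by cosine similarity and keep the top-k neighbours
    scores = [(cosine(v_i, vectorize(x_z, word_vecs)), z)
              for z, x_z in enumerate(corpus)]
    nearest = [z for _, z in sorted(scores, reverse=True)[:k]]

    label_subset = set()
    for z in nearest:
        for label in labels_of[z]:          # labels of the neighbouring text
            label_subset.add(label)
            parent = parent_of.get(label)   # only the direct parent, not all ancestors
            if parent is not None:
                label_subset.add(parent)
    return label_subset
```

In the full model, the collected subset \(SET^{nearest}\), together with the label descriptions, forms the teacher knowledge \(y_i'\) that is then encoded into \(E_{y_i'}\).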

Reading. In this part, we design the Reading structure to obtain high-dimensional features of the semantic scope from teacher knowledge and from the text respectively, and to embed both into a shared semantic space in preparation for feature interaction. This component of the dual cooperative network plays a key role and includes a masked multi-head self-attention (MMHSA) layer, a multi-head self-attention (MHSA) layer and a residual network.

To obtain high-dimensional features of the semantic scope from teacher knowledge and from the text respectively, we design the Reading structure based on the self-attention mechanism [2]; it contains an MMHSA layer, an MHSA layer and a residual layer. MMHSA masks future sequence information and relies on the already observed sequence to predict the next word. We use MMHSA as the first layer of Reading because its masking lets it capture more fine-grained semantic information. MHSA makes each word attend to the semantic information of the other words in the input sequence, so we use MHSA as the second layer of Reading to capture the overall semantic information. The processing formula of the masking in MMHSA is shown as follows:

$$\begin{aligned}&d_{k}=d_{\text {model}} / h, \end{aligned}$$
(3)
$$\begin{aligned}&Q_{y_i'}=E_{y_i'}W_{y_i'}^{q},\text { } Q_{x_i}=E_{x_i}W_{x_i}^{q}, \end{aligned}$$
(4)
$$\begin{aligned}&K_{y_i'}=E_{y_i'}W_{y_i'}^{k},\text { } K_{x_i}=E_{x_i}W_{x_i}^{k}, \end{aligned}$$
(5)
$$\begin{aligned}&V_{y_i'}=E_{y_i'}W_{y_i'}^{v},\text { } V_{x_i}=E_{x_i}W_{x_i}^{v}, \end{aligned}$$
(6)
$$\begin{aligned}&Score_{y_i'}=\frac{Q_{y_i'} \cdot K_{y_i'}^{T}}{\sqrt{d_{k}}},\text { } Score_{x_i}=\frac{Q_{x_i} \cdot K_{x_i}^{T}}{\sqrt{d_{k}}}, \end{aligned}$$
(7)
$$\begin{aligned}&Score_{y_i'}=Mask(Score_{y_i'}, W_{y_i'}^{mask}), \end{aligned}$$
(8)
$$\begin{aligned}&Score_{x_i}=Mask(Score_{x_i}, W_{x_i}^{mask}), \end{aligned}$$
(9)
$$\begin{aligned}&H_i^{y_i'}={\text {Softmax}}(Score_{y_i'}) \cdot V_{y_i'}, \end{aligned}$$
(10)
$$\begin{aligned}&H_i^{x_i}={\text {Softmax}}(Score_{x_i}) \cdot V_{x_i}. \end{aligned}$$
(11)

where \(d_{model}\) is the embedding dimension and h is the number of attention heads. \(W_{y_i'}^q\), \(W_{y_i'}^k\), \(W_{y_i'}^v\), \(W_{x_i}^q\), \(W_{x_i}^k\), and \(W_{x_i}^v\) are randomly initialized weight matrices. \(W_{y_i'}^{mask}\) and \(W_{x_i}^{mask}\) are upper triangular matrices. For Mask(A, B), the positions where the value of B is 0 are mapped into A, and the values at these positions in A are set to minus infinity; after Softmax these entries become 0, which achieves the masking. The attention outputs \(H_i^{y_i'}\) and \(H_i^{x_i}\) learned by each head are concatenated and transformed by a weight matrix respectively. The output of MMHSA can be expressed by the formulas given below:

$$\begin{aligned}&MMHSA_{E_{y_i'}}={\text {tanh}}\left( \left\{ H_{1}^{y_i'}; \ldots ; H_{h}^{y_i'}\right\} \cdot W_{y_i'}^{MH}\right) , \end{aligned}$$
(12)
$$\begin{aligned}&MMHSA_{E_{x_i}}={\text {tanh}}\left( \left\{ H_{1}^{x_i}; \ldots ; H_{h}^{x_i}\right\} \cdot W_{x_i}^{MH}\right) . \end{aligned}$$
(13)
Algorithm 1.

Compared with MMHSA, MHSA omits the masking steps in Eqs. (8) and (9). To compensate for the semantic information lost through masking, we introduce a residual network to enhance the robustness of the model and the expressiveness of the network:

$$\begin{aligned}&E_{y_i'}^{residual}=E_{y_i'}+MMHSA(E_{y_i'}),\end{aligned}$$
(14)
$$\begin{aligned}&E_{x_i}^{residual}=E_{x_i}+MMHSA(E_{x_i}). \end{aligned}$$
(15)
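The following NumPy sketch summarizes one pass of MMHSA with the residual connection (Eqs. (3)-(15)) for a single input matrix, either \(E_{y_i'}\) or \(E_{x_i}\). The weights are randomly initialized purely for illustration, and the causal mask follows the stated goal of hiding future positions; this is a sketch under those assumptions, not the authors' implementation.

```python
# Minimal MMHSA + residual sketch (Eqs. (3)-(15)); weights are random for illustration.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mmhsa_with_residual(E, h=4, seed=0):
    """E: (seq_len, d_model) embeddings of the text or the teacher knowledge."""
    rng = np.random.default_rng(seed)
    seq_len, d_model = E.shape
    d_k = d_model // h                                   # Eq. (3)
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        Q, K, V = E @ Wq, E @ Wk, E @ Wv                 # Eqs. (4)-(6)
        score = Q @ K.T / np.sqrt(d_k)                   # Eq. (7)
        mask = np.tril(np.ones((seq_len, seq_len)))      # 1 = visible, 0 = future
        score = np.where(mask == 0, -np.inf, score)      # Eqs. (8)-(9)
        heads.append(softmax(score) @ V)                 # Eqs. (10)-(11)
    W_mh = rng.standard_normal((d_model, d_model))
    out = np.tanh(np.concatenate(heads, axis=-1) @ W_mh) # Eqs. (12)-(13)
    return E + out                                       # residual, Eqs. (14)-(15)
```

Dropping the two masking lines yields the corresponding MHSA layer used as the second step of Reading.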

The design of Reading simulates the process of reading a text: first a teacher and a student each read word by word to understand the details of the text with MMHSA, and then read comprehensively to understand its themes with MHSA. The first layer of the dual cooperative network thus simulates the preparation for a teacher teaching a student to read: the teacher prepares the key points of the text and the student previews it to achieve the best reading performance. This completes the first preparation stage of the cooperation in the dual cooperative network, with both the semantic scope and the text embedded into a shared semantic space.

Interaction. The interaction between the high-dimensional semantic scope from teacher knowledge and the text provides a deeper understanding of the text's semantics.

Let \(O_{y_{i}'}\) denote the output of Reading for the teacher knowledge, and \(O_{x_{i}}\) the output of Reading for the input sequence. First \(O_{y_{i}'}\) and \(O_{x_{i}}\) are concatenated, and then the prior knowledge helps read the text semantics with an MMHSA layer like the first layer of Reading. The processing formula is shown as follows:

$$\begin{aligned}&O_{concat}=[O_{y_{i}'} ; O_{x_{i}}],\end{aligned}$$
(16)
$$\begin{aligned}&O_{Interaction}= O_{concat}\cdot W^{HB}. \end{aligned}$$
(17)

where \(W^{HB}\) is the weight matrix of the hidden bottleneck layer. The bottleneck layer is constructed to significantly reduce the model size without degrading the network performance. The design of Interaction, containing an MMHSA layer, simulates the process of a teacher teaching a student to read word by word. This completes the second cooperation stage of the dual cooperative network, in which the semantics of the semantic scope and of the text are enhanced for better label prediction.

Predicting. Finally, a softmax layer is applied to predict the final labels. The processing formula is shown as follows:

$$\begin{aligned}&Y = \text {Softmax}(O_{Interaction} \cdot W^{Output}). \end{aligned}$$
(18)

where \(W^{Output}\) is the output weight matrix of the fully connected layer.
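A minimal sketch of the Interaction and Predicting computations (Eqs. (16)-(18)) is given below; the intermediate MMHSA step mentioned above is omitted for brevity, and `W_hb` and `W_out` are hypothetical stand-ins for the learned hidden-bottleneck and output weight matrices.

```python
# Sketch of Interaction + Predicting (Eqs. (16)-(18)); the MMHSA step is omitted.
import numpy as np

def interact_and_predict(O_y, O_x, W_hb, W_out):
    O_concat = np.concatenate([O_y, O_x], axis=-1)   # Eq. (16): concatenate Reading outputs
    O_inter = O_concat @ W_hb                        # Eq. (17): hidden-bottleneck projection
    logits = O_inter @ W_out                         # Eq. (18): project to L label scores
    e = np.exp(logits - logits.max())
    return e / e.sum()                               # softmax over the L labels
```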

Loss Function. We train the model with a multi-label one-versus-all loss based on the maximum entropy principle, which is widely used in classification tasks. Specifically, for a predicted score vector Y and a ground-truth label vector \(y_{i}\), the processing formula is shown as follows:

$$\begin{aligned} Loss_i(Y, y_{i})=&-\sum _{j=1}^{L} y_{ij} \times \log \left( (1+\exp (-Y_j))^{-1}\right) \nonumber \\&+\, (1-y_{ij}) \times \log \left( \frac{\exp (-Y_j)}{1+\exp (-Y_j)}\right) . \end{aligned}$$
(19)
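Written directly from Eq. (19), this is the one-versus-all binary cross-entropy of the sigmoid of each label score against the multi-hot ground truth. The sketch below uses a numerically stable log-sigmoid, which is an implementation choice rather than part of the paper.

```python
# One-versus-all loss of Eq. (19) with a numerically stable log-sigmoid.
import numpy as np

def ova_loss(Y, y):
    """Y: (L,) raw label scores, y: (L,) multi-hot ground truth."""
    # log(sigmoid(Y)) = -log(1 + exp(-Y));  log(1 - sigmoid(Y)) = -Y - log(1 + exp(-Y))
    log_sig = -np.logaddexp(0.0, -Y)
    log_one_minus_sig = -Y - np.logaddexp(0.0, -Y)
    return -np.sum(y * log_sig + (1.0 - y) * log_one_minus_sig)
```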

4 Experiments

4.1 Datasets and Preprocessing

Datasets. Three XMTC benchmark datasets with rich hierarchical information and label descriptions are used for the experiments in this paper: AmazonCat-13K [12], EURLex and RCV1 [5]. Table 1 shows the statistics of the three datasets.

Table 1. Data statistics of three XMTC datasets.

Preprocessing Details. For AmazonCat-13K, we truncate each input sequence after 300 words and each label description after 4 words, in the same way as Parabel [15]. The word embeddings we use for AmazonCat-13K come from AttentionXML [17]. For EURLex, we truncate each input sequence after 500 words and each label description after 4 words; the word embeddings for EURLex also come with the dataset. For RCV1, we truncate each input sequence after 250 words and each label description after 16 words, and use pre-trained 400-dimensional Word2Vec [22] word embeddings.
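The truncation rule above can be summarized by the short sketch below; the per-dataset limits are taken from the text, while the tokenization itself (simple whitespace splitting) is an assumption for illustration.

```python
# Per-dataset truncation limits (input words, label-description words) as stated above;
# whitespace tokenization is an illustrative assumption.
LIMITS = {"AmazonCat-13K": (300, 4), "EURLex": (500, 4), "RCV1": (250, 16)}

def truncate(text, label_description, dataset):
    max_text, max_desc = LIMITS[dataset]
    return text.split()[:max_text], label_description.split()[:max_desc]
```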

The results of most of these baseline methods are obtained from XMTC papers [11, 14, 17], and we replicated unpublished results with the original papers' code. The word embedding training for RCV1 follows [16, 22]. The evaluation function implementation follows [10]. The model training framework follows [6]. The experimental code for tail labels follows AttentionXML [17]. The MHSA implementation follows [4], and the number of attention heads h in TReaderXML is set to 4. The initial learning rate for training TReaderXML is 0.0001; after the model converges, learning rate decay is used to further improve the scores, and Adam [13] is used for all deep learning model training. Our experimental configuration is an RTX 2080 Ti GPU and 128 GB of memory. We could not replicate AnnexML [23] on the EURLex dataset due to memory problems.

4.2 Baselines

We compare the proposed TReaderXML to the most representative XMTC methods that address the data sparsity issue, including AnnexML [23], PfastreXML [9], Parabel [15], FastText [1], Bonsai [18], XML-CNN [11], and AttentionXML [17]. Table 2 compares TReaderXML with the baseline methods; the results marked with stars are taken directly from XMTC papers [11, 14, 17].

TReaderXML outperforms all baseline XMTC methods on most evaluation metrics and achieves results comparable to the current approaches on the remaining few, the only exception being a slightly lower P@1 than LightXML on AmazonCat-13K. Compared to leading extreme classifiers, TReaderXML is up to 0.16% better in P@1 on EURLex. On RCV1, TReaderXML shows a substantial improvement at P@1. We attribute this to the effective prior knowledge, which makes the prediction at the first position more accurate, while TReaderXML remains close to existing XMTC methods on the other evaluation metrics due to the small label dimensionality of RCV1.

4.3 Evaluation Metrics

Classification accuracy is evaluated with Precision at k (P@k), normalized Discounted Cumulative Gain at k (nDCG@k) and Propensity Scored Precision at k (PSP@k), following AttentionXML [17].
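For reference, the sketch below computes P@k and PSP@k for a single test instance; the label propensities `p` are assumed to be precomputed, e.g. with the propensity model used by PfastreXML [9], and normalizing PSP@k by the best achievable score follows common practice rather than a definition given in this paper.

```python
# Sketch of P@k and PSP@k for one instance; propensities `p` assumed precomputed.
import numpy as np

def precision_at_k(scores, y_true, k):
    top = np.argsort(-scores)[:k]            # indices of the k highest-scored labels
    return y_true[top].sum() / k

def psp_at_k(scores, y_true, p, k):
    top = np.argsort(-scores)[:k]
    gained = (y_true[top] / p[top]).sum() / k
    # normalize by the best achievable propensity-scored precision at k
    best = np.sort(y_true / p)[::-1][:k].sum() / k
    return gained / best if best > 0 else 0.0
```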

4.4 Ablation Study

We conduct an ablation study, shown in Table 3, to discuss the proposed structures of the dual cooperative network in TReaderXML. In detail, we explore the effectiveness of the teacher knowledge branch and of the Reading part.

Teacher Knowledge. Config. IDs 0 and 1 show the effectiveness of teacher knowledge: with the dynamic and fine-grained semantic scope from teacher knowledge, Config. ID 1 improves 5.2% over Config. ID 0 in the setting without the Reading part.

Table 2. Performance of TReaderXML and baseline methods over three datasets (The best results are highlighted in bold).

Reading. Config. IDs 2, 3, 4 and 6 show the plausibility of the Reading structure. Config. ID 2 is similar to a person only reading word by word, which cannot comprehensively grasp the themes of a text. Config. ID 3 is similar to a person only reading the themes of a text, which cannot carefully understand its details. Config. ID 4 is similar to a person reading the themes of a text first and then its details, which is not always consistent with human reading habits. Config. ID 6 simulates the human reading process: reading word by word to understand the details and then reading comprehensively to understand the themes. It is therefore feasible to simulate human reading with the Reading structure. Config. IDs 5 and 6 show the effectiveness of the residual layer: Config. ID 6 improves 0.44% over Config. ID 5 by adding the residual part.

Table 3. Ablation study of TReaderXML on AmazonCat-13K (The best results are highlighted in bold).

4.5 Performance on Tail Labels

To evaluate the performance of TReaderXML on tail labels, we discuss the tail label results on AmazonCat-13K, the dataset with the most tail labels. From Table 4, we see that TReaderXML achieves state-of-the-art results at PSP@5, and is only slightly worse than PfastreXML [9] at PSP@1 and PSP@3. PfastreXML replaces the nDCG loss in FastXML [21] with its propensity-scored variant, which is unbiased and assigns higher rewards to tail label predictions; however, this comes at a loss in overall prediction accuracy.

Table 4. Performance on tail labels in AmazonCat-13K (The best results are highlighted in bold).

5 Conclusions

In this work, our method TReaderXML defines a semantic scope from teacher knowledge, which inherits the strength of hierarchical label information and meanwhile provides dynamic high-level category information as semantic supplements and constraints. The proposed dual cooperative network learns semantic information in the way people read. Moreover, teacher knowledge can flexibly incorporate prior label information such as semantic structures or descriptions.