
1 Introduction

In intelligent dialogue systems, spoken language understanding (SLU) is critical; it typically comprises two main subtasks: intent detection and slot filling [1]. Given an utterance such as “play music on youtube”, the intent label is “PlayMusic” and the slots are labeled, in order, as {O, O, O, B-service}.

In the past, researchers focused mostly on English SLU and proposed a variety of methods, including gate-based methods [2,3,4], attention-based methods [5, 6] and GAT-based methods [7, 8]. In contrast to English SLU, Chinese SLU faces the difficulty of word segmentation: when the word segmentation is wrong, the error propagates and ultimately causes slot filling errors. To avoid this problem, [9] established a character-based collaborative memory network that sidesteps word segmentation altogether. [10] introduced a two-stage character-level modeling approach that exploits the crossover effects between intent and slot information. However, it is commonly understood that Chinese word segmentation is critical for interpreting slots in an utterance. Take the utterance “我/想/听/稻香 (I want to listen to Rice Fragrance)” as an example, where “/” splits the words. A purely character-level model is likely to wrongly predict the slot of “稻 (rice)” as “Rice_name”, whereas a model informed by the segmented word “稻香 (rice fragrance)” can easily predict its slot as “Song”. To inject word information into Chinese SLU, [11] introduced a word adapter to combine character and word information. However, they did not consider the influence of redundant information in the word adapter: in the example above, the word “稻香 (song)” and the single character “稻 (rice_name)” are semantically different, which causes semantic ambiguity when injecting word information; moreover, their model lacks the guidance of slot information on intent detection.

To address these issues, we propose a novel bi-directional interaction graph framework that jointly models slot filling and intent detection, exploiting the correlation between slot and intent information in Chinese SLU while alleviating the redundancy caused by directly fusing character-word semantic information: (1) a bi-directional interaction graph, which treats slot and intent information as feature nodes, creates bi-directionally connected edges between them, and lets them interact via a multi-layer graph attention network; (2) a filter gate, which fuses character-word semantic information efficiently by eliminating the redundant information produced by direct fusion and controlling the propagation of effective semantic information.

In summary, the contributions of this work are as follows:

  • We propose a bi-directional interaction graph for interacting with slot and intent information that takes into account the reciprocal facilitation of slot filling and intent detection.

  • We propose a filter gate to limit the impact of redundant information caused by directly fusing character-word semantic information.

  • Experiments on the CAIS and SMP-ECDT datasets demonstrate that our model outperforms the best benchmark model and achieves state-of-the-art performance.

2 Related Works

Slot Filling and Intent Detection. Researchers have proposed many implicit joint models that consider the relationship between slot filling and intent detection [6, 12,13,14]; essentially, these models fail to establish an explicit relationship between the two tasks. Later, some researchers started to explore intent-augmented joint models and proposed many excellent approaches [2, 3, 11, 15,16,17]. Nevertheless, these models do not account for the guiding role of slot information in intent detection. Recently, researchers have begun to explore models in which the two tasks guide each other [4, 10, 18,19,20,21].

Graph Neural Networks. Currently, graph neural networks perform very well in many fields. [22] applied graph attention networks to short text classification. [23] improved aspect-level sentiment classification by clarifying the dependencies between words through graph attention networks. Due to some limitations of GCNs, researchers proposed new approaches [24, 25]. In SLU, [7, 8] improved model performance by building effective interaction graphs. In our BIG-FG, a bi-directional interaction graph is built on a multi-layer graph attention network to explicitly model the relationship between the two tasks, fully exploiting the mutual facilitation between intent detection and slot filling to enhance the performance of the model.

3 Approach

This work contributes to implementing slot filling and intent detection for Chinese SLU. In this section, the proposed approach will be introduced in detail. The overall framework of the model is shown in Fig. 1 (a). Firstly, the text encoding layer is introduced to realize the vectorized representation of characters and words in an utterance. Secondly, we propose the adaptive fusion module to obtain the slot and intent information. Next, the intent nodes and slot nodes are learned through an interrelated connection fusion using the bi-directional interaction graph. Finally, through a cooperative learning schema, slot filling and intent detection are optimized concurrently.

Fig. 1. The overall architecture of our proposed bi-directional interaction graph framework with filter gate fusion mechanism, where Cat denotes the concatenation operation. The internal structure of the adaptive fusion module is shown in (b).

3.1 Text Encoding Layer

Following [11], we utilize a novel text encoding structure to obtain character-word information representations, which consists primarily of an embedding encoder, a self-attention layer, and a Bi-LSTM.

Character Encoding. Given a Chinese utterance \(\boldsymbol{X}=\{x_1,x_2,x_3,\cdots ,x_T \}\), where T denotes the number of characters, each character is first transformed into a character vector \(\boldsymbol{E}^c=\{\boldsymbol{e}^{c}_{1},\boldsymbol{e}^{c}_{2},\cdots ,\boldsymbol{e}^{c}_{T} \}\); then, high-level semantic information is obtained by the self-attention and the Bi-LSTM, respectively.

The self-attention [26] captures contextual features of the characters in the utterance, and the Bi-LSTM models the sequence from both directions, which together express richer semantic information. We feed the character vectors \(\boldsymbol{E}^c\) into the self-attention and the Bi-LSTM, respectively. The output vectors are \(\boldsymbol{H}^{A}=\{\boldsymbol{h}_{1}^{A},\boldsymbol{h}_{2}^{A},\cdots ,\boldsymbol{h}_{T}^{A}\}\) and \(\boldsymbol{H}^L=\{\boldsymbol{h}_1^L,\boldsymbol{h}^L_2,\cdots ,\boldsymbol{h}^L_T\}\), respectively.

Finally, the semantic information \(\boldsymbol{h}^c_t=[\boldsymbol{h}^A_t,\boldsymbol{h}^L_t]\) is obtained by concatenating self-attention and Bi-LSTM. The final output sequence of the character-level semantic information is \(\boldsymbol{H}^c=\{\boldsymbol{h}_1^c,\boldsymbol{h}^c_2,\cdots ,\boldsymbol{h}^c_T\}\).
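As a concrete illustration, the following PyTorch sketch shows one possible way to realize this character encoder (self-attention and Bi-LSTM outputs concatenated per character); the class name, dimensions, and layer choices are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Illustrative character encoder: H^c = [H^A, H^L] per character."""

    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128, num_heads=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.self_attn = nn.MultiheadAttention(emb_dim, num_heads, batch_first=True)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2, bidirectional=True,
                              batch_first=True)

    def forward(self, char_ids):                 # char_ids: (batch, T)
        e_c = self.embedding(char_ids)           # E^c: (batch, T, emb_dim)
        h_a, _ = self.self_attn(e_c, e_c, e_c)   # H^A: contextual features
        h_l, _ = self.bilstm(e_c)                # H^L: bidirectional features
        return torch.cat([h_a, h_l], dim=-1)     # H^c = [H^A, H^L]
```

The word encoder described next would follow the same pattern, operating on the segmented word sequence instead of characters.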

Word Encoding. We use an external CWS (Chinese Word Segmentation) system. Given the Chinese utterance \(\boldsymbol{X}\), we obtain the word sequences \(\boldsymbol{E}^w=\{\boldsymbol{e}^{w}_{1},\boldsymbol{e}^{w}_{2},\cdots ,\boldsymbol{e}^{w}_{M} \} (M\leqslant T)\) by word segmentation and vectorization. The rest of the encoding part is the same as the character encoding, and the final word-level semantic encoding output is denoted as \(\boldsymbol{H}^w=\{\boldsymbol{h}_1^w,\boldsymbol{h}^w_2,\cdots ,\boldsymbol{h}^w_M\}\).

3.2 Adaptive Fusion Module

3.2.1 Fusion Module.

Figure 1(b) shows the structure of the adaptive fusion module. As one of the contributions of this work, a filter gate mechanism is proposed on top of bilinear fusion to reduce the impact of the redundant information brought by character-word fusion. Following [11], with the word vector \(\boldsymbol{v}^w\in \mathbb {R}^d\) and character vector \(\boldsymbol{v}^c\in \mathbb {R}^d\) as inputs, we obtain the normalized word and character vectors \(\boldsymbol{a}^w\), \(\boldsymbol{a}^c\) via the Tanh activation function. After that, a weight is calculated using a bilinear function, and the two vectors are weighted and summed to obtain \(\boldsymbol{v}^{wc}\). The calculation procedure is as follows:

$$\begin{aligned} \boldsymbol{a}^w=tanh(\boldsymbol{W}_{aw}\boldsymbol{v}^w+\boldsymbol{b}^{aw} ) \end{aligned}$$
(1)
$$\begin{aligned} \boldsymbol{a}^c=tanh(\boldsymbol{W}_{ac}\boldsymbol{v}^c+\boldsymbol{b}^{ac} ) \end{aligned}$$
(2)
$$\begin{aligned} \lambda =sigmoid(\boldsymbol{v}^c\boldsymbol{W}_{\lambda }\boldsymbol{v}^w + b^{\lambda } ) \end{aligned}$$
(3)
$$\begin{aligned} \boldsymbol{v}^{wc} =(1-\lambda )\boldsymbol{a}^c + {\lambda }\boldsymbol{a}^w \end{aligned}$$
(4)

where \(\boldsymbol{W}_{aw}\), \(\boldsymbol{W}_{ac}\), \(\boldsymbol{W}_{\lambda }\) are the trainable matrix weights and \(\boldsymbol{b}^{aw}\), \(\boldsymbol{b}^{ac}\), \(b^{\lambda }\) are the bias values of the linear transformation.

Considering the potential redundant information, we propose a filter gate to selectively utilize the fused feature. When the fusion feature \(\boldsymbol{v}^{wc}\) is advantageous, the filter gate combines the fused and original features to obtain explicit slot boundary information. The calculation process is as follows:

$$\begin{aligned} \boldsymbol{f_c}=\boldsymbol{W}_{fc}[\boldsymbol{a}^c,\boldsymbol{v}^{wc}]+\boldsymbol{b}^{fc} \end{aligned}$$
(5)
$$\begin{aligned} \boldsymbol{f}_w=\boldsymbol{W}_{fw}[\boldsymbol{a}^w,\boldsymbol{v}^{wc}]+\boldsymbol{b}^{fw} \end{aligned}$$
(6)
$$\begin{aligned} f_g=sigmoid(\boldsymbol{W}_g[\boldsymbol{f}_c,\boldsymbol{f}_w]+b^{wc}) \end{aligned}$$
(7)
$$\begin{aligned} \boldsymbol{v}=f_g*\tanh (\boldsymbol{W}_v\boldsymbol{v}^{wc}+\boldsymbol{b}^g) \end{aligned}$$
(8)

where \(\boldsymbol{W}_{fc}\), \(\boldsymbol{W}_{fw}\), \(\boldsymbol{W}_{g}\), \(\boldsymbol{W}_{v}\) are the trainable matrix weights; \(\boldsymbol{b}^{fc}\), \(\boldsymbol{b}^{fw}\), \(b^{wc}\), and \(\boldsymbol{b}^{g}\) are the bias values of the linear transformation, [, ] represents the concatenation operation, and \(\boldsymbol{v}\) is the final fusion output. The above formula for the adaptive fusion layer can be abbreviated as \( \boldsymbol{v}=AFM(\boldsymbol{v}^c,\boldsymbol{v}^w) \).
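For concreteness, a minimal PyTorch sketch of the adaptive fusion module (Eqs. 1-8) is given below; the module and variable names, and the use of nn.Bilinear for Eq. (3), are our own assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveFusionModule(nn.Module):
    """Sketch of AFM: bilinear character-word fusion followed by the filter gate."""

    def __init__(self, d):
        super().__init__()
        self.proj_w = nn.Linear(d, d)          # W_aw, b_aw
        self.proj_c = nn.Linear(d, d)          # W_ac, b_ac
        self.bilinear = nn.Bilinear(d, d, 1)   # W_lambda, b_lambda
        self.f_c = nn.Linear(2 * d, d)         # W_fc, b_fc
        self.f_w = nn.Linear(2 * d, d)         # W_fw, b_fw
        self.gate = nn.Linear(2 * d, 1)        # W_g, b_wc
        self.out = nn.Linear(d, d)             # W_v, b_g

    def forward(self, v_c, v_w):
        a_w = torch.tanh(self.proj_w(v_w))                             # Eq. (1)
        a_c = torch.tanh(self.proj_c(v_c))                             # Eq. (2)
        lam = torch.sigmoid(self.bilinear(v_c, v_w))                   # Eq. (3)
        v_wc = (1 - lam) * a_c + lam * a_w                             # Eq. (4)
        f_c = self.f_c(torch.cat([a_c, v_wc], dim=-1))                 # Eq. (5)
        f_w = self.f_w(torch.cat([a_w, v_wc], dim=-1))                 # Eq. (6)
        f_g = torch.sigmoid(self.gate(torch.cat([f_c, f_w], dim=-1)))  # Eq. (7)
        return f_g * torch.tanh(self.out(v_wc))                        # Eq. (8)
```

The same module is reused below for both intent fusion and slot fusion, i.e. \(\boldsymbol{v}=AFM(\boldsymbol{v}^c,\boldsymbol{v}^w)\).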

Intent Fusion. We employ MLP attention to obtain an informative character-level representation of the entire utterance \(\boldsymbol{h}^{mc}\in \mathbb {R}^d\). Similarly, we obtain the word-level representation of the whole utterance \(\boldsymbol{h}^{mw}\in \mathbb {R}^d\) and get the fused intent representation \(\boldsymbol{h}^I\) through the adaptive fusion layer.

$$\begin{aligned} \boldsymbol{h}^I=AFM(\boldsymbol{h}^{mc},\boldsymbol{h}^{mw}) \end{aligned}$$
(9)
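A possible realization of the MLP-attention pooling used here is sketched below; the exact scoring function is an assumption, since the paper does not spell it out.

```python
import torch
import torch.nn as nn

class MLPAttention(nn.Module):
    """Pools a character (or word) sequence into a single utterance vector."""

    def __init__(self, d):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, 1))

    def forward(self, h):                                # h: (batch, T, d)
        weights = torch.softmax(self.score(h), dim=1)    # attention over positions
        return (weights * h).sum(dim=1)                  # (batch, d) utterance vector

# h_I = AFM(MLPAttention(d)(H_c), MLPAttention(d)(H_w))  # Eq. (9)
```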

Slot Fusion. Through a unidirectional LSTM, we obtain more appropriate slot information \(\boldsymbol{h}^{sc}_t\), \(\boldsymbol{h}^{sw}_t\). Then, through the adaptive fusion layer, the fused slot information is obtained, denoted as \(\boldsymbol{h}^S_t\).

$$\begin{aligned} \boldsymbol{h}^{S}_t=AFM(\boldsymbol{h}^{sc}_t,\boldsymbol{h}^{sw}_{f_{align}(t,\boldsymbol{w})}) \end{aligned}$$
(10)
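Here \(f_{align}(t,\boldsymbol{w})\) aligns each character position with the segmented word containing it, so that character-level and word-level slot features can be fused position by position. A hypothetical helper illustrating this mapping on the running example is shown below.

```python
def f_align(t, word_lengths):
    """Hypothetical alignment helper: maps character position t (0-based) to the
    index of the segmented word containing it. `word_lengths` holds the number
    of characters per word, e.g. "我/想/听/稻香" -> [1, 1, 1, 2]."""
    end = 0
    for m, length in enumerate(word_lengths):
        end += length
        if t < end:
            return m
    raise IndexError("character position exceeds utterance length")

# f_align(3, [1, 1, 1, 2]) == 3  # "稻" belongs to the word "稻香"
```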

3.3 Bi-directional Interaction Graph Module

Another contribution point of this work is to carry out the interaction of slot information and intent information. We propose a bi-directional interaction graph module, as shown in Fig. 1. By constructing different edges and feature nodes, a multi-layer graph attention network is utilized to fully interact with the information between the intent and slot.

Graph Attention Network. The GAT is a crucial network structure in deep learning which utilizes the attention mechanism to adaptively weight different neighboring edges, significantly enhancing its expressive capability. Given a series of feature nodes \(\boldsymbol{Z}=\{\boldsymbol{z_1},\boldsymbol{z_2},\cdots ,\boldsymbol{z_N}\}\), where N is the total number of nodes, the graph attention network generates new node features \(\boldsymbol{Z}^{\prime }=\{\boldsymbol{z}_1^{\prime }, \boldsymbol{z}_2^{\prime }, \cdots , \boldsymbol{z}_N^{\prime }\}\) as the output.

$$\begin{aligned} \alpha _{ij} = \frac{exp(f(\boldsymbol{a}^T[\boldsymbol{W}_z\boldsymbol{z}_i,\boldsymbol{W}_z\boldsymbol{z}_j]))}{\sum _{{k^{\prime }}\in N_i}exp(f(\boldsymbol{a}^T[\boldsymbol{W}_z\boldsymbol{z}_i,\boldsymbol{W}_z\boldsymbol{z}_{k^{\prime }}]))} \end{aligned}$$
(11)
$$\begin{aligned} \boldsymbol{z}^{\prime }_i=\mathop {\parallel }\limits _{k=1}^K\sigma (\sum _{j \in N_i}\alpha ^k_{ij}\boldsymbol{W}^k_z\boldsymbol{z}_j) \end{aligned}$$
(12)

where \(\alpha ^k_{ij}\) denotes the attention weight of the k-th head, \(\boldsymbol{W}_z^k\) denotes the k-th trainable weight matrix, and \(\parallel \) denotes the concatenation operation. In this work, the multi-head graph attention layer is directly applied to the bi-directional interaction graph.
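The following sketch shows a dense-adjacency implementation of such a multi-head graph attention layer; the LeakyReLU scoring and ELU output activation follow the standard GAT formulation and are assumptions where the paper only writes \(f\) and \(\sigma\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Minimal multi-head graph attention layer in the spirit of Eqs. (11)-(12).
    `adj` is a dense 0/1 adjacency matrix that should include self-loops."""

    def __init__(self, d_in, d_out, heads=8):
        super().__init__()
        self.heads, self.d_out = heads, d_out
        self.W = nn.Linear(d_in, heads * d_out, bias=False)   # W_z, one slice per head
        self.a = nn.Parameter(torch.empty(heads, 2 * d_out))  # attention vector a
        nn.init.xavier_uniform_(self.a)

    def forward(self, z, adj):                                 # z: (N, d_in), adj: (N, N)
        N = z.size(0)
        wz = self.W(z).view(N, self.heads, self.d_out)         # (N, K, d_out)
        # e_ij = LeakyReLU(a^T [W z_i, W z_j]), split into source/target terms
        src = (wz * self.a[:, :self.d_out]).sum(-1)            # (N, K)
        dst = (wz * self.a[:, self.d_out:]).sum(-1)            # (N, K)
        e = F.leaky_relu(src.unsqueeze(1) + dst.unsqueeze(0))  # (N, N, K)
        e = e.masked_fill(adj.unsqueeze(-1) == 0, float('-inf'))
        alpha = torch.softmax(e, dim=1)                        # Eq. (11): normalize over neighbors
        out = torch.einsum('ijk,jkd->ikd', alpha, wz)          # weighted aggregation
        return F.elu(out.reshape(N, -1))                       # Eq. (12): heads concatenated
```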

Bi-directional Interaction Graph. The correlation between intent and slot is essential in SLU: the slot reflects character-level information, while the intent reflects sentence-level information.

For the feature nodes of the bi-directional interaction graph, we concatenate the intent hidden layer representation \(\boldsymbol{h}^I\) obtained from the intent fusion layer and the slot hidden layer representation \(\boldsymbol{h}^S_t\) obtained from the slot fusion layer to be the nodes of the bi-directional interaction graph \(\boldsymbol{H}^{[l]}_g=\{\boldsymbol{h}^{I,[l]},\boldsymbol{h}^{S,[l]}_1,\boldsymbol{h}^{S,[l]}_2,\cdots ,\boldsymbol{h}^{S,[l]}_T\}\). \(\boldsymbol{h}^{I,[l]}\) denotes the intent hidden layer feature of the l-th layer and \(\boldsymbol{h}^{S,[l]}_t\) denotes the slot hidden layer feature of the l-th layer.

For the edges of the bi-directional interaction graph, we establish a bi-directional connection between each slot node and the intent node; in addition, due to the correlation between contexts, bi-directional connections are also established between slot nodes at adjacent positions.

In order to fully exchange features between the intent and slots, a multi-layer bi-directional interaction graph is constructed. For a bi-directional interaction graph with \((l+1)\) layers of interaction, the hidden feature of the \((l+1)\)-th layer can be obtained, and this hidden feature is used as the final output.

$$\begin{aligned} \boldsymbol{H}^{[l+1]}_g=multi\text{- }head\,GAT^{[l]}(\boldsymbol{H}^{[l]}_g) \end{aligned}$$
(13)
$$\begin{aligned} \boldsymbol{h}^{fI},\boldsymbol{h}^{fs}_t=\boldsymbol{h}^{I,[l+1]},\boldsymbol{h}^{S,[l+1]}_t \end{aligned}$$
(14)

where \(multi\text{- }head\,GAT^{[l]}\) represents the multi-head graph attention network at l-th layer, \(\boldsymbol{h}^{I,[l+1]}\) and \(\boldsymbol{h}^{S,[l+1]}_t\) are the intent feature and slot feature at \((l+1)\)-th layer, respectively, \(\boldsymbol{h}^{fI}\) is the output of the intent feature, and \(\boldsymbol{h}^{fs}_t\) represents the output of the slot feature.
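To make the graph construction concrete, the sketch below builds the adjacency of the bi-directional interaction graph (one intent node plus T slot nodes) and shows how the \((l+1)\) GAT layers could be stacked as in Eqs. (13)-(14); all names are illustrative.

```python
import torch

def build_big_adjacency(T):
    """Hypothetical adjacency for the BIG: node 0 is the intent node and
    nodes 1..T are slot nodes. The intent node is bi-directionally connected
    to every slot node, adjacent slot nodes are connected to each other,
    and self-loops are kept for the GAT layer."""
    adj = torch.eye(T + 1)
    adj[0, 1:] = 1
    adj[1:, 0] = 1                          # intent <-> every slot node
    for t in range(1, T):
        adj[t, t + 1] = adj[t + 1, t] = 1   # adjacent slot nodes
    return adj

# Stacking (l+1) GAT layers as in Eq. (13), with GATLayer from the sketch above:
# H_g = torch.cat([h_I.unsqueeze(0), H_S], dim=0)   # nodes: intent + T slots
# for gat in gat_layers:
#     H_g = gat(H_g, adj)
# h_fI, h_fS = H_g[0], H_g[1:]                      # Eq. (14)
```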

Through the linear layer, \(\boldsymbol{h}^{fI}\) and \(\boldsymbol{h}^{fs}_t\) are used for intent detection and slot filling, respectively. \(\boldsymbol{y}^I=softmax(\boldsymbol{W}_{fI}\boldsymbol{h}^{fI})\) and \(\boldsymbol{y}^S_t=softmax(\boldsymbol{W}_{fs}\boldsymbol{h}^{fs}_t)\), where \(\boldsymbol{W}_{fI}\) and \(\boldsymbol{W}_{fs}\) are trainable parameters. \(O^I=argmax(\boldsymbol{y}^I)\) is the predicted intent tags and \(O^S_t=argmax(\boldsymbol{y}^S_t)\) is the predicted slot labels in an utterance.

3.4 Loss Function

In this work, cross-entropy is employed as the loss function. The training objective, combining the intent and slot losses, is to minimize the following loss function:

$$\begin{aligned} \mathcal {L}_\theta =-\mu \sum _{i = 1}^{N_I}\hat{y}_i^Ilog(y_i^I) -(1-\mu )\sum _{t = 1}^{T}\sum _{i = 1}^{N_S}\hat{y}_t^{S,i}log(y_t^{S,i}) \end{aligned}$$
(15)

where \(N_I\) indicates the number of intent labels, T indicates the number of characters in an utterance, \(N_S\) indicates the number of slot labels, \(\mu \) is a hyperparameter, \(\hat{y}^I\) and \(\hat{y}^S\) indicate the true tags of the intent and the true tags of the slot, respectively.
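A minimal sketch of this joint objective, assuming standard cross-entropy over logits and an illustrative default of \(\mu =0.5\), could look as follows:

```python
import torch
import torch.nn.functional as F

def joint_loss(intent_logits, slot_logits, intent_gold, slot_gold, mu=0.5):
    """Joint objective of Eq. (15); mu = 0.5 is an assumption, not the paper's value.
    intent_logits: (batch, N_I); slot_logits: (batch, T, N_S)."""
    intent_loss = F.cross_entropy(intent_logits, intent_gold)
    slot_loss = F.cross_entropy(slot_logits.flatten(0, 1), slot_gold.flatten())
    return mu * intent_loss + (1 - mu) * slot_loss
```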

4 Experiment

4.1 Datasets and Evaluation Metrics

To verify the feasibility of the approach, two openly accessible Chinese datasets, CAIS [9] and SMP-ECDT [11], are selected for the experiments. The CAIS dataset contains 7995 training, 994 validation, and 1024 test utterances. The SMP-ECDT dataset contains 1655 training, 413 validation, and 508 test utterances. Following [2, 16], we evaluate intent prediction with accuracy, slot filling with F1 score, and utterance-level semantic frame parsing with overall accuracy. In this work, the Chinese natural language processing system LTP (Language Technology Platform) is adopted to obtain Chinese word segmentation.

4.2 Implementation Details

We conduct experiments on a Tesla A100 GPU with the PyTorch framework. All model weights are initialized from a uniform distribution. The dropout rate is 0.5. The graph attention network has 2 layers, each with a hidden dimension of 128 and 8 heads. The maximum norm for gradient clipping is set to 1.0. The L2 norm coefficient is \(10^{-6}\). Adam with a learning rate of \(5\times 10^{-4}\) is used to update all parameters.
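For reference, a configuration sketch mirroring these settings might look as follows; the placeholder model and names are illustrative, not the authors' training script.

```python
import torch
import torch.nn as nn

# hyperparameters as reported in Sect. 4.2
config = dict(dropout=0.5, gat_layers=2, gat_hidden=128, gat_heads=8,
              max_grad_norm=1.0, weight_decay=1e-6, lr=5e-4)

model = nn.Linear(128, 128)   # placeholder standing in for the full BIG-FG model
optimizer = torch.optim.Adam(model.parameters(),
                             lr=config["lr"], weight_decay=config["weight_decay"])
# gradient clipping is applied after loss.backward() in each training step
torch.nn.utils.clip_grad_norm_(model.parameters(), config["max_grad_norm"])
```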

Table 1. Main results on CAIS and SMP-ECDT.

4.3 Baseline Models

In order to compare with other researchers’ models, several meaningful baseline models are selected, including Slot-Gated [2], CM-Net [9], SF-ID Network [19], MLWA [11], Stack-Propagation [16] and GAIR [10]. We take the performance figures of these models on the CAIS dataset from the literature [11]. On the SMP-ECDT dataset, we run the published code of the comparative models on the split test set, with the exception of CM-Net [9], whose code is not publicly available.

4.4 Main Results

Table 1 shows the main results of the proposed model and several comparative baselines on the CAIS and SMP-ECDT datasets. From the results, we can notice that the GAIR model, which does not inject word information, performs somewhat better than the MLWA model, which does, on all metrics. This is because GAIR is two-stage and takes into account the bi-directional correlation between intent and slot information, whereas MLWA only considers the influence of intent information on slots, despite the addition of word information. Comparing our proposed BIG-FG model with the GAIR model, on the CAIS dataset we achieve a 0.91% increase in Slot (F1), a 0.20% increase in Intent (Acc), and a 0.39% increase in Overall (Acc); on the SMP-ECDT dataset, we achieve improvements of 0.48% on Slot (F1), 1.16% on Intent (Acc), and 2.65% on Overall (Acc). These results show that our model performs better than the best baseline and achieves state-of-the-art performance. We attribute the improvement to the following reasons: (1) our model introduces word information while using character information, which solves the problem of ambiguous word boundaries; (2) considering the mutual facilitation between slot filling and intent detection, the BIG module carries out effective interaction between the two tasks and enhances model performance; (3) the filter gate based on bilinear fusion reduces redundant information, further improving performance. In contrast, the MLWA model fails to consider the influence of redundant information and the guiding role of slot information on intent detection, and the GAIR model ignores the influence of Chinese word segmentation.

Table 2. Ablation study on CAIS and SMP-ECDT datasets.

4.5 Analysis

To investigate the effect of the components in the proposed model, ablation experiments are carried out to verify their effectiveness. Table 2 shows the results of the ablation experiments. In addition, we also explore the effect of the number of BIG layers and of different word segmentors on the performance of the model.

Effect of Intent Fusion Layer. To explore whether the intent fusion layer plays a role in the model, the intent fusion layer is removed in this experiment, meaning that the intent information provided by the characters and words is utilized directly, which is named as w/o intent fusion. From the results in w/o intent fusion row in Table 2, we can observe 1.28% and 1.30% drops on Intent (Acc) on the CAIS and SMP-ECDT datasets, respectively, and other evaluation metrics also decrease, which demonstrates that the intent fusion layer can efficiently fuse the intent information provided by the characters and words, and achieve improved semantic features for intent detection.

Effect of Slot Fusion Layer. Similarly, we remove the slot fusion layer to investigate whether the slot fusion layer enriches semantic knowledge of slot, which is named as w/o slot fusion. The experimental results are shown in w/o slot fusion row in Table 2. Slot (F1) decreases by 2.98% on the CAIS dataset and by 1.65% on the SMP-ECDT dataset. This indicates that directly fusing the slot information provided by characters and words is not effective, and the slot fusion layer in our model can improve the information representation of slot.

Effect of Filter Gate. The filter gate reduces the redundant information that would be produced by the direct fusion of character-word semantic information. To investigate the specific effect of the filter gate in our model, we remove the filter gate and directly fuse character-word semantic information with bilinearity, which is named as w/o filter gate. The w/o filter gate row in Table 2 shows that the Slot (F1), Intent (Acc), and Overall (Acc) decrease on both CAIS and SMP-ECDT datasets, indicating that bilinear fusion is not effective when there is no filter gate, and the filter gate plays an important role in the BIG-FG model.

Fig. 2. Effect of the number of BIG layers; the horizontal axis indicates the number of layers of the BIG.

Fig. 3. Effect of word segmentors; the horizontal axis indicates the different word segmentors.

Effect of Bi-directional Interaction Graph. To explore the effect of the proposed bi-directional interaction graph module, we remove it and directly use an LSTM as the decoder, so that intent detection and slot filling are modeled independently. This variant is named w/o BIG. The w/o BIG row in Table 2 shows that the evaluation metrics on both datasets drop considerably, which indicates that independent modeling of intent and slot information is inferior to explicit joint modeling; the BIG fully exchanges intent and slot features, thus enhancing overall performance.

Effect of the Number of BIG Layers. To investigate the influence of the number of BIG layers, we plot the relationship between Overall (Acc) and the number of BIG layers, as shown in Fig. 2. It can be clearly seen that model performance improves with the number of BIG layers, and the best performance on the CAIS and SMP-ECDT datasets is achieved with 2 layers, after which performance decreases and gradually stabilizes. The reason is that with too many layers, the model tends toward over-smoothing, which introduces redundant information. In general, choosing the optimal number of BIG layers improves the performance of the model.

Effect of Different Word Segmentors. We choose five different word segmentors for our experiments to investigate their impact on the performance of our model: Jieba, LTP, PKUSeg, HanLP, and Stanford. Furthermore, we add an additional set of experiments without word segmentation information to validate the benefit of adding it. Figure 3 shows the results. Using Overall (Acc) as the evaluation metric, we can see that different word segmentation methods perform differently; in our model, the LTP segmentor works best. It is noteworthy that the model with word information outperforms the model without it, which indicates the effectiveness of word segmentation.

5 Conclusion and Future Work

In this work, we proposed a novel bi-directional interaction graph framework with a filter gate mechanism for Chinese SLU. While reducing redundant information, we effectively fused the semantic information provided at the character level as well as the word level. Furthermore, we took advantage of the correlation between intent and slot features to enrich their semantic representations, thus improving model performance. Experiments on the SMP-ECDT and CAIS datasets showed that our model achieves the best performance. In the future, we will consider adding prior knowledge and exploring a new fusion mechanism to fully fuse word information and further enhance Chinese spoken language understanding.