1 Introduction

Natural language generation (NLG) (Mann 1982), also known as text generation, is one of the most important tasks in the field of natural language processing (Chowdhury 2003). NLG has been extensively studied in many applications, such as dialogue systems (Chen et al. 2017), machine translation (Cho et al. 2014), and text summarization (Nallapati et al. 2016). In this paper, however, we concentrate on Chinese lyric text generation. Different from prose texts, lyrics exhibit significant characteristics of their own, including rhyme, rhetoric and repeated structures. From a narrative perspective, a lyrics paragraph concentrates on one main topic due to its limited length, in contrast to long documents that often cover several topics. Moreover, lyric sentences are short, typically 8 to 15 words, which results in a close contextual relationship between adjacent sentences.

In general, most text generation models can be extended to lyrics generation. In the area of text generation, there exist two main approaches: the probabilistic language model (LM) and the Sequence-to-Sequence (Seq2Seq) model. LMs, which predict the next word conditioned on the prior context, have been successfully used in various NLG applications. For instance, Bengio constructed a neural language model with a feed-forward network over the preceding context words (Bengio et al. 2003). Mikolov then improved language modeling with a recurrent neural network (RNN) (Mikolov 2010). However, even an LM built on long short-term memory (LSTM) networks (Hochreiter and Schmidhuber 1997) suffers from semantic drift as the sequence grows longer. To address sequence transduction between heterogeneous data, the sequence-to-sequence model was proposed (Sutskever et al. 2014). Taking a sequence as input, Seq2Seq encodes it into a fixed dense vector and then decodes it into another sequence. Moreover, Bahdanau applied the attention mechanism to Seq2Seq so that decoding weights are distributed over different parts of the input (Bahdanau et al. 2015). Based on Seq2Seq, text generation can be defined as next-sentence prediction conditioned on prior sentences. In most Seq2Seq applications, however, the input context is formed by directly concatenating the previous sentences. Consequently, sub-sequences far from the decoder have a weaker semantic effect on prediction.

To generate long-paragraph Chinese lyrics with high contextual relevance and consistency, in this paper we propose a hierarchical recurrent encoder (HRE) incorporated into the Seq2Seq framework. HRE extracts both sentence-level and word-level semantics from prior sentences, providing more contextual information for decoding. Moreover, an attention mechanism covering the most adjacent sentence is applied, given its closest connection with the next prediction. The rest of the article is structured as follows: Sect. 2 briefly introduces related work, Sect. 3 describes the details of our model, Sect. 4 describes the experiments on several models (with the preprocessing of the Chinese lyrics corpus in Sect. 4.1), and Sect. 5 concludes.

2 Related Work

NLG is an essential part of natural language processing (NLP). According to the modality of the input, there exist text-to-text generation, meaning-to-text generation, data-to-text generation, image-to-text generation, etc. In this paper, lyrics generation is modeled as a specific text-to-text generation task with previous sentences as input. Similar tasks, including Chinese poetry generation (Wang et al. 2016), essay generation (Feng et al. 2018) and comment generation (Tang et al. 2016), have been extensively studied. Chinese poetry generation produces hierarchical text with a strict format, typically a fixed number of sentences each containing a fixed number of words. For instance, to generate context-aware comments, Tang proposed encoding the context as a continuous semantic representation in a basic RNN model. Moreover, essay generation covering several topic words has also been demonstrated with similar methods.

Various hierarchical models have been used for generating coherent long texts. For example, Li proposed a hierarchical neural auto-encoder to build an embedding for a paragraph (Li et al. 2015). Lin presented a hierarchical recurrent neural network language model (HRNNLM) to maintain overall coherence in a document (Lin et al. 2015). Following the HRED proposed by Sordoni (2015), Serban extended the hierarchical model to improve dialogue generation with long-term contexts (Serban and Bengio et al. 2016), and later enhanced the HRED model with a latent variable at the decoder (Serban and Sordoni et al. 2016). In this work, we propose a hierarchical Seq2Seq with attention for Chinese lyrics generation to address long-term coherence.

3 Model

In this section, a hierarchical attention-based Seq2Seq model for lyrics generation is described. Original lyrics are preprocessed into a paragraph format for model training in advance. Here, a lyrics paragraph comprises a sequence of \( M \) sentences, i.e. \( P = \left\{ {S_{1} ,S_{2} , \ldots ,S_{M} } \right\} \). Each sentence \( S_{m} \) consists of a sequence of \( N_{m} \) words \( S_{m} = \left\{ {\omega_{m,1} ,\omega_{m,2} , \ldots ,\omega_{{m,N_{m} }} } \right\} \), where \( \omega_{m,n} \) represents the word at position \( n \) in sentence \( m \).

3.1 Recurrent Neural Network

A recurrent neural network (RNN) recurrently computes a vector called the recurrent state or hidden state \( h_{n} \) over a sequence of words \( \left\{ {\omega_{1} ,\omega_{2} , \ldots ,\omega_{N} } \right\} \):

$$ h_{n} = f\left( {h_{n - 1} ,\omega_{n} } \right),n \in \left( {1,N} \right),h_{0} = 0 $$
(1)

In particular, \( h_{0} \) denotes the initial state and is set to zero during training. Usually, \( h_{n} \) depends on the current word \( \omega_{n} \) and the words before the current time step. In Eq. 1, \( f \) denotes a parametrized non-linear function, such as the sigmoid, the hyperbolic tangent, long short-term memory (LSTM) or the gated recurrent unit (GRU). The hidden state loses long-range contextual information when a vanilla RNN cell such as the sigmoid or hyperbolic tangent is used. By introducing gating mechanisms, LSTM and GRU can handle longer-term contexts. Moreover, GRU requires less computation than LSTM. Thus, GRU is used as the RNN cell unit. The GRU equations are summarized as follows:

$$ z_{t} = \sigma \left( {W_{z} \omega_{t} + U_{z} h_{t - 1} } \right) $$
(2)
$$ r_{t} = \sigma \left( {W_{r} \omega_{t} + U_{r} h_{t - 1} } \right) $$
(3)
$$ \widetilde{h}_{t} = \tanh \left( {W\omega_{t} + U\left( {r_{t} *h_{t - 1} } \right)} \right) $$
(4)
$$ h_{t} = \left( {1 - z_{t} } \right)*h_{t - 1} + z_{t} *\widetilde{h}_{t} $$
(5)

In the equations above, \( \sigma \) is the logistic sigmoid, which limits the output to the range [0, 1]. \( z_{t} \) is the update gate, balancing how much of the previous state is kept versus how much of the candidate update is taken, and \( r_{t} \) is the reset gate, determining how much of the last state contributes to the candidate. The candidate update \( \widetilde{h}_{t} \) combines the current input with the portion of \( h_{t - 1} \) selected by the reset gate. The final state \( h_{t} \) interpolates between \( h_{t - 1} \) and the candidate update according to the update gate. The subscript \( t \) represents the time step.
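For illustration, the following NumPy sketch implements the GRU update of Eqs. 2–5 and the recurrence of Eq. 1. The dimensions, initialization range and helper names are assumptions for exposition, not the settings or code of this paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    def __init__(self, input_dim, hidden_dim, rng=np.random.default_rng(0)):
        scale = 0.1
        # Parameters of the update gate, the reset gate and the candidate update.
        self.Wz = rng.uniform(-scale, scale, (hidden_dim, input_dim))
        self.Uz = rng.uniform(-scale, scale, (hidden_dim, hidden_dim))
        self.Wr = rng.uniform(-scale, scale, (hidden_dim, input_dim))
        self.Ur = rng.uniform(-scale, scale, (hidden_dim, hidden_dim))
        self.W = rng.uniform(-scale, scale, (hidden_dim, input_dim))
        self.U = rng.uniform(-scale, scale, (hidden_dim, hidden_dim))

    def step(self, w_t, h_prev):
        """One GRU step: w_t is the word embedding, h_prev the previous hidden state."""
        z_t = sigmoid(self.Wz @ w_t + self.Uz @ h_prev)            # Eq. 2
        r_t = sigmoid(self.Wr @ w_t + self.Ur @ h_prev)            # Eq. 3
        h_tilde = np.tanh(self.W @ w_t + self.U @ (r_t * h_prev))  # Eq. 4
        return (1.0 - z_t) * h_prev + z_t * h_tilde                # Eq. 5

def encode(cell, embeddings, hidden_dim):
    """Run the GRU over one sentence and return the final hidden state (Eq. 1)."""
    h = np.zeros(hidden_dim)  # h_0 = 0
    for w in embeddings:
        h = cell.step(w, h)
    return h
```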

3.2 Hierarchical Recurrent Encoder

Sordoni proposed a hierarchical recurrent encoder-decoder (HRED) to predict the next web query conditioned on previous queries submitted by users (Sordoni et al. 2015). The hierarchical encoder consists of query-level and session-level encoders, which has proved very successful for web query prediction. Following this HRED work, a lyrics paragraph is considered to have a hierarchical structure of word level and sentence level, as shown in Fig. 1. At the bottom of the network, the word-level RNN encodes each sentence into a fixed dense vector; the sentence-level RNN then aggregates these vectors into a higher-level semantic representation, which is used to predict the next sentence \( S_{m + 1} \).

Fig. 1.
figure 1

The HRED model processing a Chinese lyrics paragraph with three sentences. The word-level encoder encodes each sentence into a fixed dense vector, and the sentence-level encoder maps these vectors into a paragraph representation, which is the input of the decoder. The last rhyming word is shown in bold.

Different from web queries, however, a lyrics paragraph usually contains more than ten sentences. Thus, we adapt this HRE to handle a fixed number of preceding sentences before decoding, as shown in Fig. 2. The number of sub-group sentences is denoted as \( Num \), which is a hyper-parameter. After some trial and error, \( Num \) is set to 5. Note that GRU is used as the basic RNN cell unit. Moreover, the word-level encoder and the decoder share the same parameters.

Fig. 2.
figure 2

Hierarchical Seq2Seq extending HRED.
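To make the hierarchy concrete, the following minimal sketch reuses the GRUCell and encode helpers from the previous snippet. The variable names and the single-layer cells are simplifying assumptions (the actual word-level encoder has 3 layers, see Sect. 4.2), and parameter sharing between the word-level encoder and the decoder is omitted.

```python
import numpy as np

EMB_DIM, WORD_HID, SENT_HID, NUM = 300, 1000, 1500, 5  # NUM = 5 sub-group sentences

word_cell = GRUCell(EMB_DIM, WORD_HID)       # word-level encoder (single layer here)
sentence_cell = GRUCell(WORD_HID, SENT_HID)  # sentence-level encoder

def hierarchical_encode(prior_sentences):
    """prior_sentences: a list of sentences, each a list of word-embedding vectors.
    Only the last NUM sentences before the sentence to be predicted are used."""
    context = prior_sentences[-NUM:]
    # Word level: each sentence is encoded into one fixed dense vector.
    sentence_vectors = [encode(word_cell, s, WORD_HID) for s in context]
    # Sentence level: the sequence of sentence vectors yields the paragraph state.
    h = np.zeros(SENT_HID)
    for v in sentence_vectors:
        h = sentence_cell.step(v, h)
    return h  # used as the initial state of the decoder
```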

3.3 Decoder

In the decoder, the last state of the sentence-level RNN is used as the initial state. The probability distribution over words at time step \( t \) is represented as:

$$ p(\omega_{t} |s,\omega_{1} , \ldots ,\omega_{t - 1} ) = g\left( {h_{t,dec} ,\omega_{t - 1} ,s} \right) $$
(6)

In Eq. 6, \( s \) is the last state of the sentence-level encoder. The decoder state \( h_{t,dec} \) is computed as:

$$ h_{t,dec} = f\left( {h_{t - 1,dec} ,\omega_{t - 1} ,s} \right) $$
(7)

Seq2Seq with attention was first proposed by Bahdanau et al. (2015) and has achieved great success in various NLG applications. Here, the attention mechanism is incorporated into the hierarchical model and applied to the word-level encoder. The difference between Seq2Seq with attention and conventional Seq2Seq is that the decoder uses a different context vector \( s_{t} \) at each step:

$$ h_{t,dec} = f\left( {h_{t - 1,dec} ,\omega_{t - 1} ,s_{t} } \right) $$
(8)

The context vector \( s_{t} \) is a weighted sum of the encoder hidden states \( \left\{ {h_{1,enc} ,h_{2,enc} , \ldots ,h_{{N_{m} ,enc}} } \right\} \):

$$ s_{t} = \sum\nolimits_{j = 1}^{{N_{m} }} {a_{tj} h_{j,enc} } $$
(9)

where \( a_{tj} \) is computed from the decoder hidden state \( h_{t - 1,dec} \) and each encoder hidden state in \( \left\{ {h_{1,enc} ,h_{2,enc} , \ldots ,h_{{N_{m} ,enc}} } \right\} \). As shown in Fig. 3, only the sequence of hidden states of the last sentence \( S_{m - 1} \) is used as the input of attention while predicting the next sentence \( S_{m} \), because adjacent sentences have the strongest semantic relationship. Finally, beam search is used at the inference stage.

Fig. 3.
figure 3

Hierarchical Seq2Seq with attention.
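The attention step of Eqs. 8–9, restricted to the hidden states of the most recent sentence \( S_{m-1} \), can be sketched as follows. The bilinear scoring function and the projection matrix W_a are illustrative assumptions (Bahdanau-style attention scores each encoder state with a small feed-forward network instead). The returned context vector s_t replaces the fixed context \( s \) of Eq. 7, as in Eq. 8.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(h_dec_prev, enc_states, W_a):
    """h_dec_prev: previous decoder state, shape (dec_dim,).
    enc_states: word-level hidden states of the previous sentence, shape (N_m, enc_dim).
    W_a: projection matrix, shape (dec_dim, enc_dim), used by the bilinear score."""
    scores = enc_states @ (W_a.T @ h_dec_prev)  # one score per encoder state
    a_t = softmax(scores)                       # attention weights a_tj
    s_t = a_t @ enc_states                      # Eq. 9: weighted sum of encoder states
    return s_t, a_t
```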

4 Experiments and Results

In this section, the settings of the experimental parameters are described in detail. A generic Seq2Seq model is applied as a baseline. The TensorFlow framework is used to implement the hierarchical attention-based Seq2Seq model because of its flexibility and the accumulated development experience shared in the community (Tang 2016).

4.1 Data Processing

Monolingual Chinese lyrics were collected to guarantee a consistent data structure. 100,000 pop song lyrics, familiar to most Chinese netizens, were prepared. Based on this corpus, an initial vocabulary of 7,030 words was obtained. After filtering out 1,985 low-frequency words that occur fewer than 10 times in the paragraphs, the final vocabulary size is 5,045. Additionally, three symbols were added to this vocabulary: 'unk' representing unknown words, and 'go' and 'eos' denoting the start and end of sentences. Besides, the maximum sentence length is limited to 20; sentences longer than 20 words were filtered out. Finally, the prepared corpus was divided into two parts, 90% as training data and 10% as test data.
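As a rough reproduction guide, the preprocessing above can be sketched as follows; the function and variable names are illustrative, not the code used to build this corpus.

```python
from collections import Counter

MIN_COUNT, MAX_LEN, TRAIN_RATIO = 10, 20, 0.9
SPECIALS = ['unk', 'go', 'eos']  # unknown word, start of sentence, end of sentence

def build_vocab(paragraphs):
    """paragraphs: list of paragraphs, each a list of tokenized sentences (word lists)."""
    counts = Counter(w for p in paragraphs for s in p for w in s)
    kept = sorted(w for w, c in counts.items() if c >= MIN_COUNT)  # drop low-frequency words
    return SPECIALS + kept

def filter_long_sentences(paragraphs):
    """Drop sentences longer than MAX_LEN words."""
    return [[s for s in p if len(s) <= MAX_LEN] for p in paragraphs]

def split_corpus(paragraphs):
    """90% training data, 10% test data."""
    cut = int(len(paragraphs) * TRAIN_RATIO)
    return paragraphs[:cut], paragraphs[cut:]
```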

4.2 Parameters Setting

We use word embeddings of dimension 300 to represent the words. Specifically, the word embeddings are defined as trainable parameters, which are fine-tuned as training progresses. The word-level encoder has 1,000 hidden units. To keep the generated sentences on the same topic and to capture complex topics and emotions, we set the dimensionality of the sentence-level encoder and the decoder to 1,500. Moreover, the word-level encoder has 3 layers to ensure the model can encode complex lyric sentences, while the sentence-level encoder and the decoder have 1 layer each. Finally, the beam width k is set to 5. All parameters are randomly initialized within the range [−0.5, 0.5]. They are trained to minimize the cross-entropy loss with the Adam optimizer (Kingma and Ba 2015). We set the mini-batch size to 256. We train the model until the loss reaches a minimum that is not improved upon within the following three epochs.
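The stopping rule can be read as early stopping with a patience of three epochs. A framework-agnostic sketch is given below, where train_one_epoch and the hyper-parameter names are placeholders, not the TensorFlow code of this paper.

```python
HPARAMS = dict(emb_dim=300, word_enc_units=1000, word_enc_layers=3,
               sent_enc_units=1500, dec_units=1500, num_sentences=5,
               beam_width=5, batch_size=256, init_range=0.5)

def train(train_one_epoch, max_epochs=100, patience=3):
    """train_one_epoch: a callable running one epoch of Adam updates and
    returning the average cross-entropy loss. Training stops once the loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss, stale_epochs = float('inf'), 0
    for epoch in range(max_epochs):
        loss = train_one_epoch()
        if loss < best_loss:
            best_loss, stale_epochs = loss, 0
        else:
            stale_epochs += 1
        if stale_epochs >= patience:
            break  # loss has not improved in the last three epochs
    return best_loss
```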

4.3 Evaluation Metrics

Human Evaluation

Nine Chinese experts, graduate students majoring in Music, are asked to evaluate the performance of our model. They mark generated lyrics samples from three aspects: Topic Relevance, Fluency and Semantic Coherence, each scored from 1 to 5. 5,000 lyrics paragraphs are randomly generated for them to score.

BLEU

Additionally, we use the Bilingual Evaluation Understudy (BLEU) as our automatic evaluation metric (Papineni et al. 2002). BLEU is widely used for evaluating machine translation. In this paper, the test dataset is used as the reference ground truth for automatic evaluation.
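For reference, corpus-level BLEU can be computed with NLTK as sketched below; the paper does not specify its BLEU implementation, n-gram weights or smoothing, so those choices here are assumptions.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu(references, hypotheses):
    """references: one reference (token list) per hypothesis; hypotheses: token lists.
    Tokens may be Chinese words or characters, depending on the segmentation used."""
    smooth = SmoothingFunction().method1  # smoothing avoids zero scores on short lyrics
    return corpus_bleu([[ref] for ref in references], hypotheses,
                       smoothing_function=smooth)
```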

4.4 Experimental Results

Table 1 shows the final results of the human evaluation of the different models. The basic Seq2Seq model exhibits the worst performance since it only considers adjacent sentences and cannot maintain long-term semantic coherence. In comparison, the hierarchical Seq2Seq model boosts the performance in terms of “Topic Relevance” and “Semantic Coherence”. The main reason is that the hierarchical model is able to retain higher-level semantics thanks to the sentence-level encoding. However, the poorer performance of the hierarchical model in “Fluency” is attributed to the omission of word-level encoding. The hierarchical Seq2Seq with attention performs best in all three respects: the attention mechanism lets the model directly connect the semantic relationship between adjacent sentences while retaining higher-level contextual information.

Table 1. Averaged scores of different models for lyrics text generation.

To make the evaluation more objective, we also show the BLEU results in Table 2. The BLEU results show the same trend as the human evaluation: the hierarchical Seq2Seq performs better than the Seq2Seq model, and the hierarchical Seq2Seq with attention performs better than the hierarchical Seq2Seq. Compared with other areas of text generation such as machine translation, the BLEU scores are low. The reason is that generated lyrics can use different word combinations to express the same meaning, whereas each word of a text to be translated often has a unique correct counterpart. Finally, a sample of generated lyrics is given in Table 3. The underlined and bold Chinese characters at the end of the sentences are rhyming.

Table 2. BLEU scores of different models.
Table 3. Example of generated lyrics. The blue text is the lyrics generated by the hierarchical Seq2Seq model with attention. Note that the first line “Homeland” is the title of the lyrics.

5 Conclusions

In this paper, we propose a novel hierarchical Seq2Seq model with attention for Chinese lyrics generation. A large-scale Chinese lyrics corpus is leveraged for model training. Results of the human and BLEU evaluations demonstrate the effectiveness of the model, owing to its sentence-level semantic encoding and its attention to adjacent sentences. Moreover, this hierarchical encoder method offers a promising approach to context fusion for other NLG applications.