1 Introduction

For many centuries, music has been used to express life experiences such as personal emotions, stories, love and hope. Besides rhythm, lyrics also play an important role in music, and each lyricist has their own patterns and styles. Automating lyric generation therefore faces the challenge of producing lines that are meaningful and semantically related to a given scenario. Traditional automatic lyric generation systems typically follow a fixed pipeline: defining keywords, choosing a template and generating lyric lines word by word using ontologies or other hand-crafted rules. Such systems ignore the patterns and styles of individual lyricists and suffer from improper lyric construction and poor style preservation.

To learn the patterns and styles of Chinese lyricists automatically and generate lyrics with full contextual support, a Chinese lyric generation system is built with state-of-the-art deep learning algorithms. A long short-term memory (LSTM) encoder-decoder network is utilized to process each lyric line and generate the next line word by word. A hierarchical context attention model is designed to capture contextual information at both the sentence and document level by learning a high-level contextual representation of every lyric line and of the entire lyric document. All the contextual attentions are then fed into the LSTM decoder to generate lyrics with contextual support automatically.

During training, a lyric with multiple lines is fed into the proposed model at a time so that the statistical patterns and styles of a given lyricist are captured in the network memory. For lyric generation, the user inputs the first lyric line in Chinese, and the system generates the following lines related to that scenario, line by line and word by word. At each generation step, the contextual attention over all previously generated lyric lines is computed and used to generate the next line, which ends at the end token. Compared with previous research, the proposed system does not rely on additional techniques and resources such as ontologies, rhymes, templates or word frequencies. Furthermore, it is easier to implement and captures the contextual information of lyrics better than state-of-the-art methods.

2 Related works

A multitude of methods for automatic lyric generation have been proposed in recent years. Most traditional methods are based on keywords, rhymes and templates, and try to capture the semantics of lyrics through rule-based constraints. Barbieri et al. [1] utilized a constrained Markov process and rhyme templates to generate lyrics with style. Rajeswari et al. [2] proposed an ontology-based method for interpreting the semantics of lyrics and generating Tamil lyrics with an n-gram model. Karteek et al. [3] used an unsupervised Hidden Markov Model (HMM) to identify rhyme schemes in hip hop lyrics. These methods rely on many rules and constraints, which makes them costly to construct and maintain. Watanabe et al. [4, 5] also proposed an HMM-based topic transition model for lyric generation. All the above methods operate on the surface forms of words or characters, with little deep understanding of the meaning of a poem or lyric.

Recent works have shown the effectiveness of deep learning [6] methods for automatic text generation. Mikolov et al. [7] showed that recurrent neural network (RNN) based language models outperform standard back-off n-gram models. Sutskever et al. [8] used recurrent neural networks to generate text at the character level and showed that they learn the grammatical and punctuation rules of language. Such RNN models have recently been widely used for generating lyrics and poetry. Potash et al. [9] proposed a rap lyric generation system using an LSTM language model [10]. Zhang et al. [11] used recurrent context and generation modules to capture context information and generate Chinese poetry. These methods are quite similar and still lack document-level contextual support. An encoder-decoder network (also known as a sequence-to-sequence model) [12] takes in a sequence of words and generates another word sequence, and has recently been widely used in neural machine translation [13]. The attention mechanism proposed by Luong et al. [14] is highly effective at capturing the contextual attention of sentences and is now widely used in many deep learning approaches. Hu et al. [15] proposed a neural generative model that combines variational auto-encoders (VAEs) and holistic attribute discriminators to impose semantic structure and generate plausible text sentences whose attributes are controlled by learning disentangled latent representations with designated semantics.

Compared with previous research works, our proposed method extends the attention mechanism to a higher document level, combining with the LSTM encoder and decoder network to generate lyric lines automatically. With the extended contextual attention model, the system can capture both the semantic meanings of sentences and the entire generated lyric document. The LSTM decoder decodes all the semantic contextual information into lyric lines word by word, which shows great advantage for the lyric generation.

3 Model structure

This section describes the detailed construction of the hierarchical context attention based lyric generation model, shown in Fig. 1. There are five key components in our model: an LSTM sentence encoder (with a built-in word embedding layer), a sentence-level context attention model, an LSTM lyric encoder, a hierarchical lyric-level context attention model and an LSTM sentence decoder.

Fig. 1 The structure of the lyric generation model
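
To make the overall structure concrete, a minimal PyTorch-style skeleton of these five components is sketched below; the module layout, layer sizes and parameter names are illustrative assumptions rather than the authors' exact implementation, and the attention steps are filled in by Sections 3.2-3.4.

```python
import torch
import torch.nn as nn

class HierarchicalLyricModel(nn.Module):
    """Skeleton of the five components; the attention computations follow in 3.2-3.4."""
    def __init__(self, vocab_size=10000, emb_dim=200, hidden_dim=200, num_layers=2):
        super().__init__()
        # word embedding layer built into the sentence encoder/decoder
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # (1) LSTM sentence encoder: words of one lyric line -> hidden states
        self.sentence_encoder = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        # (2) sentence-level context attention: hidden states -> sentence vector (Section 3.2)
        # (3) LSTM lyric encoder: sentence vectors of former lines -> document hidden states
        self.lyric_encoder = nn.LSTM(hidden_dim, hidden_dim, num_layers, batch_first=True)
        # (4) lyric-level context attention: document hidden states -> lyric vector (Section 3.3)
        # (5) LSTM sentence decoder: [lyric ctx : sentence ctx : word embedding] -> next word
        self.sentence_decoder = nn.LSTM(2 * hidden_dim + emb_dim, hidden_dim,
                                        num_layers, batch_first=True)
        self.output_bias = nn.Parameter(torch.zeros(vocab_size))  # bias of the tied output layer (16)
```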

3.1 LSTM sentence encoder

The LSTM sentence encoder is designed to encode the input words (i.e. one lyric line), denoted by the embedding of the compositional words, into a sequence of hidden states. The LSTM uses a gating mechanism to track the states of the sequence as context information. There are four key components in an LSTM cell: an input gate \(i_t\), a forget gate \(f_t\), a self-recurrent cell state \(C_t\) and an output gate \(o_t\), which together control how the state is updated. At time \(t\), the values of \(i_t\), \(\widetilde{C_t}\), \(f_t\) and \(C_t\) are computed as follows:

$$ i_{t}=\sigma (W_{i} x_{t}+U_{i} h_{t-1}+b_{i}) $$
(1)
$$ \widetilde{C_{t}}=tanh(W_{c} x_{t}+U_{c} h_{t-1}+b_{c}) $$
(2)
$$ f_{t}=\sigma (W_{f} x_{t}+U_{f} h_{t-1}+b_{f}) $$
(3)
$$ C_{t}=i_{t}*\widetilde{C_{t}}+f_{t}*C_{t-1} $$
(4)

Here \(W\), \(U\) and \(V\) represent the weight matrices of the LSTM network, and \(b\) denotes the bias vectors. The output value \(o_t\) and hidden state \(h_t\) are then computed as follows:

$$ o_{t}=\sigma (W_{o} x_{t}+U_{o} h_{t-1}+V_{o} C_{t}+b_{o}) $$
(5)
$$ h_{t}=o_{t}*tanh(C_{t}) $$
(6)

The hidden states and output values of the LSTM sentence encoder represent how much context information each word contributes to the whole sentence.
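
As a minimal sketch, the step below maps equations (1)-(6) directly to code; the parameter dictionary p and its key names are assumptions made for illustration, and in practice a library cell such as PyTorch's nn.LSTMCell (which omits the \(V_{o}C_{t}\) term) would normally be used.

```python
import torch

def lstm_cell_step(x_t, h_prev, C_prev, p):
    """One LSTM step following (1)-(6); p maps names like "W_i" to weight/bias tensors."""
    i_t = torch.sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])      # (1) input gate
    C_tilde = torch.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])     # (2) candidate state
    f_t = torch.sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])      # (3) forget gate
    C_t = i_t * C_tilde + f_t * C_prev                                      # (4) cell state update
    o_t = torch.sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev
                        + p["V_o"] @ C_t + p["b_o"])                        # (5) output gate with V_o C_t
    h_t = o_t * torch.tanh(C_t)                                             # (6) hidden state
    return h_t, C_t, o_t
```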

3.2 Sentence-level context attention model

The attention mechanism provides a general way to combine all the hidden states of a sentence into a contextual attention vector. A sentence \(i\) with words \(w_{it}\), \(t \in [1,T]\), is first embedded into a sequence of word vectors using an embedding matrix \(W_e\) and then fed into the LSTM sentence encoder to obtain the hidden states. The computation can be summarized as follows, where the LSTM calculation follows Section 3.1:

$$ x_{it}=W_{e}w_{it} $$
(7)
$$ h_{it}=LSTM_{h}(x_{it}) $$
(8)

The contextual attention vector \(s_i\) can be computed as follows:

$$ s_{i} = {\sum\limits_{t=1}^{T}}\alpha_{it}h_{it} $$
(9)

Here \(\alpha_{it}\) is the weight of each hidden state, derived by comparing the word embedding with the corresponding hidden state, and is computed as follows:

$$ \alpha_{it} = \frac{e^{h_{it}^{\top}x_{it}}}{{\sum}_{t'=1}^{T}e^{h_{it'}^{\top}x_{it'}}} $$
(10)

The contextual attention vectors of each sentence are calculated with the above equations and then fed into the lyric-level context attention model.
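
A hedged sketch of (7)-(10) is given below, assuming the attention score is the dot product \(h_{it}^{\top}x_{it}\) between a hidden state and its word embedding, which requires the embedding and hidden dimensions to match; the function name and the toy sizes are illustrative.

```python
import torch
import torch.nn as nn

def sentence_context_vector(word_ids, embedding, encoder):
    """Compute the sentence attention vector s_i of (7)-(10) for one lyric line."""
    x = embedding(word_ids.unsqueeze(0))            # (7) word embeddings, shape (1, T, emb)
    h, _ = encoder(x)                               # (8) hidden states, shape (1, T, hid)
    scores = (h * x).sum(dim=-1)                    # h_it^T x_it for each word (emb == hid assumed)
    alpha = torch.softmax(scores, dim=-1)           # (10) attention weights over the T words
    s_i = (alpha.unsqueeze(-1) * h).sum(dim=1)      # (9) weighted sum, shape (1, hid)
    return s_i

# Example usage with toy sizes
emb = nn.Embedding(1000, 200)
enc = nn.LSTM(200, 200, num_layers=2, batch_first=True)
print(sentence_context_vector(torch.tensor([4, 17, 256, 9]), emb, enc).shape)  # torch.Size([1, 200])
```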

3.3 Hierarchical lyric-level context attention model

With the contextual attention vector and hidden states of a sentence, the next sentence can be generated word by word using the LSTM sentence decoder. However, the generated sentence only has the contextual information of its former input sentence. Since the system always needs to generate multiple lines of lyrics, obtaining the contextual information of every generated sentence would be very useful for the entire lyric generation process.

The contextual attention vector at the lyric document level can be computed using the LSTM lyric encoder, which takes in the attention vectors of the former sentences and produces a hidden state and output value for each of them. For the generation of sentence \(m\), the former \(m-1\) sentences are used: the sentence-level attention vector of line \(m-1\) and the document-level attention vector of the first \(m-2\) generated lines. The contextual attention vector of the former \(m-2\) generated lyric lines is computed as follows, analogously to the sentence-level computation.

$$ h_{i} = LSTM_{h}(s_{i}), i \in [1, m-2] $$
(11)
$$ \alpha_{i} = \frac{e^{h_{i}^{\top}s_{i}}}{{\sum}_{i'=1}^{m-2}e^{h_{i'}^{\top}s_{i'}}} $$
(12)
$$ d_{m-2} = \sum\limits_{i=1}^{m-2}\alpha_{i}h_{i} $$
(13)

At every generation step, the lyric-level context attention vector of the former lyric lines is calculated and then used to predict the next lyric line.
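
The document-level computation (11)-(13) can be sketched in the same way; the function below reuses the dot-product attention of Section 3.2, and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def lyric_context_vector(sentence_vectors, lyric_encoder):
    """Lyric-level attention vector d_{m-2} of (11)-(13); input shape (1, m-2, hid)."""
    h, _ = lyric_encoder(sentence_vectors)              # (11) hidden state per former line
    scores = (h * sentence_vectors).sum(dim=-1)         # h_i^T s_i for each former line
    alpha = torch.softmax(scores, dim=-1)               # (12) attention weights
    d = (alpha.unsqueeze(-1) * h).sum(dim=1)            # (13) lyric-level context vector
    return d

# Example: attention over the context vectors of 5 previously generated lines
lyric_enc = nn.LSTM(200, 200, num_layers=2, batch_first=True)
print(lyric_context_vector(torch.randn(1, 5, 200), lyric_enc).shape)  # torch.Size([1, 200])
```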

3.4 LSTM sentence decoder

Given an input vector, the LSTM sentence decoder is used to generate a lyric line word by word. To make full use of the attention context, the input vector is the concatenation of the lyric-level context attention of the last \(m-2\) sentences \(d_{m-2}\), the context attention vector of the former sentence \(s_{m-1}\), and the embedding vector of the last generated word \(x_{mt}\). The hidden states and output values of the LSTM decoder are computed using the LSTM cells:

$$ h_{mt} = LSTM_{h}([d_{m-2} : s_{m-1} : x_{mt}]) $$
(14)
$$ o_{mt} = LSTM_{o}([d_{m-2} : s_{m-1} : x_{mt}]) $$
(15)

A fully connected layer (16) maps each output of the LSTM decoder into a vector \(u_{mt}\) of vocabulary size \(C\), and a softmax classifier (17) normalizes it over the vocabulary to predict the most appropriate word.

$$ u_{mt} = W_{e}^{\top}o_{mt}+b_{e} $$
(16)
$$ p_{mt,j} = \frac{e^{u_{mt,j}}}{{\sum}_{k=1}^{C}e^{u_{mt,k}}} $$
(17)

As shown in (16), the weight matrix of the fully connected layer is tied to the embedding matrix, which maps the output vector back to word indices and performs better than training this layer separately.
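
The decoder step (14)-(17) together with the weight tying can be sketched as follows; the class and method names are illustrative, and a single LSTM layer is assumed to provide both the hidden state and the output value.

```python
import torch
import torch.nn as nn

class TiedSentenceDecoder(nn.Module):
    """Sketch of the decoder step in (14)-(17) with output weights tied to the embedding."""
    def __init__(self, vocab_size=1000, emb_dim=200, hidden_dim=200, num_layers=2):
        super().__init__()
        assert emb_dim == hidden_dim, "weight tying requires matching dimensions"
        self.embedding = nn.Embedding(vocab_size, emb_dim)             # W_e
        self.lstm = nn.LSTM(2 * hidden_dim + emb_dim, hidden_dim,
                            num_layers, batch_first=True)
        self.bias = nn.Parameter(torch.zeros(vocab_size))              # b_e in (16)

    def step(self, d_prev, s_prev, word_id, state=None):
        x = self.embedding(word_id).view(1, 1, -1)                     # last generated word x_mt
        inp = torch.cat([d_prev.view(1, 1, -1),                        # lyric-level context d_{m-2}
                         s_prev.view(1, 1, -1), x], dim=-1)            # [d_{m-2} : s_{m-1} : x_mt]
        o, state = self.lstm(inp, state)                               # (14)-(15)
        logits = o.squeeze(0).squeeze(0) @ self.embedding.weight.t() + self.bias  # (16)
        probs = torch.softmax(logits, dim=-1)                          # (17) over the vocabulary
        return probs, state

dec = TiedSentenceDecoder()
probs, st = dec.step(torch.randn(200), torch.randn(200), torch.tensor(3))
print(probs.shape, probs.sum())   # torch.Size([1000]), sums to ~1.0
```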

4 Datasets and experiments

4.1 Datasets

The experiments are carried out on the lyrics of four famous Chinese artists: Eason Chan, Jay Chou, Teresa Teng and Faye Wong. The number of song lyrics and total lines for each artist are shown in Table 1.

Table 1 The number of song lyrics used from each singer

4.2 Comparison models

To verify the effectiveness of the proposed hierarchical attention based model and to offer fair comparisons between models, a series of experiments is conducted on these datasets.

Lyric-No-Attention (LNA)

A basic LSTM encoder-decoder network without attention. The model takes the previous lyric line as input and generates the next line word by word. No attention mechanism is used to obtain contextual attention.

Lyric-Single-Attention (LSA)

This model is similar to the LNA model, but a sentence-level attention is used to capture the contextual attention of the input sentence, as described in Section 3.2.

Lyric-Hierarchical-Attention (LHA)

The proposed hierarchical attention based model processes multiple lines of a lyric at a time and generates the next line. The sentence-level attention (Section 3.2) and the hierarchical lyric-level attention (Section 3.3) are both used to obtain sentence attention vectors of the lyric lines and a lyric attention vector over these lines.

4.3 Experimental settings

Each lyric document consists of multiple lyric lines, and each line contains multiple words. Chinese lyric preprocessing presents several difficulties. For some Chinese artists, other languages such as English, Korean and Japanese also appear in the lyrics, which may hurt the performance of the model. Most non-Chinese words are removed using regular expressions, and if more than half of the words of a lyric are from other languages, the whole lyric is removed from the dataset. Some words are written in traditional Chinese and are converted to simplified Chinese to reduce the vocabulary size. Furthermore, Chinese text contains no delimiters between words, so every lyric line is segmented into words using the jieba segmenter. All datasets are randomly shuffled and 10% of each dataset is held out for testing.
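
A rough sketch of this preprocessing is shown below, assuming the jieba and opencc Python packages; the regular expressions and the per-line language check are simplifications of the lyric-level rule described above.

```python
import re
import jieba                   # Chinese word segmentation
from opencc import OpenCC      # traditional -> simplified Chinese conversion

cc = OpenCC("t2s")
CHINESE = re.compile(r"[\u4e00-\u9fff]")
FOREIGN = re.compile(r"[A-Za-z\u3040-\u30ff\uac00-\ud7af]")   # Latin, Japanese kana, Korean hangul

def preprocess_line(line):
    """Clean one lyric line and segment it into a word sequence (None = drop the line)."""
    line = cc.convert(line.strip())                            # normalize to simplified Chinese
    if len(CHINESE.findall(line)) < len(FOREIGN.findall(line)):
        return None                                            # dominated by other languages
    line = FOREIGN.sub("", line)                               # strip remaining foreign characters
    return jieba.lcut(line)                                    # segment into a word sequence
```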

Various values of the embedding dimension (50, 100, 150, 200), number of hidden units (50, 100, 150, 200) and number of LSTM layers (1, 2, 4) are tested to find the best parameters for the models. Cross-validation over all datasets shows that larger embedding dimensions and more hidden units lead to smaller validation losses, so both parameters are set to 200, and the number of stacked LSTM layers is set to 2.

No pre-trained word vectors are used in any model. To learn dataset-specific word vectors from scratch, an embedding layer in the sentence encoder and decoder maps each word index to a fixed-size dense vector. The weights of the embedding layer are randomly initialized and trained jointly with the rest of the model, so the word vectors are obtained after training on each dataset.
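
These settings can be summarized in a short sketch, again assuming a PyTorch implementation; the vocabulary size shown is a placeholder, since it differs across datasets.

```python
import torch.nn as nn

EMB_DIM, HIDDEN, LAYERS, VOCAB = 200, 200, 2, 10000   # VOCAB is dataset-dependent (placeholder)

# Randomly initialized embedding layer, trained jointly with the LSTM weights;
# no pre-trained word vectors are loaded.
embedding = nn.Embedding(VOCAB, EMB_DIM)
sentence_encoder = nn.LSTM(EMB_DIM, HIDDEN, num_layers=LAYERS, batch_first=True)
# After training, embedding.weight holds the dataset-specific word vectors.
```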

4.4 Evaluation metrics

To offer an objective and quantitative evaluation of the proposed model, two evaluation metrics, similar to those of Kawthekar et al. [16], are adopted to evaluate the performance of the models on all datasets. These metrics are described below:

Loss

The cross-entropy loss is calculated for both training and test sets. The test losses of different epochs of training are recorded to draw the loss curves of different models. The calculation of cross entropy for each sample is shown in (18).

$$ H = -\frac{1}{T}{\sum\limits_{t=1}^{T}}{\sum\limits_{i=1}^{C}}y_{ti}\ln\hat{y}_{ti} $$
(18)

Here, \(T\) is the length of the output sequence, \(C\) is the vocabulary size, and \(y\) and \(\hat{y}\) represent the true and predicted values of the output sequence respectively.

Perplexity

Perplexity is a measurement of how well a model predicts a sample. A low perplexity indicates that the probability distribution is good at predicting the sample. It is a common evaluation metric for text generation tasks. The perplexity of the generated lyrics is calculated as \(e^{H}\), where \(H\) is the cross-entropy loss.
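
Both metrics can be computed as in the short sketch below, assuming a PyTorch implementation; the toy logits and targets are placeholders.

```python
import math
import torch
import torch.nn.functional as F

def loss_and_perplexity(logits, targets):
    """Cross-entropy (18) averaged over the T output steps, and perplexity e^H."""
    H = F.cross_entropy(logits, targets)        # logits: (T, C), targets: (T,) true word indices
    return H.item(), math.exp(H.item())

logits = torch.randn(12, 5000)                  # toy sequence of 12 steps, vocabulary of 5000
targets = torch.randint(0, 5000, (12,))
H, ppl = loss_and_perplexity(logits, targets)
print(H, ppl)                                   # an untrained model gives perplexity near the vocabulary size
```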

4.5 Experimental results

After training for 20 epochs on each dataset, the losses and text perplexities of each model are recorded for detailed comparison.

4.5.1 Loss curves

The loss curves of the three models on the four datasets are shown in Fig. 2. On all datasets, the attention-based methods achieve smaller losses than the baseline model without attention, indicating better performance. The proposed LHA model achieves the lowest loss on all datasets. By capturing the context attention of both the sentences and the entire document, the proposed hierarchical context attention model obtains the best results; in other words, the proposed method is effective at capturing the semantic context of lyrics and generating proper lines.

Fig. 2 Loss curves of three models on four test datasets

Meanwhile, the LHA model converges at about 10 to 15 epochs, slightly faster than the baseline models. The slight increase of the test losses afterwards indicates overfitting with further training.

4.5.2 Perplexity

The perplexities of the three models on all datasets are shown in Table 2. The proposed LHA model achieves the lowest perplexity on every dataset, indicating the best performance compared with the baseline models.

Table 2 Text perplexities of different models on all datasets

A lower text perplexity means that the model infers the generated words more accurately. Models with high perplexity appear to generate lyric lines by randomly selecting words from the vocabulary, lacking the patterns and styles of the artists. Among all datasets, the perplexities on Eason Chan are much higher than on the other artists. One main reason is that this dataset has many more lyric lines than the others, which degrades the performance of all models. Another key reason is that most of Eason Chan's lyrics are written by different lyricists, so his pattern and style are not consistent enough to learn. In contrast, the songs of Teresa Teng mostly deal with love and friendship, which are much easier to learn; this explains why the models achieve the best results on her dataset.

5 Lyric generation line by line

The workflow of the lyric generation process of the LHA model is shown in Fig. 3. To generate lyric lines reflecting human intention, a starting Chinese sentence (s) and the number of lines to be generated (n) are input to the generative model. The input sentence is first segmented into a word sequence and then embedded as word vectors. The generation process takes in the sequential word vectors and generates a lyric line word by word until it reaches the end token. The generation of the next line is based on the input and contextual attention of the former line, as well as the contextual attention of the entire lyric document. After each line is generated, n is decreased by one; when n reaches zero, the generation process stops and the generated n lines of lyrics are returned as the final result.

Fig. 3 The workflow of the lyric generation process for the LHA model
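
The loop below restates this workflow in code form; the model object and its helper methods (segment, sentence_attention, lyric_attention, next_word) and the start/end tokens form a hypothetical interface used only for illustration, not the authors' exact API.

```python
def generate_lyrics(first_line, n, model, end_token="<eos>", max_len=30):
    """Generate n lyric lines following the workflow in Fig. 3 (hypothetical model API)."""
    lines = [model.segment(first_line)]                  # segment and keep the user's first line
    while n > 0:
        s_prev = model.sentence_attention(lines[-1])     # attention vector of the former line
        d_prev = model.lyric_attention(lines[:-1])       # attention vector of all earlier lines
        word, new_line = model.start_token, []
        while len(new_line) < max_len:
            word = model.next_word(d_prev, s_prev, word) # decoder step (Section 3.4)
            if word == end_token:
                break
            new_line.append(word)
        lines.append(new_line)
        n -= 1                                           # one fewer line left to generate
    return lines[1:]                                     # the generated lines
```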

After training for a suitable number of epochs on each artist, the generative model is able to produce coherent lyric lines. Tables 3, 4, 5 and 6 show ten generated lines for each singer. For example, the model trained on Eason Chan's lyrics generates lines (Table 3) that sound mature and sad about love, often looking back at the past. For Jay Chou, the generated lines seem young, sweet and mysterious, matching the singer's style. For Teresa Teng, sweet love and missing someone are the major themes of the generated lyrics. For Faye Wong, the generated lines express loneliness and the pursuit of love, as if she had been hurt in most relationships. Importantly, all the generated lines for each singer share similar styles and patterns, which demonstrates the effectiveness of the hierarchical context attention model.

Table 3 The generated ten lines for Eason Chan
Table 4 The generated ten lines for Jay Chou
Table 5 The generated ten lines for Teresa Teng
Table 6 The generated ten lines for Faye Wong

6 Conclusion

To meet the challenge of generating lyrics that are meaningful and semantically related to a certain scenario, a Chinese lyric generation system is constructed to learn the patterns and styles of particular lyricists and generate lyrics automatically. A long short-term memory (LSTM) encoder-decoder network encodes each lyric line into a higher-level representation and generates the next line word by word, showing great power in processing sequential text. A hierarchical context attention model captures contextual information at both the sentence and document level by learning high-level contextual representations of every lyric line and of the entire lyric document. Experimental results show that the proposed hierarchical context attention model achieves lower test losses and text perplexities than the baseline models. The proposed method is highly effective in the lyric generation process, gaining a deep understanding of the lyric document and generating the target lyric line with full context support.

Although the proposed lyric generation system has achieved promising results, it is still insufficient in aspects such as diversity and stability, and there is still a long way to go to reach the standard of professional artists. In recent years, deep reinforcement learning and generative adversarial networks have shown great promise in natural language processing and could be further studied for automatic Chinese lyric generation.