
1 Introduction

Music has long been a powerful medium for emotional expression and communication [16]. The emotional response that music elicits has been studied by scholars from various fields such as psychology [19], musicology [15], and neuroscience [17]. Especially with the advent of deep learning, there has been an increasing interest in developing machine learning algorithms to automatically analyze and generate music that can evoke specific emotions in listeners [3].

Symbolic music, often referred to interchangeably as MIDI (Musical Instrument Digital Interface), is represented as a sequence of notes and is a popular choice for machine learning models due to its compact and structured representation. Large raw MIDI datasets [30, 31] enable unsupervised training of deep neural networks to automatically generate symbolic music. Similar to language modeling, these networks learn to predict the next token, i.e., the next note, and at inference time generate output autoregressively, one token at a time.

However, a human composer’s creative process does not simply involve mechanically writing one note after another; it often includes high-level concepts such as motifs, themes and, ultimately, emotions [24]. To train deep neural networks to generate music based on emotions, large datasets of symbolic music annotated with emotional labels are required. Although there are some publicly available datasets with emotional labels, they are relatively small and do not cover a wide range of emotional states [33].

To address this issue, we present a new large-scale emotion-labeled symbolic music dataset created by analyzing song lyrics. Our approach leverages the natural connection between lyrics and music, established through emotions. To this end, we first trained models for emotion classification from text on GoEmotions [5], one of the largest emotion-annotated text datasets, covering 28 fine-grained emotion labels. Using a model half the size of the baseline, we obtained state-of-the-art results on this dataset. We then applied these models to the lyrics of songs from two of the largest available MIDI datasets, namely the Lakh MIDI dataset [30] and the Reddit MIDI dataset [31]. The result is a symbolic music dataset of 12 k MIDI songs labeled with fine-grained emotions. We hope that this dataset will encourage further research in affective algorithmic composition and contribute to the development of intelligent music systems that can understand and evoke specific emotions in listeners.

The remainder of this paper is structured as follows: after introducing our aim and overall results in Sect. 1, Sect. 2 presents the current state of the art on the topics most relevant to this work, namely text emotion classification and existing emotion-labeled symbolic music datasets. Section 3 describes the proposed solution and all the implemented steps, while the results are presented and discussed in Sect. 4. Finally, we point out possible future work and conclude in Sect. 5.

2 Related Work

2.1 Text Emotion Classification

Emotion classification from text—or sentiment analysis, as used interchangeably in the machine learning literature—allows us to automatically identify and/or quantify the emotion expressed in a piece of text, such as a review, social media post, or customer feedback [23]. Identifying the underlying emotion in text is useful in various fields such as customer service [10], finance [25], politics [14], and entertainment [1].

Machine learning methods have significantly advanced the state of the art in text emotion classification over the past two decades. The earliest works in this field relied on hand-crafted features, such as frequently used n-grams [27] or adjectives and adverbs associated with particular emotions [35]. The advent of deep learning later made it computationally feasible to process raw inputs without manual feature extraction, leading to better performance [18]. Recurrent Neural Networks and their improved variants, such as Long Short-Term Memory, were initially used [22] but were later replaced by the transformer model [34], which underlies the current state of the art in natural language processing (NLP) tasks.

Fine-tuning pretrained models on specific tasks has been shown to produce better performance. The GPT (generative pretraining) model is a large transformer that was pretrained on the task of next token prediction and then was fine-tuned on specific NLP tasks, resulting in state-of-the-art performance [29]. The BERT (Bidirectional Encoder Representations from Transformers) model improved upon these results by employing masked token prediction as its pretraining task [6].

2.2 Emotion-Labeled Symbolic Music Datasets

MIDI (Musical Instrument Digital Interface) is a symbolic music format widely used to represent musical performances and compositions in the digital domain. MIDI files contain only the musical information, such as the notes, tempo, and dynamics, without the sound itself, like a “digital music sheet”. Compared to audio formats, MIDI files have a smaller size and dimensionality, which makes them more manageable and suitable for modeling with deep neural networks [3].

The majority of existing literature on symbolic music generation relies on a non-conditional approach. In other words, these methods are trained on raw MIDI data without any explicit labels, allowing them to generate new music that is similar to the examples in the training dataset [12]. Some approaches, however, leverage low-level features within the data to create music in a conditional way [11]. For instance, they might use short melodies, chords, or single-instrument tracks as a basis for generating corresponding melodies. While such methods could be considered “conditional”, they do not make use of specific labels and are thus unable to capture high-level factors such as emotions or genres.

Using emotion as the specific high-level condition gives rise to the field of “affective algorithmic composition” (AAC) [36]. However, the development of machine learning AAC models is currently limited by the lack of large-scale symbolic music datasets with emotion labels. Existing datasets include VGMIDI, which contains 204 piano-based video game soundtracks with continuous valence and arousal labels [8], the dataset by Panda et al. [26], which includes 193 samples with discrete emotion labels, and EMOPIA, which consists of 387 piano-based pop songs with four emotion labels [13]. Unfortunately, due to their small sizes, these datasets are insufficient for training deep neural networks with millions of parameters. Sulun et al. addressed this issue by labeling 34 k samples with continuous valence and arousal labels [33]. Though initially designed for audio samples, these labels were matched to their corresponding MIDI files to train emotion-based symbolic music generators that produced output music with emotional coherence. While that study exploited the correspondence between audio and symbolic music, the correspondence between lyrics and symbolic music has not yet been used to obtain high-level semantic labels.

3 Methodology

This section outlines the steps we followed to achieve our goal of creating a symbolic music dataset with emotion labels. Specifically, we begin by describing the model utilized for emotion classification, followed by a discussion of the training process, and conclude with an overview of how the model was applied to song lyrics to extract the corresponding emotion labels.

3.1 Model

We employ DistilBERT [32] as the backbone of our model; it is a condensed and compressed variant of the BERT (Bidirectional Encoder Representations from Transformers) model [6], obtained through knowledge distillation [4, 9]. DistilBERT uses fewer layers than BERT and learns from BERT’s outputs to mimic its behavior. Our model consists of 6 layers, each with 12 attention heads and a dimensionality of 768, yielding a total of 67 M parameters. To enable multi-label classification, we customized the output layer and added a sigmoid activation at the end. The size of the output layer is determined by the number of labels in the training dataset, which is either 7 or 28.
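As a rough illustration, a multi-label head of this kind can be attached to DistilBERT with the Huggingface transformers API along the following lines. This is a minimal sketch under our own assumptions rather than the exact implementation; the checkpoint name, example sentence, and variable names are placeholders.

```python
# Minimal sketch: DistilBERT backbone with a multi-label classification head.
# "distilbert-base-uncased" and the example input are illustrative choices.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

num_labels = 28  # 7 or 28, depending on the label granularity

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=num_labels,
    problem_type="multi_label_classification",  # sigmoid outputs, BCE-style loss
)

# The head produces one logit per label; a sigmoid turns them into
# independent per-label probabilities, as required for multi-label prediction.
inputs = tokenizer("I can't wait to see you again!", return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)
```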

3.2 Training

The first step towards our aim of building an emotion-labeled symbolic music dataset is to train the model to perform multi-label emotion classification based on text input.

Dataset

We trained our model using the GoEmotions dataset [5]. This dataset consists of English comments from the website reddit.com, manually annotated to identify the underlying emotions. It is a multi-label dataset, meaning that each comment can have more than one emotion label. The dataset comprises 27 emotions and a “neutral” label. The labels are further grouped into 7 categories, comprising the six basic emotions identified by Ekman (joy, anger, fear, sadness, disgust, and surprise) and the “neutral” label [7]. The dataset has a total of 58 k samples, split into training, validation, and test sets in a ratio of 80%, 10%, and 10%, respectively. Given its size and number of labels, GoEmotions is one of the largest emotion classification datasets and has the highest number of discrete emotion labels [20].
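For reference, the dataset is available through the Huggingface hub; a possible way to load it is sketched below, assuming the publicly hosted "go_emotions" dataset with its "simplified" configuration (27 emotions plus “neutral” and the official splits). The grouping into the 7 Ekman-level categories would need to be applied separately.

```python
# Hedged sketch: loading GoEmotions via the Huggingface datasets library.
from datasets import load_dataset

goemotions = load_dataset("go_emotions", "simplified")
print(goemotions)              # train / validation / test splits
print(goemotions["train"][0])  # e.g. {'text': ..., 'labels': [...], 'id': ...}
```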

Training and Evaluation Metrics

We trained our models using binary cross-entropy loss. For evaluation, we used precision, recall, and F1-score with macro averaging. The decision cutoff was set at 0.3: predictions with a value of 0.3 or greater are considered positive, and the rest negative.
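The evaluation protocol amounts to binarizing the sigmoid outputs at the 0.3 cutoff and computing macro-averaged scores, as in the following sketch; the probability and target arrays are placeholders standing in for real model outputs and labels.

```python
# Illustration of the evaluation: binarize probabilities at 0.3, then compute
# macro-averaged precision, recall, and F1 over the labels.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

CUTOFF = 0.3

probs = np.array([[0.85, 0.10, 0.42], [0.05, 0.61, 0.12]])  # placeholder outputs
targets = np.array([[1, 0, 1], [0, 1, 0]])                  # placeholder labels

preds = (probs >= CUTOFF).astype(int)
precision, recall, f1, _ = precision_recall_fscore_support(
    targets, preds, average="macro", zero_division=0
)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```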

Implementation Details

We trained two models to classify a given text into 7 and 28 labels. We used a dropout rate of 0.1 and a gradient clipping norm of 1. The batch size was set to 16 for the model with 7 output labels and to 32 for the model with 28 output labels. We applied a learning rate of \(5\times10^{-5}\) for the former and \(3\times10^{-5}\) for the latter. We used early stopping based on the F1-score on the validation set, which corresponded to training for 10 epochs for both models. We implemented the models using the Huggingface library [37] with a PyTorch backend [28] and trained them on a single Nvidia GeForce GTX 1080 Ti GPU.
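One possible training configuration for the 28-label model, matching the hyperparameters reported above (batch size 32, learning rate \(3\times10^{-5}\), gradient clipping at 1.0, early stopping on validation macro F1), is sketched below. The output directory, epoch cap, and patience are our own placeholders, and `model` together with the tokenized train/validation datasets are assumed to come from the earlier sketches.

```python
# Hedged sketch of a possible Trainer-based setup for the 28-label model.
import numpy as np
from sklearn.metrics import f1_score
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

def compute_metrics(eval_pred):
    # Binarize sigmoid probabilities at the 0.3 cutoff and report macro F1.
    logits, labels = eval_pred
    probs = 1 / (1 + np.exp(-logits))
    preds = (probs >= 0.3).astype(int)
    return {"f1": f1_score(labels.astype(int), preds, average="macro", zero_division=0)}

args = TrainingArguments(
    output_dir="distilbert-goemotions-28",  # placeholder path
    per_device_train_batch_size=32,
    learning_rate=3e-5,
    max_grad_norm=1.0,
    num_train_epochs=20,                    # early stopping ends training sooner
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

trainer = Trainer(
    model=model,                  # multi-label DistilBERT from the model sketch
    args=args,
    train_dataset=train_ds,       # tokenized GoEmotions train split (assumed)
    eval_dataset=val_ds,          # tokenized GoEmotions validation split (assumed)
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```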

3.3 Inference

After training the models for text-based emotion classification, we used them in inference mode, with the song lyrics extracted from the MIDI files as inputs. This allowed us to create a MIDI dataset labeled with emotions.

Datasets

We used two publicly available MIDI datasets created by gathering MIDI files from various online sources: the Lakh MIDI dataset, consisting of 176 k samples [30], and the Reddit MIDI dataset, containing 130 k samples [31]. We filtered the datasets by selecting MIDI files that contain English-language lyrics with at least 50 words. This filtering resulted in a total of 12509 files: 8386 from the Lakh MIDI dataset and 4123 from the Reddit MIDI dataset. During inference, we fed each song’s entire lyrics to the two trained models, truncating inputs at 512 tokens.
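The sketch below shows one possible filtering and inference pipeline for this step. The paper does not specify the tooling, so pretty_midi (for lyric events) and langdetect (for language identification) are our own assumed choices, and the function names are illustrative.

```python
# Hedged sketch: extract lyrics from MIDI, keep English songs with >= 50 words,
# and run the trained classifier on the full lyrics truncated to 512 tokens.
import pretty_midi
import torch
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def extract_lyrics(midi_path):
    """Concatenate the lyric events embedded in a MIDI file."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    return " ".join(event.text for event in pm.lyrics)

def keep_file(midi_path):
    """Return the lyrics if the file passes the filters, otherwise None."""
    try:
        lyrics = extract_lyrics(midi_path)
    except Exception:
        return None                      # unreadable or corrupt MIDI file
    if len(lyrics.split()) < 50:
        return None                      # fewer than 50 words
    try:
        if detect(lyrics) != "en":
            return None                  # non-English lyrics
    except LangDetectException:
        return None
    return lyrics

def predict_emotions(lyrics, tokenizer, model):
    """Per-label probabilities for a song's lyrics, truncated at 512 tokens."""
    inputs = tokenizer(lyrics, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return torch.sigmoid(model(**inputs).logits).squeeze(0)
```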

4 Results

In this section, we will first present the emotion classification performance of our trained models. Then, we will introduce the emotion-labeled MIDI dataset, which we created by analyzing the sentiment of the song lyrics using our trained models.

4.1 Emotion Classification on the GoEmotions Dataset

We evaluated the performance of our trained models on the test split of the GoEmotions dataset and compared our results with the baseline presented in the original paper [5]. As in the original paper, we report results for the two label sets, with 7 and 28 emotions. For each label, we report precision, recall, and F1-score, along with their macro-averages. Because the dataset is imbalanced, macro-averaging is more appropriate than micro-averaging; it was also used in the original paper. We note that the baseline model is BERT, which is twice the size of our model [6].

The trade-off between precision and recall is determined by the cutoff value. Therefore, we emphasize higher F1-scores because they provide a more balanced perspective by taking the harmonic mean of precision and recall, and are much less sensitive to the cutoff value. Although the original paper did not state the cutoff value, we achieved the best F1-score and similar performance to the original paper on the 7-label dataset using a cutoff value of 0.3. For consistency, we used the same value for the 28-label dataset. We present our results on the datasets with 7 and 28 labels in Tables 1 and 2, respectively.

Table 1. 7-label classification results
Table 2. 28-label classification results

Based on the F1-scores, our model performs comparably to the baseline on the 7-label dataset. Specifically, our model has a better performance on 2 labels, worse on 2 labels, and the same on 3 labels, as well as for the macro-average. On the 28-label dataset, our model surpasses the baseline with only a lower performance on 2 labels, equal performance on 4 labels, and better performance on the remaining 22 labels. Furthermore, our model demonstrates an improvement of 0.04 in terms of the macro-average.

We hypothesize that a smaller model, such as ours (DistilBERT), may perform better than a larger baseline model (BERT) in certain settings, such as when there are a limited number of training samples or a high output/target dimensionality, as in the case of the 28-label dataset. In these scenarios, models are more prone to overfitting, as has been previously observed [38]. Additionally, the original paper [32] demonstrates that the DistilBERT model outperforms BERT on the Winograd Natural Language Inference (WNLI) dataset [21].

4.2 Labeled MIDI Dataset

We used our trained models to analyze the song lyrics of the Lakh and Reddit MIDI datasets, resulting in an augmented dataset that contains the file paths to 12509 MIDI files and their corresponding predicted probabilities for emotion labels. To provide more flexibility to the users, we did not apply a threshold to the predicted probabilities, allowing the entire dataset to be used as is. We generated two CSV (comma-separated values) files containing the 7 and 28 emotion labels as columns, with the 12509 MIDI file paths as rows. Our code for inference, trained models, and datasets are available online.
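As an illustration of how the released CSV files can be used, the following sketch loads one of them and counts, per emotion, the songs in which that emotion is present. The file and column names are placeholders, and the 0.1 threshold is the presence criterion used in Fig. 1 rather than a value baked into the dataset.

```python
# Illustrative use of the released CSV files (placeholder file/column names).
import pandas as pd

df = pd.read_csv("lyric_emotions_28.csv")                   # hypothetical file name

emotion_cols = [c for c in df.columns if c != "file_path"]  # assumed layout
present = df[emotion_cols] > 0.1                            # binarize per-song emotions

# Number of songs in which each emotion is present (cf. Fig. 1).
print(present.sum().sort_values(ascending=False))
```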

For demonstration purposes, we provide transposed versions of the tables, using only 3 samples, shown in Tables 3 and 4. We note that the values do not necessarily add up to one, due to the nature of multi-label classification.

For further demonstration and ease of analysis, we provide excerpts from the lyrics of each of the three sample songs in Listing 1.1, along with the emotions having predicted probabilities higher than 0.1 in descending order. It is noteworthy that having a dataset with 28 emotion labels allows for a more nuanced representation of emotions. For instance, when we examine this dataset, the song “Imagine” is predicted to have “optimism” as its top emotion, whereas “Take a Chance on Me” is predicted to have “caring” as its top emotion. However, both songs are predicted to have “joy” as their top emotion in the dataset with only seven labels.

We also present the number of samples containing each emotion in our datasets in Fig. 1. In these figures, we excluded the “neutral” label. We also considered emotions with a prediction value higher than 0.1 as positive labels, meaning that those emotions are present for a given sample.

Listing 1.1. Excerpts from the lyrics of the three sample songs, with the emotions predicted above 0.1 in descending order (see text).

Fig. 1. The number of samples containing each emotion in our 7-label (left) and 28-label (right) datasets. The “neutral” label is excluded. Emotions with a prediction value higher than 0.1 are considered positive labels, meaning that those emotions are present for a given sample.

Table 3. Sample entries from the 28-label dataset.
Table 4. Sample entries from the 7-label dataset. Due to space limitations, the file paths are replaced with the artist and song names, which are as follows: John Lennon—Imagine: “lakh/5/58c076b72d5115486c09a7d9e6df1029.mid” (artist and title obtained using the Million Song Dataset [2]), ABBA—Take a Chance on Me: “reddit/A/ABBA.Take a chance on me K.mid”, Elvis Presley—Are You Lonesome Tonight: “reddit/P/PRESLEY.Are you lonesome tonight K.mid”

5 Conclusion and Future Work

In this work, we first trained models on the largest text-based emotion classification dataset, GoEmotions, in both 7-label and 28-label variants [5]. We achieved state-of-the-art results using a model half the size of the baseline. We then used these trained models to analyze the emotions of the song lyrics from the two largest MIDI datasets, Lakh MIDI dataset [30] and Reddit MIDI dataset [31]. This analysis resulted in an augmented dataset of 12509 MIDI files with emotion labels in a multi-label format, using either 7 basic-level or 28 fine-grained emotions. We made the datasets, inference code, and trained models available for researchers to use in various tasks, including symbolic music processing, natural language processing, and sentiment analysis.

In our future work, we plan to further narrow the considerable gap between symbolic music and emotion. In particular, we aim to build better models that automatically compose music conditioned on emotions or other user-provided input. We believe that incorporating emotions is vital in composing music and can help push the boundaries of computational creativity, bringing it one step closer to human-like performance.