
1 Introduction

In recent years, deep learning has driven transformative progress across many disciplines, notably AI music generation. Current music generation research focuses on producing original, high-quality compositions, which relies on two key properties: structural awareness and interpretive ability. Structural awareness enables models to generate coherent music with long-term dependencies, including repetition and variation. Interpretive ability involves translating complex computational models into interactive interfaces for controllable and expressive performances [106]. Furthermore, AI music generation exhibits generality: the same algorithm can be applied across different genres and datasets, enabling exploration of various musical styles [6].

The landscape of music generation presents a unique opportunity for novice researchers to explore and contribute to the field of generative AI. However, navigating the vast amount of research and staying up to date can be challenging. Our survey aims to assess the accessibility and feasibility of music generation algorithms, providing guidance for undergraduate researchers to establish a solid foundation in this exciting area of study.

We have collected notable research on automatic music generation to provide a starting point for researchers to investigate further. By systematizing each study’s algorithms and datasets, researchers can identify effective architecture-data pairings. We also select and explain four papers that best represent fundamental ideas and popular techniques in music generation. By engaging with the information in these resources, researchers can learn about and further explore the structural-awareness properties of music generation.

Moreover, we include content about the interpretive ability of music generation, allowing connections to be made with the area of human interaction. Vast potential lies in the user-experience and interface side of AI music generation: researchers can explore how algorithms can be designed to be user-friendly and accessible to a human composer, as well as how these algorithms can be integrated into a creative workflow as a powerful tool to expand and enhance musical ideas.

By studying our resources and content, researchers can find new ideas and draw original connections, enabling them to make significant contributions to the music generation field.

1.1 Related Works

Several review papers provide valuable insights into the trends and methodologies of music generation, such as [6, 42, 99, 106]. We further expand our analysis by exploring a range of scholarly articles that employ analogous time-series data [4, 11, 18, 19, 29, 46, 50, 57–62, 73, 75–77, 84, 87, 97, 101, 102, 104, 107–109]. The literature highlights that algorithmic performance depends on various factors, including the quality and diversity of the training data, the model’s complexity, and the strategies employed by the model. When it comes to training data, music can be represented in two main formats: symbolic files and raw audio files.

Symbolic audio refers to encoding music information through symbols that represent different aspects of music. The most common form of symbolic music data is MIDI (Musical Instrument Digital Interface), which uses discrete values to represent note pitches and durations [106].
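
As a concrete illustration of the symbolic format, the short sketch below reads note pitches and durations from a MIDI file. The use of the pretty_midi library and the file name are our own choices for illustration, not something prescribed by the surveyed papers.

```python
# Minimal sketch: extracting (pitch, duration) pairs from a MIDI file.
# Assumes the pretty_midi package is installed and "example.mid" exists;
# neither is prescribed by the surveyed papers.
import pretty_midi

midi = pretty_midi.PrettyMIDI("example.mid")
for instrument in midi.instruments:
    for note in instrument.notes:
        duration = note.end - note.start          # note length in seconds
        print(instrument.name, note.pitch, round(duration, 3))
```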

Raw audio refers to any music file format that encodes an actual audio signal, such as MP3, WAV, and FLAC files, all of which can be used for training. Raw audio has the advantage of capturing expressive characteristics inherent to the original recording, at the cost of being computationally demanding [6].
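
By contrast, raw audio is typically loaded as a sampled waveform before being fed to a model. A minimal sketch, assuming the librosa library and an example WAV file of our own choosing:

```python
# Minimal sketch: loading a raw audio file as a waveform for model input.
# Assumes the librosa package and an "example.wav" file; other raw formats
# (MP3, FLAC, ...) supported by the audio backend load the same way.
import librosa

waveform, sample_rate = librosa.load("example.wav", sr=None)  # keep native rate
print(waveform.shape, sample_rate)  # 1-D float array and its sampling rate
```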

1.2 Research Questions

Our investigation seeks to answer the following questions:

  • What are recent trends in music generation research?

  • Which papers should undergraduate researchers read to gain a thorough understanding of AI music generation?

  • Which algorithms and datasets are suited for undergraduate-level research?

2 Methods

2.1 Search Methods for Identification of Studies

The PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) method was employed to select relevant studies for this review. The search was conducted between January and March 2023 on Google Scholar, Papers with Code, and arXiv. A combination of keywords was used: (’Deep Learning’ OR ’Generative Model’ OR ’LSTM’ OR ’GAN’ OR ’VAE’ OR ’Transformer’ OR ’Attention’) AND (’Time Series’ OR ’Music’ OR ’Music Theory’ OR ’jazz’ OR ’MIDI’ OR ’Audio’). The process of identifying and refining the study collection is illustrated in Fig. 1.

Fig. 1. Selection process for the papers

The criteria for selecting appropriate papers are as follows:

  • Task: Focus on studies with the objective of AI music generation, specifically those that create new music with high fidelity and long-term consistency using existing data. Studies on unrelated areas and tasks, such as genre classification, computer vision, and medical applications, are excluded.

  • Deep Learning: Limit the scope to studies that employ deep learning, defined in this review as architectures with more than one hidden layer. Single-layer generative models are excluded.

  • Transformers: To assist researchers in their exploration of music generation, we recognize the potential for innovation in transformer-based algorithms. As such, we select prominent papers for researchers to learn more about the advancements and capabilities of transformers.

  • Model Availability: Prioritize research with publicly available models and datasets. This excludes public music generation websites, as their data and architectures are often undisclosed and subject to change.

  • Target Audience: The review is tailored to those interested in undergraduate-level research at the intersection of computer science and music. We provide an overview of music generation so that students can conduct their own research, developing their skills and knowledge in the field while accounting for the limited time and experience available to them.

  • Time: The review only includes studies published in 2017 and onward to account for the rapid progress of the field.

3 Results

We categorize the essential algorithms of 62 papers and draw overall conclusions about their strengths and weaknesses for the task of music generation. The trends in model selection are visualized in Fig. 2: transformers have gained significant attention since 2018, while LSTM usage has declined over the years; AEs and GANs are widely and steadily utilized. This information is summarized in Fig. 3: transformer-based architectures were the most commonly used, followed by AEs, GANs, and LSTM neural networks. We also compile a list of prominent datasets in Fig. 4.

Fig. 2. Visualization of model-selection trends over time

Fig. 3. Algorithm usage by papers

Fig. 4. Datasets used by papers

4 Discussion

4.1 Papers

We have identified four essential papers that we recommend undergraduate students read to gain an understanding of popular generative architectures.

  • Huang et al. [44] introduce a foundational architecture that applies transformers to MIDI data. Their architecture is similar to that of the original transformer paper, except for an innovation they call relative attention. Relative attention modifies the attention mechanism by taking into account how close or far apart two elements of the MIDI sequence are when computing attention coefficients (see the relative-attention sketch after this list). This allows the transformer to generate music that is more coherent on small timescales.

  • Dhariwal et al. [21] present an effective combination of the transformer and the VAE that operates on raw audio. The overall architecture of their Jukebox model is that of a VQ-VAE. The model has three levels of VQ-VAEs, each of which independently encodes the input data. Thinking of these levels as vertically stacked, the topmost level is the coarsest, encoding only high-level essential information, while the lowest level encodes the fine details of the music. With these latent spaces, they then train sparse transformers that upsample from a higher-level latent space to a lower one. To generate music, the model samples a datapoint from the latent space of the uppermost VQ-VAE, uses the transformers to upsample it to the latent spaces of the lower-level VQ-VAEs, and then, at the lowest level, uses the VQ-VAE decoder to turn the upsampled datapoint into raw audio (see the vector-quantization sketch after this list).

  • Dong et al. [27] introduce MuseGAN, a GAN architecture for symbolic multi-track piano-roll data. MuseGAN employs a WGAN-GP framework, which includes modified objective functions and a gradient penalty for the discriminator, leading to faster convergence and reduced parameter tuning (see the gradient-penalty sketch after this list). The model consists of two components: the multitrack model and the temporal model. The multitrack model incorporates GAN submodels based on three compositional approaches: jamming, composing, and a hybrid of both. Discriminators within these submodels evaluate the specific characteristics of each track. The temporal model comprises two submodels: one for generation from scratch, capturing temporal and bar-by-bar information, and another for track-conditional generation, using conditional track inputs to generate sequential bars. By combining these models, MuseGAN produces latent vectors that incorporate inter-track and temporal information, which are then used to generate piano rolls sequentially.

  • Huang and Yang [47] provide a promising direction for data conversion with their introduction of REMI (revamped MIDI-derived events). Instead of traditional MIDI-based representations, REMI describes musical events in greater detail so that more information about the original music is preserved. Specifically, REMI adds tempo and chord events to the data, reinterprets the time grid from a second-based to a position- and bar-based one, and encodes each note's duration directly instead of its ending position (see the tokenization sketch after this list). REMI helped a transformer-based model output samples with a stronger sense of downbeat and more natural, expressive use of chords. The paper also introduces the Pop Music Transformer, a transformer-based architecture for music generation. This model differs from traditional transformer models in that it learns to compose over a metrical structure defined in terms of bars, beats, and sub-beats through the application of REMI. This approach allows the model to generate music with a more salient and consistent rhythmic structure and to produce musically pleasing pieces without human intervention.
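
To make the relative-attention idea of [44] concrete, the sketch below adds a learned bias, indexed by the signed distance between query and key positions, to standard scaled dot-product attention logits. It is a simplified, single-head illustration with tensor names of our own choosing; the actual Music Transformer uses a more memory-efficient formulation.

```python
# Simplified sketch of relative attention: a learned bias, indexed by the
# distance between query and key positions, is added to the attention logits.
# Illustration of the idea in [44], not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # one learnable scalar bias per possible signed distance
        self.rel_bias = nn.Embedding(2 * max_len - 1, 1)
        self.max_len = max_len

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scale = q.size(-1) ** 0.5
        logits = torch.matmul(q, k.transpose(-2, -1)) / scale   # (B, L, L)
        pos = torch.arange(x.size(1), device=x.device)
        dist = pos[None, :] - pos[:, None] + self.max_len - 1   # shifted distances
        logits = logits + self.rel_bias(dist).squeeze(-1)       # add relative bias
        return torch.matmul(F.softmax(logits, dim=-1), v)

out = RelativeSelfAttention(d_model=64, max_len=128)(torch.randn(2, 32, 64))
```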
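
The building block that each Jukebox level [21] relies on is vector quantization: every encoder output vector is snapped to its nearest codebook entry, and the resulting discrete code indices are what the prior transformers model. A minimal sketch of that step, with codebook size and dimensions chosen arbitrarily by us:

```python
# Minimal sketch of the vector-quantization step used by each VQ-VAE level in
# Jukebox [21]: every encoder output vector is replaced by its nearest codebook
# entry. Codebook size and dimensionality here are arbitrary choices.
import torch

codebook = torch.randn(512, 64)            # 512 learnable codes of dimension 64
encoder_out = torch.randn(4, 100, 64)      # (batch, time, dim) from the encoder

# pairwise distances between every encoder vector and every codebook entry
dists = torch.cdist(encoder_out, codebook.unsqueeze(0).expand(4, -1, -1))
codes = dists.argmin(dim=-1)               # (batch, time) discrete code indices
quantized = codebook[codes]                # (batch, time, dim) quantized vectors
```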
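
MuseGAN's WGAN-GP training [27] hinges on a gradient penalty that keeps the critic's gradient norm close to 1 at points interpolated between real and generated samples. A minimal sketch of that penalty term, using a placeholder critic and tensor shapes of our own choosing:

```python
# Minimal sketch of the WGAN-GP gradient penalty used in MuseGAN [27]: the
# critic's gradient norm is pushed towards 1 at points interpolated between
# real and generated samples. The critic and tensor shapes are placeholders.
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Flatten(), nn.Linear(4 * 84 * 96, 1))  # placeholder critic

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    eps = torch.rand(real.size(0), 1, 1, 1)                # per-sample mixing weight
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score = critic(interp)
    grads = torch.autograd.grad(outputs=score, inputs=interp,
                                grad_outputs=torch.ones_like(score),
                                create_graph=True)[0]
    grad_norm = grads.view(real.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()

real = torch.rand(8, 4, 84, 96)   # e.g. a batch of multi-track piano-roll segments
fake = torch.rand(8, 4, 84, 96)
penalty = gradient_penalty(critic, real, fake)
```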
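
To convey the flavour of REMI [47], the sketch below converts a toy note list into a bar- and position-based event sequence with explicit duration tokens. The vocabulary and quantization are heavily simplified relative to the actual REMI specification, and the tempo and chord events described above are omitted.

```python
# Toy illustration of REMI-style encoding [47]: notes are expressed against a
# bar/position grid and carry explicit duration tokens. The vocabulary and
# quantization are heavily simplified relative to the actual REMI spec.
POSITIONS_PER_BAR = 16   # 16th-note grid within each 4/4 bar

# (bar index, position within bar, MIDI pitch, duration in grid steps)
notes = [(0, 0, 60, 4), (0, 4, 64, 4), (0, 8, 67, 8), (1, 0, 72, 16)]

def to_remi_like_tokens(notes):
    tokens, current_bar = [], -1
    for bar, pos, pitch, dur in notes:
        if bar != current_bar:           # emit a Bar token whenever a new bar starts
            tokens.append("Bar")
            current_bar = bar
        tokens += [f"Position_{pos}/{POSITIONS_PER_BAR}",
                   f"Pitch_{pitch}",
                   f"Duration_{dur}"]
    return tokens

print(to_remi_like_tokens(notes))
```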

4.2 Algorithms

Our survey suggests that the main algorithms used for music generation in the last five years are transformers, autoencoders such as VAEs, GANs, and LSTMs, with transformers being by far the most popular. Due to the popularity of transformers and their success with music generation, we focus on them for much of our analysis.

Transformers are applicable in both symbolic and raw audio domains with convincing results, offering flexibility for researchers to pursue their research interests. Also, the literature shows that transformers have broad functionality and involve specific components and mechanisms that make them worth exploring individually.

These components can be fine-tuned and altered to match the needs of a given task. For example, [41, 43] use a modification known as relative attention, which adjusts the attention coefficients based on how far apart two tokens are. There is also the Transformer-XL modification used by [24], which adds a recurrence mechanism to hidden states within the transformer and has been shown to increase performance [17]. Others, such as [21], use sparse transformers, which introduce sparsity to the attention heads, reducing the \(O(n^2)\) time and memory costs to \(O(n\sqrt{n})\) [13].
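
As an illustration of how sparsity cuts the quadratic cost, the sketch below builds a boolean attention mask that combines a local window with a strided pattern, in the spirit of (but not identical to) the factorized patterns in [13]; each query attends to roughly \(O(\sqrt{n})\) keys instead of \(n\).

```python
# Illustrative sparse attention mask in the spirit of [13]: each query attends
# only to a local window plus a strided subset of earlier positions, so the
# number of attended keys per query grows roughly like sqrt(n) rather than n.
# Didactic pattern only, not the exact factorization used in Sparse Transformers.
import math
import torch

def sparse_causal_mask(n: int) -> torch.Tensor:
    stride = max(1, int(math.sqrt(n)))
    q = torch.arange(n)[:, None]             # query positions
    k = torch.arange(n)[None, :]             # key positions
    causal = k <= q                           # no attention to future positions
    local = (q - k) < stride                  # recent window of ~sqrt(n) positions
    strided = (k % stride) == (stride - 1)    # periodic "summary" positions
    return causal & (local | strided)

mask = sparse_causal_mask(64)
print(mask.float().sum(dim=1).mean())         # average keys attended per query
```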

4.3 Datasets

A variety of both symbolic and raw audio datasets were used by the transformer papers. We observe that certain datasets work best with transformers in capturing complex and long-range dependencies within music sequences:

The LakhMIDI dataset, the largest dataset of symbolic data available, has great potential due to its large training size and MIDI-audio pairings, and was the most popular among transformer papers we surveyed, with five different papers using it. In [26, 34, 78], the authors use LakhMIDI to derive token sequences from MIDI files to create multitrack music transformer models. In [94], a multi-track pianoroll dataset derived from LakhMIDI is used for a transformer as well. [24] maps LakhMIDI tracks onto instrumentation playable by the NES, and then uses a transformer to generate NES versions of songs.

Similar in size and function is the MAESTRO dataset, which is used by the transformer models of [41, 67] and is the second most popular dataset among transformer papers. The MAESTRO dataset contains over 200 hours of piano performances, stored in both raw audio and MIDI formats. The MIDI-audio pairings enable music information to be retrieved from the MIDI files and used as annotations for the matched audio files. Every file is labeled, allowing for easy supervised training.
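
For instance, MAESTRO ships with a metadata CSV that records the published train/validation/test split and the paired MIDI and audio file names, which makes it straightforward to set up supervised experiments. A sketch of filtering the training split, assuming the column names of the v3 release (verify against the file you actually download):

```python
# Sketch of selecting MAESTRO's published training split via its metadata CSV.
# Column names (split, midi_filename, audio_filename) are assumed from the v3
# release; check them against the downloaded metadata file.
import pandas as pd

meta = pd.read_csv("maestro-v3.0.0/maestro-v3.0.0.csv")
train = meta[meta["split"] == "train"]
print(len(train), "training performances")
print(train[["midi_filename", "audio_filename"]].head())
```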

Besides LakhMIDI and MAESTRO, every other transformer paper we found used a different dataset. In fact, a popular choice among them was to create or scrape their own dataset, which was done by [21, 47, 89, 100].

Overall, for undergraduates working with transformers on symbolic data, we recommend the Lakh MIDI dataset due to its large size and relative popularity among other transformer papers. For those who want to use raw audio, we recommend the MAESTRO dataset, similarly for its large size and popularity.

4.4 Future Work

The future development of music generation technology is increasingly focused on enhancing the ability to control models structurally. In later work, further research and analysis could investigate the data and model decisions behind each study, as well as broaden the pool of surveyed research, to better understand the factors contributing to successful music generation outcomes.

Moving forward, we encourage undergraduate researchers to engage in more experimental and collaborative work, exploring the combination of different algorithms and datasets to develop new approaches to music generation.

5 Conclusion

In conclusion, our review provides a survey of deep learning algorithms and datasets for music generation, with the aim of assisting undergraduate researchers interested in the field. Our findings suggest that in the last five years, transformers, GANs, autoencoders, and LSTMs have been the primary algorithms used for AI music generation, with transformers gaining significant popularity in recent years. We find that the papers use a wide variety of datasets, meaning no single dataset predominates. We suggest four papers that we believe are important for undergraduates to read to get a solid grasp of the field, and we also recommend an algorithm and datasets for undergraduates to use in their initial forays into AI music generation.