1 Introduction

The desire to understand creativity has driven the development of computationally creative systems across a wide variety of tasks [5]. Just as deep learning has reshaped the whole field of artificial intelligence, it has reinvented generative modeling in recent years [63]. This thriving research area includes, for example, the creative generation and style transfer of artwork such as paintings or music [15, 21].

Even with the research interest in generative systems, the assessment and evaluation of such systems have proven challenging. Formally, a categorization of evaluation strategies can be derived from the design ontology of the system. For instance, based on the Function–Behavior–Structure (FBS) ontology [18, 62], we evaluate the actual behavior of a system compared to its expected behavior. The evaluation of creative systems can be categorized into function and structure evaluation, categories that relate directly to the so-called summative and formative approaches. While the former aims to assess whether the results of a system meet the stated goal of creativity, the latter focuses on monitoring how the instructional goals and objectives are being met [13, 20, 46]. Without a clear definition of and consensus on the essence of (human) creativity, summative evaluation remains largely problematic [28].

As the ultimate judge of creative output is the human (listener or viewer), subjective evaluation is generally preferable in generative modeling. The challenges of designing and conducting an experiment that leads to valid, reliable, and replicable results, however, are often underestimated. Controlling all relevant variables, eliminating bias, and recruiting a sufficient number of qualified subjects can easily push the required resources out of reach for small-scale projects. The most common shortcomings of subjective studies evaluating generative systems are closely related to both the available resources and the design of the experimental methodology [28, 47].

Thus, a method for objective evaluation of generative systems is desirable.

The image generation community has benefited from the introduction of the inception score by Salimans et al. [47]. It uses a pattern recognition model to assess the generated samples. The general concept of the inception score is based on the assumption that a well-trained image classifier roughly has a human-like classification ability [47]. This idea has been adapted by multiple researchers to allow for an objective measure of various generative systems [26, 29, 39]. The idea of the inception score is convincing and the first results look promising; ultimately, however, the assumed correlation with human judgment still needs further scientific examination [19, 64].
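
As an illustration of the underlying computation, a minimal sketch of the inception score could look as follows; it assumes only a matrix of class probabilities p(y|x) obtained from a pretrained classifier, and the function name and interface are ours rather than from [47].

```python
import numpy as np

def inception_score(class_probs, eps=1e-12):
    """Minimal sketch of the inception score.

    class_probs: array of shape (num_samples, num_classes) holding the
    classifier's predicted class distribution p(y|x) for each generated sample.
    """
    p_yx = np.asarray(class_probs, dtype=float)
    p_y = p_yx.mean(axis=0, keepdims=True)   # marginal class distribution p(y)
    # mean KL divergence between p(y|x) and p(y), exponentiated
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))
```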

The evaluation of generative music systems faces even harder challenges than that of image generation systems [9]. The sequential yet highly structured form, the ever-changing interaction between composition and performance, and the abstract nature of meaning and emotion in music [36, 61] make a semantic description of music exceedingly hard. The automatic analysis and categorization of music, although it has made great progress, is not close to human-level performance [35]. This makes assessing music very difficult [3, 22, 41, 59] and partly explains why music assessment has not been automated by computational models so far.

Despite these high-level challenges, we will show below that state-of-the-art generative music systems struggle with creating musical content that follows basic technical rules and expectations. We argue that these technicalities have to be solved before addressing the questions of aesthetics of creative works with high-level structural and harmonic properties.

Therefore, we propose a formative evaluation strategy for systems generating symbolic music. The proposed method does not aim at assessing musical pieces in the context of human-level creativity nor does it attempt to model the aesthetic perception of music. It rather applies the concept of multicriteria evaluation [54] in order to provide metrics that assess basic technical properties of the generated music and help researchers identify issues and specific characteristics of both model and dataset. The usefulness of the presented method is demonstrated through a series of experiments, including dataset analysis, comparison of state-of-the-art music generation models, and assessment of generative music systems.

2 Related work

As mentioned above, research on automatic music generation systems has suffered from the difficulty of designing evaluation methodologies [42]. Measuring the success of a generative system poses two challenges: the summative and the formative assessment of the system's behavior. Subjective approaches that measure the success of generative systems by means of listening experiments can often be categorized as summative assessment, while objective evaluation strategies mostly fall into the category of formative assessment. Confusing these two challenges leads to unclear evaluation strategies. Although subjective evaluation is generally preferable for evaluating generative models, it might require significant resources. Objective methods, on the other hand, can be executed easily yet often lack musical relevance, as they are frequently not based on musical rule systems or heuristics.

Table 1 Experiment design for subjectively evaluating music generation research

2.1 Subjective evaluation in music generation

Most assessments of generated symbolic music are based on inputs from human listeners. These evaluations either follow the concept of a musical Turing test [3] or use query metrics based on the modeled compositional theory [2].

The Turing test [55] follows an intuitive concept that evaluates whether a machine is able to exhibit behavior indistinguishable from humans. One strategy to adapt the Turing test to generative music systems is asking the subjects to identify the pieces they consider to be composed by a human as opposed to a computer [34]. This strategy has been used in several studies as listed in Table 1 [1, 21, 24, 25, 32, 49]. Over the past decades, shortcomings of the Turing test have been pointed out in various areas [2, 17, 44]. Many of these problems also apply to musical Turing tests. One of the fundamental issues, however, is that many studies confound the two questions of whether a piece is aesthetically pleasing and whether it was composed by a human.

The design of a listening experiment is complex due to the many variables, ranging from the selection and rendition of audio examples, the listening environment, and the selection of subjects to the phrasing of the questions. Without proper guidance (compare, e.g., [6]), we find that many contemporary studies struggle with presenting significant scientific evidence. Table 1 lists some of the variables for several major subjective evaluation studies in the context of music generation. It is worth noting that all of these evaluations are performed with a different problem configuration, i.e., different evaluation criteria are used, and both the questionnaires and the listening examples are proprietary (if not arbitrary) and hard to compare. First, the majority of them ignore factors associated with the subjects themselves (e.g., their level of expertise), which influences further analysis and the reliability of the experimental results [6]. Second, most of the studies rely—probably due to limited resources—on a relatively small sample size [11, 43, 60], which raises questions about the range of the confidence interval and the study's statistical significance (which are often not reported) [33]. Note that the common lack of reported statistical measures of confidence and significance could itself be seen as an indicator of insufficient scientific rigor. Finally, some of the studies rely on the preference of one model over another [11, 60]. The drawback of such a test paradigm is the absence of a standard comparison or absolute reference: while it can measure relative differences or improvements, it cannot provide an absolute measurement of quality. Without addressing these issues properly, the reported results can only be understood as preliminary and fail to represent a scientific benchmark.

Last but not least, these tests carry the risk of overestimating the subject’s comprehension, as Ariza concludes after comparing several subjective evaluation methods (e.g., Musical Turing Tests, Musical Directive Toy Tests and Musical Output Toy Tests) [2].

2.2 Objective evaluation in music generation

Given the advantages over subjective evaluation with respect to reproducibility and required resources, several recent studies have assessed their models objectively. We categorize the objective evaluation methods used by the recent studies on data-driven music generation into the following categories: (1) probabilistic measures without musical domain knowledge, (2) task-/model-specific metrics, and (3) metrics using general musical domain knowledge.

2.2.1 Probabilistic measures

Evaluation metrics based on probabilistic measures such as likelihood and density estimation have been used successfully in tasks such as image generation [54] and are increasingly used in music-related tasks as well [14, 52]. For example, Huang et al. [24] propose a frame-wise evaluation computing the negative log-likelihood between the model output and the ground truth across frames. Similarly, Johnson considers the note combinations over time steps of the training data as the ground truth and reports the sum of the generated sequence's log-likelihood across notes and time steps [27]. Since the recurrent model used in his study is trained with the goal of maximizing the log-likelihood of each training sequence, the measure is argued to be a meaningful quantitative measure of performance. These probabilistic measures provide objective information, yet Theis et al. observe that "A good performance with respect to one criterion does not necessarily imply a good performance with respect to another criterion" and provide examples of bad samples with very high likelihoods [54].
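
For illustration, one possible formulation of such a frame-wise negative log-likelihood is sketched below. It assumes a binary piano-roll ground truth and per-frame note probabilities; it is a simplified sketch and not tied to the exact formulation of the cited studies.

```python
import numpy as np

def framewise_nll(pred_probs, ground_truth, eps=1e-12):
    """Sketch of a frame-wise negative log-likelihood.

    pred_probs:   (num_frames, num_pitches) predicted note-on probabilities.
    ground_truth: binary piano-roll of the same shape (1 = note active).
    """
    p = np.clip(np.asarray(pred_probs, dtype=float), eps, 1.0 - eps)
    t = np.asarray(ground_truth, dtype=float)
    # Bernoulli log-likelihood per pitch and frame, summed per frame, averaged over frames
    ll = t * np.log(p) + (1.0 - t) * np.log(1.0 - p)
    return float(-ll.sum(axis=1).mean())
```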

2.2.2 Model-specific metrics

As the approaches and models vary greatly between different generative systems, some evaluation metrics are designed for a specific model or task. Bretan et al. proposed a metric for successfully predicting a music unit from a pool of units in a generative system by evaluating the rank of the target unit [8]. Mogren designed metrics informed by statistical measurements of polyphony, scale consistency, repetitions, and tone span to monitor the model's characteristics during training [37]. Common to these evaluation approaches is the use of domain-specific, custom-designed metrics as opposed to standard metrics. The authors evidently recognized that standard metrics (e.g., the edit distance between melodies) are musically meaningless for this purpose and therefore implemented metrics inspired by domain knowledge. The variability and diversity of the proposed metrics, however, lead to comparability issues. The design of nonstandard metrics also poses additional dangers, such as evaluating only one aspect of the output or evaluating with a metric that is part of the system design.

2.2.3 Metrics based on domain knowledge

To address the multi-criteria nature of generative systems and their evaluation [9], various humanly interpretable metrics have been proposed. More specifically, these metrics integrate musical domain knowledge and enable detailed evaluation with respect to specific musical characteristics. Chuan et al. utilize metrics modeling tonal tension and interval frequencies to compare how different feature representations influence a model's performance [12]. Sturm et al. [52] provide a statistical analysis of musical events (occurrence of specific meters and modes, pitch class distributions, etc.), followed by a discussion with examples of different application scenarios. Similarly, Dong et al. apply statistical analyses including tonal distance, rhythmic patterns, and pitch classes to evaluate a multi-track music generator [14]. The advantages of metrics that take domain knowledge into account lie not only in their interpretability, but also in their generalizability and validity—at least as long as the designed model aims to generate music under the established rules.

3 Method

Following the approach of using domain knowledge for designing human-interpretable evaluation metrics for generative music systems, we present a formative evaluation strategy based on a comprehensive set of simple yet musically meaningful features that can be easily applied to a wide variety of different symbolic music generation models.

The two targets of the proposed evaluation strategy are to provide (1) absolute metrics in order to give insights into properties and characteristics of a generated or collected set of data and (2) relative metrics in order to compare two sets of data, e.g., training and generated. The overall method is illustrated in Fig. 1 and described below.

In a first step, we gather two collections of samples as our input datasets. For the application of objective evaluation, one dataset contains generated samples, the other contains samples from the training (target) dataset. This approach can also be used for applications such as dataset analysis or the comparison of characteristics of two generative systems. We then extract a set of custom-designed features that are rooted in musical domain knowledge yet easy to understand and interpret. These features encompass both pitch-based and rhythm-based features. After extracting these features for both datasets, we are able to compute both an absolute measurement (Fig. 1, top) and a relative measurement. The absolute measurement can provide useful insights to a system developer about the training dataset properties and generative system’s characteristics.

The relative measurement (Fig. 1, bottom), on the other hand, allows the comparison of two distributions along various dimensions. It is computed by first applying pairwise exhaustive cross-validation to compute the distance of each sample to all samples of either the same dataset (intra-set) or the other dataset (inter-set). The results are distance histograms per feature. Next, the probability density function (PDF) of each feature histogram is estimated by kernel density estimation [50].

Finally, we compute two metrics for the objective evaluation of generative systems from the training dataset’s intra-set distance PDF (target distribution) and the inter-set distance PDF between the training and generated datasets: (1) the area of overlap and (2) the Kullback–Leibler Divergence (KLD). The steps are introduced in detail in the following sections.

Fig. 1 General workflow of the proposed method

3.1 Input representation

Our proposed evaluation method reads input files in the Musical Instrument Digital Interface (MIDI) format. MIDI is considered one of the standard formats for the symbolic-domain representation of music [38]. Although a music generation system might have its own data representation and output format, the output is usually converted to MIDI for distribution and auralization. The MIDI file format also provides useful musical metadata such as the time signature and the bar length through the resolution of the MIDI file.

For the current implementation of our method, the input samples are required to be monophonic melodies with a fixed number of measures.
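
As an illustrative sketch (not part of the released toolbox), such an input check could be implemented with a MIDI parser such as pretty_midi; the function name is hypothetical, and approximating the bar count from the downbeat positions is our simplification.

```python
import pretty_midi  # assumed MIDI parser; the method itself is format-agnostic

def is_valid_input(midi_path, expected_bars=8):
    """Sketch of an input check: monophonic melody with a fixed number of bars."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    notes = sorted(pm.instruments[0].notes, key=lambda n: n.start)
    # monophony: no note starts before the previous note has ended
    monophonic = all(b.start >= a.end for a, b in zip(notes, notes[1:]))
    # bar count approximated from the metadata-derived downbeat positions
    bars = len(pm.get_downbeats())
    return monophonic and bars == expected_bars
```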

3.2 Feature extraction

The features listed below are computed both for the entire sequence and for each measure in order to capture structural information.

3.2.1 Pitch-based features

1. Pitch count (PC): The number of different pitches within a sample. The output is a scalar for each sample.

2. Pitch class histogram (PCH): The pitch class histogram is an octave-independent representation of the pitch content with a dimensionality of 12 for a chromatic scale [4, 40]. In our case, it represents the octave-independent chromatic quantization of the frequency continuum.

3. Pitch class transition matrix (PCTM): The transition of pitch classes contains useful information for tasks such as key detection [30, 53], chord recognition [31], or genre pattern recognition [10]. The two-dimensional pitch class transition matrix is a histogram-like representation computed by counting the pitch transitions for each (ordered) pair of notes. The resulting feature dimensionality is 12 × 12.

4. Pitch range (PR): The pitch range is calculated as the difference between the highest and the lowest used pitch in semitones. The output is a scalar for each sample.

5. Average pitch interval (PI): The average value of the interval between two consecutive pitches in semitones. The output is a scalar for each sample.
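
A minimal sketch of how some of these pitch-based features could be computed from a monophonic MIDI melody is given below; it assumes pretty_midi for parsing, and computing the average pitch interval from absolute semitone differences is our assumption.

```python
import numpy as np
import pretty_midi  # assumed available; any MIDI parser exposing note pitches works

def pitch_features(midi_path):
    """Sketch of pitch count (PC), pitch class histogram (PCH),
    pitch range (PR), and average pitch interval (PI) for a monophonic melody."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    notes = sorted(pm.instruments[0].notes, key=lambda n: n.start)
    pitches = np.array([n.pitch for n in notes])

    pc = len(np.unique(pitches))                    # pitch count
    pch = np.bincount(pitches % 12, minlength=12)   # pitch class histogram (12 bins)
    pch = pch / max(pch.sum(), 1)                   # normalize to a distribution
    pr = int(pitches.max() - pitches.min())         # pitch range in semitones
    # average pitch interval; absolute intervals are our simplification
    pi = float(np.abs(np.diff(pitches)).mean()) if len(pitches) > 1 else 0.0
    return pc, pch, pr, pi
```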

3.2.2 Rhythm-based features

1. Note count (NC): The number of used notes. As opposed to the pitch count, the note count does not contain pitch information but is a rhythm-related feature. The output is a scalar for each sample.

2. Average inter-onset-interval (IOI): The average time between the onsets of two consecutive notes. The output is a scalar in seconds for each sample.

3. Note length histogram (NLH): To extract the note length histogram, we first define a set of allowable beat length classes [full, half, quarter, 8th, 16th, dot half, dot quarter, dot 8th, dot 16th, half note triplet, quarter note triplet, 8th note triplet]. The rest option, when activated, doubles the vector size to represent the same lengths for rests. Each note length is quantized to the closest length class, using a basic time unit of (bar length)/96. The output vector has a length of either 12 or 24, respectively.

4. Note length transition matrix (NLTM): Similar to the pitch class transition matrix, the note length transition matrix provides useful information for rhythm description [57]. The output feature dimension is 12 × 12 or 24 × 24, respectively.
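
Analogously, the note count and the average inter-onset-interval could be sketched as follows, again assuming pretty_midi and onset times in seconds.

```python
import numpy as np
import pretty_midi  # assumed; see the pitch-feature sketch above

def rhythm_features(midi_path):
    """Sketch of the note count (NC) and average inter-onset-interval (IOI)."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    notes = sorted(pm.instruments[0].notes, key=lambda n: n.start)
    onsets = np.array([n.start for n in notes])   # onset times in seconds

    nc = len(notes)                               # note count
    ioi = float(np.diff(onsets).mean()) if nc > 1 else 0.0
    return nc, ioi
```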

Obtaining these domain-knowledge-based features gives us a generally interpretable representation of the data. The features, however, have different dimensionality and normalization, complicating their direct use. Therefore, additional processing is applied to all of these features.

3.3 Absolute measurement

During the model design phase of a generative system, it can be of interest to investigate absolute metrics from the output of different system iterations or of datasets, as opposed to a relative evaluation. A typical example is the comparison of the generated results from two generative systems: although the model properties cannot be determined precisely for a data-driven approach, the observation of the generated samples can justify or invalidate system design choices (see, e.g., Sect. 4.2).

For this analysis, the mean and standard deviation (Footnote 1) of each feature of the data are computed.

3.4 Relative measurement

In order to enable the comparison of different sets of data, the relative measurement generalizes the results across features with different dimensionality; the features are summarized by (1) the intra-set distances and (2) the difference between intra-set and inter-set distances.

3.4.1 Pairwise cross-validation

To compare the distance of the features within and between sets of data, a pairwise exhaustive cross-validation [16] is performed for each feature. In each cross-validation step, the Euclidean distance of one sample to each of the other samples is computed. If the cross-validation is computed within one set of data, we will refer to it as intra-set distances. If each sample of one set is compared with all samples of the other set, we call it the inter-set distances. The output of this process is a histogram of distances for each feature.
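
A sketch of this pairwise distance computation is given below; it assumes that each feature has been flattened to a fixed-length vector per sample (e.g., 144 values for the PCTM) and uses NumPy broadcasting instead of an explicit cross-validation loop. The helper names are ours.

```python
import numpy as np

def _as_matrix(features):
    """Ensure shape (num_samples, feature_dim); scalar features become column vectors."""
    f = np.asarray(features, dtype=float)
    return f[:, None] if f.ndim == 1 else f

def intra_set_distances(features):
    """Exhaustive pairwise Euclidean distances within one set of samples."""
    f = _as_matrix(features)
    d = np.linalg.norm(f[:, None, :] - f[None, :, :], axis=-1)
    iu = np.triu_indices(len(f), k=1)   # drop self-distances and duplicate pairs
    return d[iu]

def inter_set_distances(features_a, features_b):
    """Euclidean distances between every sample of set A and every sample of set B."""
    a, b = _as_matrix(features_a), _as_matrix(features_b)
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).ravel()
```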

3.4.2 Kernel density estimation

In order to smooth the histogram results for a more generalizable representation, kernel density estimation [50] is applied to convert the histograms into PDFs. A Gaussian kernel and Scott's rule of thumb for bandwidth selection [48, 56] are used for all features for both inter-set and intra-set distances.
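
Using SciPy, whose gaussian_kde defaults to Scott's rule of thumb for the bandwidth, this smoothing step could be sketched as follows; the helper name and the explicit evaluation grid are our choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

def distance_pdf(distances, grid):
    """Smooth a distance histogram into a PDF via Gaussian kernel density estimation.
    gaussian_kde uses Scott's rule of thumb for the bandwidth by default."""
    kde = gaussian_kde(np.asarray(distances, dtype=float))
    return kde(grid)   # PDF evaluated on the given grid of distance values
```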

Note that the feature dimensionality impacts the robustness of the density estimation. Silverman provides examples for the relation between sample size, dimensionality, and the corresponding mean square error of the density estimate [50].

For the estimated PDFs, simple statistical measures such as the mean and standard deviation (STD) can be extracted and directly convey properties of the input datasets. For instance, the mean value of the intra-set distances corresponds to the diversity of the samples within a dataset, and the mean value of the inter-set distances is a measure of the average similarity of the two input datasets in this feature dimension. The STD value, on the other hand, serves as an indication of the reliability of the mean value.

3.4.3 Kullback–Leibler divergence and overlapped area

In addition to the statistical measures representing intra-set or inter-set distances, similarity measures between distributions are also of interest when evaluating generative music systems. Two metrics are computed: the Kullback–Leibler divergence (KLD) and the overlapping area (OA) of two PDFs. We propose to compute the distance between the target dataset's intra-set PDF and the inter-set PDF.

Fig. 2 Example of the proposed evaluation metrics: measuring the difference of intra-set and inter-set distances by Kullback–Leibler divergence (KLD) and overlapping area (OA)

Although the KLD is the most common measure of how two PDFs diverge from each other, it is unbounded and asymmetric, i.e., \(D_{KL}(A\,\Vert\,B) \not\equiv D_{KL}(B\,\Vert\,A)\); for this reason, we further calculate the OA to provide a bounded measure in the range \([0,1]\).
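
A sketch of both measures, computed from two PDFs evaluated on a common, uniformly spaced grid of distance values, could look as follows; the discretization and normalization are our simplifications.

```python
import numpy as np

def kld_and_overlap(pdf_a, pdf_b, grid, eps=1e-12):
    """Sketch of the two similarity measures between an intra-set distance PDF
    (pdf_a, the target) and an inter-set distance PDF (pdf_b), both evaluated
    on the same grid of distance values."""
    dx = grid[1] - grid[0]                  # grid assumed uniformly spaced
    a = np.asarray(pdf_a, dtype=float) * dx
    b = np.asarray(pdf_b, dtype=float) * dx
    a, b = a / a.sum(), b / b.sum()         # normalize to discrete distributions
    kld = float(np.sum(a * np.log((a + eps) / (b + eps))))   # D_KL(A || B), unbounded
    oa = float(np.sum(np.minimum(a, b)))                     # overlapping area in [0, 1]
    return kld, oa
```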

The above similarity measures can indicate the behavior of the evaluated system, as they compare the similarity of the two input datasets to each other and within themselves. An artificial example is illustrated in Fig. 2, where we calculate the intra-set and inter-set distances among three sets of entries randomly sampled from Gaussian distributions with the same variance but different mean values (Set 1: \(\mu =0, \sigma =1\); Set 2: \(\mu =2, \sigma =1\); Set 3: \(\mu =5, \sigma =1\)). All three sets have practically identical intra-set distances but distinct inter-set distances. Applying the proposed metrics, the smaller KLD and larger OA between the Set 1–Set 2 inter-set distances and the Set 1 intra-set distances show that Set 2 is more similar to Set 1 than Set 3 is.
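
The following usage sketch reproduces this artificial example with the helper functions sketched above (intra_set_distances, inter_set_distances, distance_pdf, and kld_and_overlap); the sample sizes, evaluation grid, and random seed are arbitrary choices.

```python
import numpy as np

# Usage sketch of the artificial example in Fig. 2, relying on the helper
# functions sketched above (intra_set_distances, inter_set_distances,
# distance_pdf, kld_and_overlap).
rng = np.random.default_rng(0)
set1 = rng.normal(0, 1, size=(200, 1))
set2 = rng.normal(2, 1, size=(200, 1))
set3 = rng.normal(5, 1, size=(200, 1))

grid = np.linspace(0, 12, 500)
target = distance_pdf(intra_set_distances(set1), grid)        # Set 1 intra-set PDF
for name, other in [("Set 2", set2), ("Set 3", set3)]:
    inter = distance_pdf(inter_set_distances(set1, other), grid)
    kld, oa = kld_and_overlap(target, inter, grid)
    print(name, "KLD:", round(kld, 3), "OA:", round(oa, 3))   # Set 2: smaller KLD, larger OA
```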

Table 2 Experimental result of data set evaluation (see Sect. 4.1)

4 Use-case demonstration and discussion

Three experiments are conducted to demonstrate the value of the proposed analysis of musical characteristics:

1. Exp. 1—Dataset evaluation: the analysis of datasets is one of the fundamental steps of a data-driven experiment. In this experiment, we evaluate (the differences between) two datasets from different music genres and discuss how the results could inform the developer of a generative system.

2. Exp. 2—System comparison: as mentioned above (see Sect. 2.1), the comparison between two generative systems is a common approach in subjective evaluation experiments. In this experiment, we evaluate two music generation systems and compare the results with the summative answers from a subjective evaluation of these systems.

3. Exp. 3—Performance evaluation: a typical problem after prototyping a generative system is the parametrization of the system. This experiment is an example of the typical usage of the objective evaluation method. We discuss how parameters can influence the result of a generative system by comparing the generated samples with the training dataset.

4.1 Experiment 1: Dataset evaluation

Musical style is defined by a set of musical characteristics. Due to the complexity of musical content, observing the style and properties of a music dataset can be a major challenge. This experiment aims to demonstrate how the proposed approach can characterize data from two different music genres and provide insights into genre-specific properties.

Fig. 3 Example of absolute measurement: (a) average pitch class transition matrix (PCTM) and (b) average note length transition matrix (NLTM) of the jazz and folk music datasets (see Sect. 4.1)

4.1.1 Input datasets

The two chosen genres are folk and jazz. The folk music dataset consists of Irish tunes collected from Henrik Norbeck's ABC Tunes website [23]. The jazz music dataset comprises jazz lead sheets from both the Wikifonia database [51] and publicly available jazz solo transcriptions collected by Mason et al. [8].

The folk and jazz music datasets contain 2351 and 392 entries, respectively. A pilot experiment was carried out to determine the necessary number of samples. The experiment was then executed with 100 randomly selected songs from each dataset; of these songs, only the first 8 bars are considered.

4.1.2 Analysis and discussion

Table 2 lists the results for both the intra-set distances and the absolute measurements for features with one dimension. We can make the following observations. First, the higher mean of the intra-set self-distance for nearly all features in the jazz genre as compared to folk indicates that samples in the jazz genre generally have a higher diversity, a result that matches expectation as folk is often based on simple patterns [45] while jazz generally allows more freedom in its musical composition [7]. Second, we observe considerable differences for the absolute measures of features such as note count and average inter-onset-interval.

Figure 3a illustrates the average pitch class transition matrices (PCTM). The folk dataset is more restricted in the usage of certain pitches (i.e., \(\hbox {D}\sharp\), F, \(\hbox {G}\sharp\), \(\hbox {B}\flat\)) and shows a comparatively sparse matrix, whereas in jazz both pitches and pitch transitions tend to have more variety.

We can also observe that the folk music dataset shows a larger mean for features such as the note length histogram (NLH) and the note length transition matrix (NLTM). However, the average NLTMs in Fig. 3b show that the folk dataset again has a sparse matrix compared to the jazz dataset. This implies that the jazz dataset has a higher variety of note length transitions within a song while having a lower diversity of note length transitions across the dataset.

Table 3 Experimental result for characteristic comparison of generation models (see Sect. 4.2)

In data-driven approaches to music generation, the output of the generative system should directly relate to the characteristics of the training dataset. The presented absolute measures allow for a musically intuitive way of highlighting various dimensions of such characteristics. This can help with the critical step of designing a generalizable dataset, possibly from various sources, for training a generative system.

4.2 Experiment 2: System comparison

The second experiment compares MidiNet [60], a generative adversarial network (GAN) for symbolic-domain music generation, with the melody Lookback recurrent neural network (Lookback RNN) of the Magenta project [58]. As discussed in Sect. 3.4, the proposed objective evaluation can assist in studying different model structures and behaviors when the training datasets of both models are available. In some cases, however, the training dataset is inaccessible, as is the case for Magenta's model. We therefore use this scenario to demonstrate how the proposed method can compare the characteristics of different models. We again exploit the intra-set distances and the absolute measurement utilized in the previous experiment. Furthermore, we attempt to relate reported subjective evaluation results to the identified characteristics.

4.2.1 Input datasets

We implement and train the so-called MidiNet “Model 2” [60], below referred to as MidiNet 2, using 526 MIDI tabs with 8 bars each, parsed from TheoryTab (Footnote 2).

The MidiNet model and the publicly accessible pre-trained model of Magenta's Lookback RNN generate 100 samples each. Each sample contains a melody of 8 bars; the first bar is provided by the user while the remaining 7 bars are generated by the models.

4.2.2 Analysis and discussion

The results of Exp. 2 are shown in Table 3. It can be observed that the outputs of the two models differ distinctly in several dimensions such as pitch count, pitch interval, and pitch range; this is shown by the mean values of the inter-set distances being larger than the mean values of both intra-set distances. Furthermore, the absolute measurements NC and PR indicate that MidiNet 2 tends to use more notes and has a higher average pitch range than Magenta's Lookback RNN.

The fact that the outputs of these two systems have been used previously in a subjective study [60, Sect. 5] allows us to compare the subjective results with these objective results. The listening test resulted in comparable ratings for the questions How real and How pleasing the model outputs are; for the question How interesting, however, MidiNet received a slightly higher rating. This interestingness result might be related to the characteristics of higher pitch range, pitch count, and note count that we find in the absolute measures.

Magenta’s RNN, on the other hand, shows a higher mean among the intra-set distances in these features; this somewhat contradicts the result of the subjective test. Therefore, we investigate this issue further by looking into the STD value, as a higher STD might hint at a lower reliability of the mean value. No clear conclusions can be drawn as the limited sample size in the listening test does not allow for more detailed analysis.

Finally, Fig. 4 showcases another visualization of data characteristics. The PDFs of the intra-set distances for the features PCH, PCTM, NLH, and NLTM are shown as violin plots, an intuitive visualization of PDFs. The plot echoes the previous argument: a significantly higher skewness indicates a less diversified intra-set behavior, and a higher STD indicates a lower reliability of the similarity measure.

Fig. 4 Visualization of model characteristics through the PDFs of the proposed intra-set distances (Sect. 4.2)

4.3 Experiment 3: Performance evaluation

The final experiment demonstrates the use case of evaluating a generative system. We compare two parametrizations of MidiNet, “Model 1” and “Model 2” [60]. Both models have identical architectures and share the same training data. The difference between the models is that one does not use the feature matching regularizer (MidiNet 1) while the other does (MidiNet 2). Feature matching is a technique for stabilizing GAN training by encouraging the model to follow patterns within the training data more closely [47].

Table 4 Experimental result for performance evaluation of generation model (see Sect. 4.3)

4.3.1 Input datasets

We randomly pick 100 melodies from the training dataset (see Sect. 4.2: 526 MIDI tabs with 8 bars each) and generate 100 melodies of 8 bars each with the two models. To ensure a fair comparison, the generation is performed with the same setup as in Sect. 4.2, where we provide one bar for priming and let each model generate the 7 continuing bars.

4.3.2 Analysis and discussion

The results of Exp. 3 are shown in Table 4 and Fig. 5. When comparing the generated melodies with the training melodies, the model with active feature matching, MidiNet 2, shows a larger OA and smaller KLD across almost all features. This indicates that feature matching is able to deliver the expected improvement. The intra-set distance metrics show that both models have—compared to the training dataset—a lower mean and standard deviation in most features. This implies that both systems lose some of the variety of the training samples. Rather than using the metrics for a quality ranking, we urge the user to use them as an index of variability. They could also be used to catch, e.g., an extreme case of losing variety referred to as mode collapse in GANs [47]. In this case, the model is only able to generate very similar samples although the training dataset has significant variability.

Figure 5 intuitively identifies the pitch count (PC), note count (NC), and pitch interval (PI) as the features for which MidiNet 2 outperforms MidiNet 1 (the KLD decreases and the OA increases drastically). It also points to features such as the pitch range (PR) and the pitch count per bar (PC/bar) as the dimensions in which both MidiNet models struggle, as indicated by a high KLD. Most importantly, the metrics provide measurements with respect to human-interpretable musical features, allowing the user to easily pinpoint the strengths and weaknesses of different system designs.

Fig. 5 Visualizing the model performance by the proposed KLD and OA metrics (Sect. 4.3)

Fig. 6 An example of the PDFs of the intra-set and inter-set distances (Sect. 4.3)

We can also make one counter-intuitive observation: the KLD for the pitch class histogram feature slightly increases from MidiNet 1 to MidiNet 2 while the overlapping area (OA) becomes larger. This reveals a limitation of the KLD, as visualized in Fig. 6: the PDFs of the intra-set and inter-set distances of MidiNet 2 move toward the training data's intra-set distances; however, the KLD fails to register this improvement. Since the KLD of discrete probability distributions is calculated in an element-wise manner, PDFs with identical shape (as indicated by similar kurtosis and skewness) but shifted along the x-axis (distinct mean values) yield insignificant differences in KLD. As mentioned in Sect. 3.4.3, the calculation of the OA can address this limitation of the KLD. On the other hand, the OA can be misleading when the PDFs vary in their kurtosis but have similar mean values; in this case, the KLD is able to indicate the differences.

5 Conclusion

The evaluation of generative models has been falling behind the system development itself. This is probably due to the challenge of summatively assessing the aesthetics of musical pieces [2], for which human subjective tests are typically unavoidable. Given the challenges of required resources and listening experiment design, we have proposed to address this issue with a formative, objective evaluation for generative music models. This allows for reproducible, reliable, and comparable objective results. It also allows the analysis of large amounts of outputs instead of a small set of hand-picked samples.

The method can be applied to two main tasks: the analysis of characteristics and objective evaluation with interpretable metrics. Given a pair of datasets, features rooted in musical domain knowledge are extracted, providing absolute measures that quantify the characteristics of a dataset in various dimensions. When used as an evaluation metric, the relative measurement allows examining intra-set and inter-set distances with respect to the training and the output data. The statistical analysis of both the absolute measures and the similarity measures serves as a tool for the analysis of quantifiable dataset characteristics. This analysis allows the researcher to draw conclusions about a system's ability to model a certain musical feature of the training dataset, as well as to estimate the variability and stability of different model designs.

We have released the evaluation framework as an open-source toolbox which implements the demonstrated evaluation and analysis methods along with visualization tools; the toolbox is available in an online repository (Footnote 3). Our future work will include extending the current toolbox with additional dimensions (e.g., dynamics) and expanding it toward polyphonic music.