
1 Introduction

Automatic music transcription (AMT) is the task of detecting, by machine, the notes played in a musical piece. The problem comprises several sub-problems, which makes a complete solution hard to find. In this work, we focus mainly on the variant called multi-pitch estimation, which consists of identifying the pitched notes present in a polyphonic musical piece. A common approach is to split the piece into smaller chunks, referred to as frames, and then estimate the pitch(es) present in each frame (see Fig. 1).

Fig. 1. A common approach to the multi-pitch estimation problem. (a) A musical piece is split into frames. (b) The pitch(es) are estimated for each frame.

We apply artificial neural networks (ANNs) to the multi-pitch estimation problem. ANNs have been applied to many types of problems, such as object recognition, image segmentation, speech recognition, text-to-speech synthesis and music transcription [1,2,3,4].

The traditional approach to the AMT problem, especially when ANNs are applied, is to have a single module/network responsible for detecting and transcribing all the musical notes in each frame (see Fig. 2a) [2,3,4]. In this work we use a divide and conquer approach, which translates into using several networks, referred to as classifiers, each one responsible for detecting and transcribing a single musical note (see Fig. 2b). This approach divides the AMT problem into smaller sub-problems that are hopefully easier to solve, possibly boosting the performance of the whole AMT system.

Fig. 2. Representation of (a) the traditional approach to the AMT problem, where a single classifier is responsible for transcribing all musical notes, and (b) the divide and conquer approach, where several classifiers are used, each one responsible for identifying only one note.

As the pitch estimation process is far from perfect, errors are common. Two types of errors may arise: (i) notes that are not present in a frame are identified as present (false positives) and (ii) notes that are present in a frame are not identified (false negatives). To reduce these errors, post-processing methods can be applied. In this work, we propose additional ANNs for that purpose, again following a divide and conquer approach.

Some previous works [5, 6] have already applied the divide and conquer approach; however, none of them presented a comparison with the traditional approach under the same setup, that is, the same dataset and the same techniques. In this work, we compare the divide and conquer paradigm with the traditional one using the same dataset as well as the same artificial intelligence techniques.

The rest of the paper is structured as follows: Sect. 2 describes related work. Section 3 presents our model and Sect. 4 presents and compares the results with other state-of-the-art works. Conclusions and future work are given in Sect. 5.

2 Related Work

Since the first polyphonic music transcription system [7], several approaches have been presented. In 1992, Lea [8] proposed a method that iteratively extracted the predominant peaks. In 2000, Bello and Sandler [9] proposed a simple polyphonic music transcription system based on a blackboard system. In 2003, Klapuri et al. [10] introduced an algorithm based on harmonicity and spectral smoothness. Also in 2003, the non-negative matrix factorization technique was applied to the AMT problem for the first time [11]. In 2004, Marolt [5] introduced the divide and conquer design paradigm to the AMT field, using artificial neural networks. In 2007, Emiya et al. [12] designed a multi-pitch estimation system based on the likelihood maximization principle. In 2008, Yeh [13] proposed a frame-based system to estimate multiple fundamental frequencies of polyphonic music signals. In 2012, Reis et al. [14] were the first to combine genetic algorithms with an onset detection algorithm. In 2016, Leite et al. [6] pioneered the application of Cartesian Genetic Programming to the AMT field, also relying on the divide and conquer paradigm. Also in 2016, Convolutional Neural Networks were first applied to the AMT problem [2], combined with a complex language model to improve their results. In the same year, Kelz et al. [3] proposed a simpler approach to AMT using solely Convolutional Neural Networks. Finally, in 2018, Hawthorne et al. [4] proposed a system that comprises an onset detector and a multi-pitch estimator based on ANNs.

3 Proposed Model

The proposed model is a supervised learning system based on several ANNs, each one responsible for transcribing one musical note, resulting in a total of 88 ANNs per dataset, corresponding to the keys of a grand piano. In this work, we applied classic multi-layer perceptron neural networks, instead of more recent techniques such as those used in deep learning, in order to obtain baseline results. The model comprises three sequential main stages: (1) pre-processing, (2) classification and (3) post-processing. Each stage is explained in more detail in the following sections.

3.1 Pre-processing

The pre-processing stage is responsible for splitting the musical pieces into frames and for converting each frame to the frequency domain using the Fast Fourier Transform (FFT). Although each frame comprises 4096 samples, only the first half of the resulting spectrum is taken into account, since for a real-valued signal the second half of the spectrum mirrors the first.
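The following is a minimal sketch of this stage in Python, using only the standard library. The frame length of 4096 samples comes from the text; the non-overlapping hop, the use of magnitudes as features, and all function names are assumptions made for illustration, not the implementation used in this work.

```python
import cmath
import math

FRAME_SIZE = 4096  # samples per frame, as stated in the text


def split_into_frames(samples, frame_size=FRAME_SIZE, hop=None):
    """Split a list of samples into fixed-size frames.

    The hop size is an assumption: non-overlapping frames are used by
    default, and a trailing partial frame is dropped.
    """
    hop = hop or frame_size
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, hop)]


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def frame_to_features(frame):
    """Return the magnitudes of the first half of the spectrum.

    For real-valued input, bin n-k is the complex conjugate of bin k,
    so the second half adds no new magnitude information.
    """
    spectrum = fft(frame)
    return [abs(c) for c in spectrum[:len(frame) // 2]]


# Usage: a pure tone that falls exactly on bin 100 of a 4096-point FFT.
tone = [math.sin(2 * math.pi * 100 * n / FRAME_SIZE)
        for n in range(2 * FRAME_SIZE)]
frames = split_into_frames(tone)
features = frame_to_features(frames[0])
peak_bin = max(range(len(features)), key=features.__getitem__)
```

Under these assumptions each frame yields 2048 input values for the classifiers. In practice an optimized FFT routine from a numerical library would replace the recursive function shown here.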

Regarding the ANN training process, a key point to consider is the quality of the data. Hence, this stage also applies two additional sequential transformations, used solely on the training set: (i) removal of meaningless data, such as silent frames, and (ii) adjustment of the ratio between frames with and without the note that each classifier should identify, specifically to 20% of frames with the given note and 80% of frames without it (see Fig. 3).

Fig. 3. Transformations applied during the pre-processing stage.
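The two training-set transformations can be sketched as follows. The 20%/80% ratio is taken from the text; the amplitude-threshold silence test, the choice of subsampling (rather than duplicating frames), and the names `is_silent` and `rebalance` are hypothetical and serve only as an illustration.

```python
import random


def is_silent(frame, threshold=1e-4):
    """Hypothetical silence test: every sample magnitude is below a threshold."""
    return all(abs(s) < threshold for s in frame)


def rebalance(frames, labels, positive_ratio=0.20, seed=0):
    """Drop silent frames, then subsample so positives make up `positive_ratio`.

    labels[i] is 1 if frame i contains the classifier's note, else 0.
    Subsampling instead of oversampling is an assumption.
    """
    rng = random.Random(seed)
    kept = [(f, y) for f, y in zip(frames, labels) if not is_silent(f)]
    pos = [p for p in kept if p[1] == 1]
    neg = [p for p in kept if p[1] == 0]

    # Largest number of positives for which enough negatives are available.
    max_pos = int(len(neg) * positive_ratio / (1 - positive_ratio) + 1e-9)
    n_pos = min(len(pos), max_pos)
    n_neg = min(len(neg), round(n_pos * (1 - positive_ratio) / positive_ratio))

    sample = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(sample)
    return [f for f, _ in sample], [y for _, y in sample]
```

Because each of the 88 classifiers targets a different note, the rebalancing would be run once per classifier, with labels derived from that classifier's note.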

3.2 Classification

The classification stage is where the actual transcription process starts. The data resulting from the pre-processing stage is fed into this stage so that the notes can be detected. As already mentioned, we use the divide and conquer approach; thus, 88 classifiers were created, each one responsible for transcribing one note (see Fig. 2b).

Each classifier comprises five hidden layers with 256, 128, 64, 32 and 8 units, respectively, and an output layer with one unit (yes or no output). The hidden layers apply the leaky ReLU activation function, while the output layer uses the sigmoid function. During training, we used the Adam optimizer [15] with a learning rate of \( 1 \times 10^{-6} \) and the cross-entropy loss function. The following optimization techniques were also applied: data shuffling [18]; dropout [16] with a probability of 0.15; and noisy gradients [17] with a probability of 0.70 and a standard deviation of 0.05.
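To make the architecture concrete, the sketch below builds one classifier with the layer sizes given above and runs a forward pass. It is an illustration only: the input size of 2048 (half of a 4096-sample frame), the leaky ReLU slope of 0.01, and the weight initialization are assumptions, and training (Adam, dropout, noisy gradients) is not shown.

```python
import math
import random

HIDDEN_SIZES = [256, 128, 64, 32, 8]  # hidden layers, as given in the text


def leaky_relu(x, slope=0.01):
    # The slope is an assumption; the text does not give a value.
    return x if x > 0 else slope * x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_inputs, seed=0):
    """Create randomly initialized (weights, biases) pairs for each layer."""
    rng = random.Random(seed)
    sizes = [n_inputs] + HIDDEN_SIZES + [1]
    layers = []
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(fan_in)  # assumed initialization scheme
        weights = [[rng.uniform(-scale, scale) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def forward(layers, x):
    """Return the probability that the classifier's note is in the frame."""
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        is_output = i == len(layers) - 1
        x = [sigmoid(v) if is_output else leaky_relu(v) for v in z]
    return x[0]


def count_parameters(layers):
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)
```

With the assumed 2048 inputs, `count_parameters(init_network(2048))` gives 568,049 trainable parameters per classifier, almost all of them in the first layer.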

3.3 Post-processing

As mentioned earlier, post-processing methods can be applied to correct errors from the classification process. In this work, we use three types of post-processing methods, labeled as: (i) step 1 - fix notes duration, (ii) step 2 - fix notes duration according to onsets and (iii) step 3 - fix notes onset. Each type applies the divide and conquer approach, where an ANN is created to post-process one musical note only, resulting in 88 ANNs per post-processing step. This means that the whole post-processing stage comprises \( 88 \times 3 = 264 \) ANNs. In the items that follow, we detail each post-processing type.

  • Step 1 - Fix notes duration

Music is a time-series phenomenon: a given event is closely related to the preceding and/or following ones. The first post-processing step aims at incorporating that sense of time into the transcription process. Thus, each ANN in this step receives as input the output of the corresponding note classifier from the classification stage for a given frame, together with the outputs for some preceding and following frames, and gives a yes or no answer as output. In this way, it assesses whether the middle frame of the sequence contains the specific note or not. See Fig. 4 for an example.

Fig. 4. Example of how a post-processing unit from step 1 works. The 1’s represent frames identified as containing a specific musical note and the 0’s the opposite scenario. The square window around a portion of the input data represents the sequence given to the post-processing unit. The number in the middle of that window represents the frame that the unit is trying to predict. Finally, the number below the output data shows the actual prediction of the post-processing unit.

Note that during post-processing, the square window shown in Fig. 4 slides to the right, one frame at a time, until the last frame; for each sequence contained in that window, an output prediction is given for the middle frame. When finished, the whole set of new predictions constitutes the resulting transcription of the system (see Fig. 5).

Fig. 5. Representation of all the sequences given to step 1 for the previous example. The numbers in red represent wrongly transcribed frames. (a) All the input sequences given to the unit and the resulting predictions. (b) The previous and the new transcription.

It is important to point out that the sequences given to a post-processing unit do not contain binary data (only zeros or ones) but, instead, values between 0 and 1, because the output unit of each classifier uses the sigmoid activation function. However, for ease of understanding, all the examples given in this section represent those values as binary data.
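The sliding-window mechanism of step 1 can be sketched as below. The window length, the zero-padding at the edges of the piece, and the 0.5 decision threshold are assumptions, since the text only mentions "some preceding and following frames". The trained ANN is replaced by a hypothetical stand-in (a median over the window) so that the example is self-contained.

```python
import statistics


def sliding_windows(activations, context=1, pad_value=0.0):
    """Return one window per frame, centred on that frame.

    `context` frames are taken on each side (window length 2*context + 1).
    Padding the edges with zeros is an assumption.
    """
    padded = [pad_value] * context + list(activations) + [pad_value] * context
    return [padded[i:i + 2 * context + 1] for i in range(len(activations))]


def post_process_step1(activations, unit, context=1):
    """Apply a per-note post-processing unit to every window.

    `unit` is any callable mapping a window to a value in [0, 1]; in the
    paper it is a trained ANN, here it is a placeholder.
    """
    return [1 if unit(w) >= 0.5 else 0
            for w in sliding_windows(activations, context)]


# Classifier outputs for one note: a note with a one-frame dropout
# (frame 3) followed by a spurious one-frame detection (frame 8).
activations = [0.1, 0.9, 0.8, 0.2, 0.9, 0.95, 0.1, 0.05, 0.7, 0.1, 0.0, 0.1]
cleaned = post_process_step1(activations, statistics.median)
# cleaned == [0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```

In this toy example the stand-in fills the dropout and removes the isolated detection, which is the kind of correction the trained units are expected to learn from data.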

  • Step 2 - Fix notes duration according to onsets

For further improvement, an additional post-processing step was created (see Fig. 6). This new step is similar to the previous one, since it receives as input a sequence of frames previously transcribed by step 1 and also tries to predict the transcription for the middle frame of that sequence. However, it also receives two additional sequences: one with the original transcription from the classification stage and another based on the output of an onset detection algorithm [19]. An onset is the start time of a musical note.

Fig. 6. Representation of the three different sequences received by step 2 post-processing units.

The rationale for also receiving the original transcription from the classification stage comes from stacked systems [20], where an additional system receives as input the output of the previous step as well as the original input.

Note that the onset detection algorithm applied is not perfect and is not able to distinguish between onsets of different musical notes. Thus, these post-processing units need to deal with problems such as: (i) false and missed onset detections and (ii) onsets of other musical notes.
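One way to assemble the input of a step 2 unit is sketched below. Concatenating the three windows into a single vector, the window length, and the zero-padding are assumptions made for illustration; the text specifies only which three sequences are received.

```python
def window(seq, centre, context, pad_value=0.0):
    """Return the frames around `centre`, padding past the edges of `seq`."""
    return [seq[i] if 0 <= i < len(seq) else pad_value
            for i in range(centre - context, centre + context + 1)]


def step2_input(step1_out, classifier_out, onset_out, centre, context=2):
    """Concatenate the three aligned windows into one input vector.

    step1_out      -- transcription produced by step 1 for this note
    classifier_out -- original output of this note's classifier
    onset_out      -- output of the onset detector (not note-specific)
    """
    if not (len(step1_out) == len(classifier_out) == len(onset_out)):
        raise ValueError("all three sequences must cover the same frames")
    return (window(step1_out, centre, context)
            + window(classifier_out, centre, context)
            + window(onset_out, centre, context))
```

Since the onset sequence is shared by all notes, the same `onset_out` would be passed to each of the 88 units, while the other two sequences are specific to the note being post-processed.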

  • Step 3 - Fix notes onset

To refine our model in terms of onset detection, an additional post-processing step was added. In this step, only the frames predicted as note onsets are targeted. Specifically, for each predicted note onset, these post-processing units decide whether a readjustment is needed or not. Therefore, they can output three possible transformations: SHIFT LEFT, ACCEPT and SHIFT RIGHT. An example of the three possible transformations is shown in Fig. 7.

Fig. 7. Three possible transformations in step 3 of the post-processing stage.

This post-processing unit receives two sequences as input: one with the transcription of the note onset and nearby frames (the previous and following four frames), and a second with the output of the onset detection algorithm used in the previous step. Thus, for the example in Fig. 7, the input data received by this post-processing unit could be as shown in Fig. 8.

Fig. 8. Illustration of the given input data for all three types of possible transformations. (a) Scenario where the note onset should be moved one frame earlier. (b) Scenario where the onset is considered already correct. (c) Scenario where the onset should be moved one frame later.
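A minimal sketch of how the three decisions of step 3 could be applied to a binary transcription is given below. The four-frame context comes from the text; the way an onset is located (a 0-to-1 transition), the padding, and the exact effect of each shift on the frame sequence are our reading of Fig. 7 and should be taken as assumptions.

```python
SHIFT_LEFT, ACCEPT, SHIFT_RIGHT = -1, 0, 1


def find_onsets(transcription):
    """Indices where a binary frame sequence switches from 0 to 1."""
    return [i for i, v in enumerate(transcription)
            if v == 1 and (i == 0 or transcription[i - 1] == 0)]


def step3_input(transcription, onset_out, onset, context=4):
    """Two windows of 2*context + 1 frames centred on the predicted onset."""
    def win(seq):
        return [seq[i] if 0 <= i < len(seq) else 0
                for i in range(onset - context, onset + context + 1)]
    return win(transcription) + win(onset_out)


def apply_decision(transcription, onset, decision):
    """Return a copy of the transcription with one onset moved by a frame."""
    out = list(transcription)
    if decision == SHIFT_LEFT and onset > 0:
        out[onset - 1] = 1   # the note now starts one frame earlier
    elif decision == SHIFT_RIGHT:
        out[onset] = 0       # the note now starts one frame later
    return out               # ACCEPT leaves the sequence unchanged


# Usage with a single note starting at frame 2.
frames = [0, 0, 1, 1, 1, 0]
onset = find_onsets(frames)[0]
earlier = apply_decision(frames, onset, SHIFT_LEFT)   # [0, 1, 1, 1, 1, 0]
later = apply_decision(frames, onset, SHIFT_RIGHT)    # [0, 0, 0, 1, 1, 0]
```

In a full system, the decision passed to `apply_decision` would come from the step 3 ANN of the corresponding note, evaluated on the vector returned by `step3_input`.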

4 Results

In this section, results and a comparison with similar techniques used in other state-of-the-art works are presented. First, the dataset is described, followed by the metrics used for comparing our approach. Then, the results obtained are shown, and finally, a comparison with other research works is given.

4.1 Dataset

To be able to compare our approach with existing ones, we use the Configuration 1 dataset from [2], based on MAPS [21]. This dataset comprises four folds, each one containing a different combination of musical pieces, with 216 musical pieces in the training set and 54 pieces in the testing set. This means that, for each fold, a transcription system comprising 88 ANNs for the classification stage and 264 ANNs (\( 88 \times 3 \)) for the post-processing stage needs to be created.

4.2 Metrics

We use both frame-based and note-based metrics [22] to evaluate our model. Frame-based metrics evaluate the final transcription frame by frame, whereas note-based metrics evaluate each transcribed musical note by considering its pitch and its onset. For the note onset, we assume a tolerance of ± 50 ms.

We use precision, recall and f-measure for both frame-based and note-based evaluation metrics. Mathematically, these metrics can be expressed as:

$$ \text{Precision} \, \left( P \right) = \frac{TP}{TP + FP} \tag{1} $$
$$ \text{Recall} \, \left( R \right) = \frac{TP}{TP + FN} \tag{2} $$
$$ F\text{-measure} \, \left( F \right) = \frac{2 \times P \times R}{P + R}, \tag{3} $$

where TP represents true positives, which are correctly identified frames/notes, FP represents false positives, which are wrongly detected frames/notes, and FN represents false negatives, which are missed frames/notes.
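The metrics above can be computed as in the following sketch. Equations (1) to (3) and the 50 ms tolerance are taken from the text; the representation of a note as a (pitch, onset in seconds) pair and the greedy one-to-one matching of notes are assumptions, and the evaluation procedure of [22] may differ in its details.

```python
def precision_recall_f(tp, fp, fn):
    """Equations (1)-(3); returns 0.0 where a denominator would be zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def frame_counts(reference, estimate):
    """Count TP, FP and FN over two aligned binary frame sequences."""
    tp = sum(1 for r, e in zip(reference, estimate) if r and e)
    fp = sum(1 for r, e in zip(reference, estimate) if not r and e)
    fn = sum(1 for r, e in zip(reference, estimate) if r and not e)
    return tp, fp, fn


def note_counts(ref_notes, est_notes, tolerance=0.050):
    """Count TP, FP and FN over lists of (pitch, onset_seconds) pairs.

    An estimated note is correct if a not-yet-matched reference note has the
    same pitch and an onset within `tolerance` seconds.
    """
    unmatched = list(ref_notes)
    tp = 0
    for pitch, onset in est_notes:
        for ref in unmatched:
            if ref[0] == pitch and abs(ref[1] - onset) <= tolerance:
                unmatched.remove(ref)
                tp += 1
                break
    return tp, len(est_notes) - tp, len(unmatched)
```

For instance, a frame-level result with 3 true positives, 1 false positive and 1 false negative gives precision, recall and F-measure all equal to 0.75.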

4.3 Results and Comparison

The results obtained by our model at each step are presented in Table 1.

Table 1. Results obtained at each step.

From Table 1, we may conclude that the post-processing stage played an essential role in improving the transcription results: the frame-based metrics improved by 13.58% and the note-based metrics by 28.56%. To better evaluate how far our system is from the expected transcription, a portion of the resulting and expected transcriptions of the Prelude in C Major, BWV 846, by J. S. Bach is shown in Fig. 9.

Fig. 9. Portion of the expected and resulting transcription of the Prelude in C Major, BWV 846. (a) Expected transcription. (b) Resulting transcription.

To compare our results, two state-of-the-art works were chosen: [2] and [3]. Both use the same dataset as well as the same type of ANNs. The comparison is shown in Table 2.

Table 2. Comparison with two other state-of-the-art works.

From Table 2, one can conclude that our approach significantly surpasses both works in frame-based metrics, while reaching similar results in note-based metrics.

To assess our approach against more recent artificial intelligence techniques such as Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs), we further compare it to three other systems, taking advantage of the fact that these three systems have also used the same dataset, thus making the comparison feasible. Table 3 encompasses the results from all the approaches.

Table 3. Comparison with works that apply more recent types of artificial neural networks.

Even when our approach is compared with works that implement more recent types of ANNs, it still yields higher frame-based metrics than those systems while reaching comparable results in note-based metrics. This supports the effectiveness of our approach.

5 Conclusions and Future Work

In this paper, we tackled the AMT problem using a divide and conquer approach. The results obtained show that this is a promising path for tackling the AMT problem, since they surpass current state-of-the-art works in frame-based metrics and reach similar values in note-based metrics, even when compared with systems that apply more recent types of artificial neural networks. The use of artificial neural networks as post-processing units proved essential for improving the overall performance of the system. In the future, post-processing units based on artificial neural networks could be compared with traditional statistical methods, such as Hidden Markov Models, in order to understand which performs better.

To conclude, there is still plenty of room for future work. For instance, in the case of the classifiers, other techniques could be used, such as Recurrent Neural Networks or Convolutional Neural Networks. In addition, an improved version of the onset detection algorithm could be used. A possible solution would be to create an additional ANN to filter false positives from the original onset detection algorithm or, alternatively, to create an onset detection algorithm from scratch using deep learning techniques, as some authors propose [23].