Keywords

1 Introduction

Music Information Retrieval uses different algorithms and techniques to extract structural content from musical files. There are mainly two approaches to this, on one hand, we can study an audio signal, like in [9, 10], on the other, we can take a symbolical musical file, like MIDI or MusicXML, and apply pattern recognition techniques to extract harmonic or melodic content, this approach is taken in [3, 5].

Many of the Music Information Retrieval problems addressed revolve around developing genre recommendation systems [7], style imitation with artificial intelligence and machine learning [4], chord recognition and harmonic extraction [12], instrument recognition from an audio file [14]. However there have been few attempts to analyze and extract musical form, among these [8], in which they find internal similarity within an audio file to find sections that relate among themselves and can be used as transitions between them. In [1], they analyze internal similarity of an audio file to create a general thumbnail, that is the minimal amount of music that represents the whole piece, they use this in order to simplify the search in extensive databases.

In symbolical music files analysis, there is a need to apply these kind of similarity analysis to create a system that classifies internal similarity of a musical piece in order to find its musical form. We argue that a system that can extract and identify the formal structure of symbolical music files can be useful to simplify other tasks, like automatic harmonic and chord labeling, removing redundancies in the calculation of repeating or similar sections. It can also be a useful pedagogical tool that can be used in music education contexts to help the student to better understand the concept of form.

In this work, we propose to apply the time series technique Dynamic Time Warping to a symbolical musical file. This algorithm is normally used in audio signals and it’s useful to calculate a measure of how different two signals are by finding the cost of transforming one into another by means of time stretching. The novelty of this work is that instead of working with audio files, we will use the much smaller symbolical representation of music. In order to classify the musical form, we run through all possible sub segments of the piece and compare them in a similarity matrix, similar to the one used in [1], we then group segments with maximal similarity and label them as a new section. The process is repeated until all the maximal similarities have been found. In the present work we apply the algorithm to find the repeating section of a rondo form.

2 Dynamic Time Warping

The DTW algorithm uses dynamic programming to find the optimum alignment between two time series. It does this by calculating the cost to align each point of the first to the second. Afterwards, it takes the minimal path of change needed to transform each point to the other. It’s very useful in cases where we need to compare sequences that are time stretched or transposed. Because of this flexibility we consider that its application to a musical context would bring optimal results.

The following is a brief review of the DTW as presented by Müller in [11]. If we have sequences \(X := (x_1, x_2, . . . , x_N)\) and \(Y :=(y_1, y_2, . . . , y_N)\), the warping path \(p=(p_1,\dots ,p_L)\) is defined as the assignment of \(x_{nl}\) to \(y_{ml}\). The path must satisfy the boundary condition that it always aligns the first elements, and the last elements of the sequences, respectively; it must be monotonic; and only move by unitary steps; also all elements from X and Y must be paired and there can be no repetitions. Müller also defines the total cost as

$$\begin{aligned} c_p(X,Y) := \sum _{l=1}^L c(x_{nl},y_{ml}). \end{aligned}$$
(1)

where c is the local cost, in this case we will use the euclidean distance as a measure of difference. To find the optimal cost we search for the path with the minimum value.

$$\begin{aligned} DTW(X,Y)=\min \{c_p(X, Y ) \,|\, p \,\,\,\text {is a transformation path}\} \end{aligned}$$
(2)

With this we find the minimum cost of the different possible paths of transforming X into Y. This measure will be useful to compare similarity segments within the piece. The system developed in this work uses the Fast DTW [13], which is based on dynamic programming concepts. The DTW measure can be used for query systems by comparing one small segment to different windows of a larger segment, this methodology is used in [6]. In order to apply this to our symbolic musical file, we shall use the sequence of notes given by a MusicXML file and treat them as a time series.

3 Segmentation Matrix

The proposed system was implemented in Python with the library Music21 [2], which is optimal for the parsing and processing of symbolical musical files like MIDI and MusicXML. In order to apply the DTW algorithm we need to prepare the data so it takes the form of a time series. We use the flat functionality in the music21 library to convert the xml file to a linear representation of notes and time offsets. Next, we translate the musical information into a time series of all the notes, thus having a sequence of order pairs giving the time stamp and the pitch of a particular event.

$$\begin{aligned} P= \{(time_i,pitch_i)\} \end{aligned}$$
(3)

Where \(0 \le i\le N\), and N is the total number of notes in the piece to analyze; we will call this: the events list.

As a next step we will create a list of all the possible sub segmentations of the events list, that is all the possible subsets of P in which all consecutive elements from i to j are present.

$$\begin{aligned} Segs =\{U(i,j) \subset P|\,\, \text { if}\,\, i \le k\le j, \implies (t_k,p_k)\in U(i,j) \} \end{aligned}$$
(4)

We will use the notation \(Segs_{i,j}\) to indicate the segment time consecutive notes from element i to j in the events list. We have \(N^2\) different segments, and we need to group them according to their similarity. In order to do this, we will create a similarity matrix in which each entry (pq) will have the similarity measure assigned between \(Segment_p\) and \(Segment_q\), that is

(5)
$$\begin{aligned} a_{p,q}=DTW(Seg_p,Seg_q) \end{aligned}$$
(6)

In this matrix we calculate the DTW similarity measure between all segments and thus can be used to obtain a measure to classify and group similar sections of the piece. The method for computing the DTW similarity value is exactly the same used for acoustic time series as shown by Müller [11], which uses dynamic programming to find the cost of transforming one series to the other, as was explained in the previous section.

4 Musical Tests

For the purpose of testing the system we choose two pieces that have a rondo form, in order to see if the algorithm correctly predicts the repeating pattern of the analysis.

In the Bagatelle in A minor, Fur Elise, from Beethoven we have a rondo structure in which there is a section A at the beginning, and the musical form as a whole is A - B - A - C - A. When we apply the DTW measures to this piece we obtain a clear indication of a repeating section. In this work we show an example of use of the algorithm to find the repeating section of the rondo piece, that is, the A part. In Fig. 1a, we have a plot of time offset to the cost of the current test window. The graph takes an initial window of size 30, and runs through the rest of the piece in steps of 10 notes, obtaining the cost of comparing the test window to each of the other segments, the initial window size and step length were chosen empirically. The program then increases the window size to check if a larger frame would give a better matching result for some sections. The windows change in size from 30 to 100 in steps of 10, Fig. 1 show the costs of different windows sizes. Given the local minimums that appear around offsets 20, 55 and 130 we see that there appears a clear similarity between these sections and the test window. Given that the test window was the initial part of the piece and repeats several times, we will call it section A, that in fact corresponds to the repeating section of the rondo. To analyze if the parts that differ from A are in fact sections B and C, we would need to apply the algorithm starting in the first offset not classified as section A, and search for similarities, this will be addressed in future work, but currently the algorithm shows a good result in classifying one section at a time.

In Mozart’s Piano Sonata No. 11 in A major, K331, movement 3, also known as Rondo Alla Turca, we also have a similar form like in Beethoven, and if we apply the DTW algorithm to the piece we get the cost graph shown in Fig. 1b. We can see that the beginning section of the piece repeats itself at time offsets around 40, 140, and 160. In 140 we have an almost identical similarity regardless of window size, while in the other cases we have some minor variation in the structure, as shown by the higher cost obtained. We will call this repeating section A. Again, to find the rest of the sections we would need to apply the algorithm to the rest of the piece that was not labeled as A, this analysis will be applied in future work.

Fig. 1.
figure 1

Cost analysis of an initial test window in: (a) Fur Elise and, (b) Rondo Alla Turca. The lines show the costs of different window sizes. The bottom line has a window size 30, while the top is 100. The middle lines increase in steps of 10 from the initial window size.

5 Conclusions

The use of DTW to compare internal sections of a musical piece gives us enough flexibility to compare sections of a work by allowing us to compare all the possible combinations of windows sizes, we use this to find section repetitions, variations and changes. The current work presents a way to identify a repeating section, and it will be further developed to find and classify all repeating sections. One of the main challenges to solve is that we don’t have yet a clear maximum value for the window size, so we have to decide what is the cost tolerance of a section and assign a degree of belonging of each part of the score, to the test window. In future work, this will lead to establish some fuzzy belonging functions to help us decide how we should segment the similarity sections given by the DTW algorithm. The current implementation only takes into account the height and time stamp of the note, however, in future work we could implement other kind of metrics to include more parameters, like harmonic distances, which can help to further classify similar parts and sections of a symbolical music file using features others than just the pitch of the notes. Another thing to consider is the range of the costs values, in this version, the cost is given directly by the DTW, but we can see that it can range from hundreds to thousands, it would be a real benefit to find an optimal normalization of values, so we can have a clearer estimation of what the cost is indicating. In conclusion, Dynamic Time Warping gives us a good estimation of how to find internal similarity on a musical work to find its structural form, we foresee that this kind of approaches, complemented with other similarity measures will contribute to future work in Music Information Retrieval research.