1 Introduction

With the immense popularity of music streaming services and the easy accessibility of large repositories of digital music clips, efficient methods for music segmentation and structure analysis have been in demand in many music-related applications such as music understanding, retrieval, and recommendation.

For instance, many works have been proposed for content-based music retrieval and browsing. In content-based music retrieval, music clips are usually retrieved on the basis of the acoustic similarity of the music signals, and feature vectors representing the acoustic characteristics of the music are constructed for measuring the similarity between clips. Typical digital music clips are 3–5 min long and recorded at a 44,100 Hz sampling rate, so extracting multidimensional feature vectors from every sample or sampling window within a clip requires a large amount of storage space. Moreover, searching through the music clips by measuring the similarities between these feature vectors requires an unacceptable amount of computational time.

One simple method to address these problems is to use the average of the feature vectors, which reduces both the dimensionality and the computation time. However, this can lead to poor performance in terms of precision and recall because music features that are distinctive only at certain sampling points can disappear during the averaging step [8]. Another method is to determine the most distinctive segment of a music clip and use its feature vector as a representation of the clip. Such a segment can serve as a summary of the music clip because it is its most representative part, and such segments can be determined automatically by analyzing the music structure.

Music structure analysis (discovery) segments music clips and structuralizes them using signal analysis. In [15], Peeters classified the approaches for music structure analysis into two categories. In the sequence-based approach, the music audio signal is considered as a repetition of sequences of events. Most studies in this category employ the self-similarity matrix of the music clip, which was proposed in [4]. A sequence is defined as a set of successive time instants, and the notion of a sequence is closely related to that of a melody (a sequence of notes) or a chord succession in popular music. Sequences can be observed by tracking the diagonal lines in the self-similarity matrix. In the state-based approach, the audio signal is considered as a succession of states. This approach relies mainly on clustering algorithms: candidate structural boundaries are detected from the checkerboard patterns in the self-similarity matrix, and the music clip is then structuralized by applying a clustering algorithm to the resulting segments.

In our previous work [7], we constructed a self-similarity matrix of music clips using a constant-size window, on the basis of which music segmentation and categorization were performed. Even though its overall performance is good, we observe that there is still room for improvement depending on the music genre. Therefore, in this work, we revise our method to improve the accuracy of music segmentation and structure analysis. The main contributions of this paper are the use of a beat-scale window for constructing the self-similarity matrix and a two-stage segment grouping for more accurate categorization.

The rest of this paper is organized as follows. In Section 2, we present a brief overview of the techniques used for music structure analysis. Sections 3 and 4 present the methods for generating a self-similarity matrix and for segmenting music on the basis of its self-similarity matrix, respectively. In Section 5, a segment categorization method is presented. Section 6 describes the experiments we performed and shows some of the results. In the last section, we conclude the paper and provide some future directions.

2 Related work

Most works on music structure analysis and segmentation involve acoustic signal analysis, in which acoustic features are extracted from the signal to represent its temporal characteristics. By analyzing the pattern of these features within a music clip, we can identify structural properties such as the repetition of motives or melodies.

The music structure can be analyzed more effectively by constructing a self-similarity matrix. The self-similarity matrix, which was initially presented by Foote [4], is a very useful tool for observing the time structure of music and audio. The matrix is obtained in a 2D domain by measuring the cosine similarities between the feature vectors of successive frames of the music signal. The checkerboard patterns in the matrix indicate repetitions, and the music structure can be determined by tracking those patterns. In [5], automatic audio segmentation methods were presented using the self-similarity matrix. For speech and music segmentation in a mixed source, the segment boundaries are extracted by applying the proposed checkerboard kernel and an audio novelty score.

In [2], Cooper et al. proposed a scheme for automatically summarizing music clips. The scheme determines the most representative segment of a music clip by summing the self-similarity matrix over the segments. Furthermore, in [3], they presented a framework for summarizing music by structural analysis. For music summarization, a self-similarity matrix of the music clip is calculated, and the segment boundaries are detected by correlating a checkerboard kernel. In addition, the segments are clustered by matrix decomposition on the basis of the spectral statistics of each segment.

Lu et al. [12] proposed an effective approach for analyzing the structure of acoustic music data and discovering repeating patterns. They focused on the chroma-based features rather than on the timbre feature because the note information is used for the structural analysis. For the representation of accurate melody similarity, they used the constant Q transform in feature extraction and proposed a new similarity measure between the musical features.

In [13], Maddage et al. presented a method for music structure analysis using beat space segmentation. This method can also detect music chords and vocal/instrumental boundaries. For this, regions with similar melody are determined by matching the subchord patterns, and regions with similar content are determined on the basis of their vocal content. The regions with similar melody and content are then used for identifying the music structure.

Peeters [16] presented a method for automatically estimating the structure of music tracks. The author constructed a higher order similarity matrix on the basis of the timbre and pitch-related features. The music segments are detected from the matrix, and a maximum-likelihood approach is used to simultaneously derive its sequence representation and the most-representative segment of each sequence.

Wang et al. [20] presented a method for recognizing repeating patterns in acoustic music signals. Their approach is based on the constant Q transform, which is used to extract the musical note information. By measuring the melody/note similarity and applying an adaptive threshold-setting method, noticeable repeating patterns can be recognized.

In [14], Paulus et al. proposed a method for describing the music structure by segmenting a music clip, grouping the repeated segments, and assigning musically meaningful labels. Three acoustic features were used for describing different aspects of the acoustic signal, and a probabilistic fitness function was used to select the feature with the highest matching score for the input piece. A musicological model consisting of N-grams was employed for labeling the segment groups.

Kaiser et al. [9] presented a method for automatically extracting the musical structure of popular music. In order to segment a music clip into regions of acoustically similar frames, non-negative matrix factorization (NMF) was applied to its self-similarity matrix. Based on the observation that structural parts can be easily modeled over the dimensions of the NMF decomposition, they presented a clustering algorithm that can explain the structure of the music clip.

In [18], Serrà et al. proposed an unsupervised method for detecting music boundaries by using time series structure features. Structure features were obtained by considering temporal lag information and estimating a bivariate probability density with Gaussian kernels. Calculation of differences between consecutive structure features yields a novelty curve whose peaks indicate boundary estimates.

3 Self-similarity matrix

3.1 Musical features

For music segmentation and structure analysis, the relevant musical features should be identified and analyzed. For that purpose, we use two musical features in this work.

  • Timbre: Timbre, also known as tone color, is the unique quality of a sound that enables listeners to distinguish the musical instruments in a music clip. In this work, we use the mel-frequency cepstral coefficients (MFCC) to represent the timbre of the music clip. MFCC is one of the most widely used features in speech recognition and music information retrieval (MIR) [11, 17, 19] and is generally combined with other acoustic features in applications such as genre classification and audio similarity measurement. In particular, we use the first five MFCC coefficients over the texture window (excluding the coefficient corresponding to the DC component), as was done in [19].

  • Chromagram: The chromagram, also known as the pitch class profile, is a 12-dimensional vector that represents the intensities of the 12 semitone pitch classes [6]. The energies of all the semitones over all the octaves are integrated into a single band called a pitch class. The chromagram is effective for generating high-level music descriptions such as melody and chord progression; therefore, it can be used in music structure analysis, particularly for detecting repetitions. In this work, we use this feature to distinguish segments with similar timbres. A feature-extraction sketch is given after this list.
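As a concrete illustration, the two features above could be extracted along the following lines. This is a minimal sketch assuming the librosa library and an illustrative file name; the paper does not specify its extraction toolkit, frame sizes, or hop lengths.

```python
import librosa

# Illustrative file name and sampling rate; not specified in the paper.
y, sr = librosa.load("music_clip.wav", sr=44100)

# Timbre: compute six MFCCs per frame and drop the 0th (energy/DC-like)
# coefficient, keeping the first five as described in Section 3.1.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=6)
timbre = mfcc[1:6, :]                              # shape: (5, n_frames)

# Chromagram: 12-dimensional pitch class profile per frame.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)   # shape: (12, n_frames)
```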

3.2 Preprocessing

When generating a self-similarity matrix, choosing an appropriate window size is very important. For instance, a small window size is effective for capturing the delicate feature transitions required in sequence-based structure analysis. On the other hand, state-based structure analysis prefers a large window size because it requires feature transitions on a larger scale. In addition, the window size is closely related to the time required for generating and processing the self-similarity matrix; for example, a smaller window size results in a larger self-similarity matrix, as shown in Fig. 1. Music tempo is another crucial factor in selecting the appropriate window size. Generally, songs with a faster tempo undergo timbre transitions within a shorter time period. Therefore, a smaller window should be used for songs with a faster tempo to capture the feature transitions more accurately.

Fig. 1 Self-similarity matrices of different window sizes

Generally, the window size is defined in the time domain. However, in the case of music, the same window size can cover different musical ranges depending on the tempo. Therefore, to perform consistent feature transition analysis, we define the window size on the basis of the number of beats using Eq. (1). In this equation, bpm denotes the number of beats per minute of the music clip, calculated using the tempo extraction algorithm described in [10]. In this paper, the window size is 2 beats.

$$ \mathrm{window\ size}=\frac{60}{\mathit{bpm}}\times \mathrm{number\ of\ beats}\times \mathrm{sampling\ rate} $$
(1)
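In code, Eq. (1) is a direct conversion from beats to samples. The sketch below is ours; the function name and the example tempo are illustrative, and the tempo is assumed to come from the tempo extraction step of [10] or any comparable estimator.

```python
def beat_scale_window_size(bpm, num_beats=2, sampling_rate=44100):
    """Window size in samples for a given tempo, following Eq. (1)."""
    seconds_per_beat = 60.0 / bpm
    return int(round(seconds_per_beat * num_beats * sampling_rate))

# Example: a 120-bpm song with the 2-beat window used in this paper
# yields a 1-second window, i.e., 44,100 samples.
print(beat_scale_window_size(120))
```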

The self-similarity matrix is generated on the basis of the feature similarities within a song. In particular, to generate a more distinct self-similarity matrix, we apply principal component analysis (PCA) to the music features, which maximizes the variance of the features and thus emphasizes the differences within a song. More specifically, we calculate the PCA coefficients and the mean vector using the MFCC features of 210 songs. Those songs were collected from Allmusic.com [1] and cover seven popular genres (blues, electronic, jazz, new age, rap, R&B, and rock). At the same time, PCA reduces the dimensionality of the feature vectors from 5 to 3. Figure 2 compares the original self-similarity matrix with the PCA-applied self-similarity matrix; as shown in the figure, the PCA-applied matrix is clearer than the original one.

Fig. 2 PCA effect on the self-similarity matrices
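A rough sketch of this PCA step, assuming scikit-learn; the training feature matrix is a random placeholder standing in for the window-level MFCC features of the 210 songs, and only the 5-to-3 dimensionality reduction follows the text.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the window-level MFCC features (5-dimensional) pooled over
# the training songs; in the paper these come from 210 songs in 7 genres.
train_features = np.random.randn(100000, 5)

pca = PCA(n_components=3)
pca.fit(train_features)             # learns the mean vector and PCA coefficients

# At analysis time, project a song's 5-dimensional windows down to 3 dimensions.
song_features = np.random.randn(400, 5)            # placeholder for one song
projected = pca.transform(song_features)           # shape: (400, 3)
```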

3.3 Self-similarity measurement

As mentioned earlier, the self-similarity matrix can be used to determine the boundaries where the timbre features vary significantly within a music clip. In this work, we construct the matrix by calculating the cosine similarity between all pairs of feature vectors within the music clip. Each entry of the matrix, ssm, is calculated using the following cosine similarity equation

$$ \mathrm{ssm}\left(i,j\right)= \mathrm{similarity}\left({v}_i,{v}_j\right)=\frac{{v}_i\cdot {v}_j}{\left\Vert {v}_i\right\Vert \left\Vert {v}_j\right\Vert } $$
(2)

where v_i and v_j are feature vectors within the music clip. In some cases, the patterns in the matrix are difficult to recognize owing to quick and wide transitions; for instance, the matrix of a song dominated by vocals or acoustic instruments tends to show rough transitions. To smooth the matrix, we apply a two-dimensional median filter to it. Figure 3(a) and (b) show the self-similarity matrix before and after filtering, respectively.

Fig. 3 Filtering effect on the self-similarity matrices
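The matrix construction of Eq. (2) and the smoothing step could be implemented along the following lines, assuming the per-window feature vectors are stacked row-wise; the median-filter kernel size is an assumption, since the paper does not state it.

```python
import numpy as np
from scipy.signal import medfilt2d

def self_similarity_matrix(features, smooth=True, kernel_size=5):
    """Cosine self-similarity matrix (Eq. (2)) with optional 2D median filtering.

    features: array of shape (n_windows, n_dims), one feature vector per window.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)         # guard against zero vectors
    ssm = unit @ unit.T                                # pairwise cosine similarities
    if smooth:
        ssm = medfilt2d(ssm, kernel_size=kernel_size)  # soften rough transitions
    return ssm
```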

4 Music segmentation

4.1 Sum of row-wise standard deviations

To capture significant changes in the successive checkerboard patterns of the self-similarity matrix, we slide a window of 20 samples across the matrix and compute the sum of the row-wise standard deviations within it. Figure 4 shows an example of the sliding window and its standard deviation. The standard deviation of a vector x is defined as follows:

Fig. 4 Sliding window and its standard deviation

$$ \mathrm{SD}(x)={\left(\frac{1}{n}\sum_{i=1}^n{\left({x}_i-\overline{x}\right)}^2\right)}^{\frac{1}{2}} $$
(3)

In addition, the sum of the row-wise standard deviations (SRSD) in the m-th window is calculated as follows:

$$ \mathrm{SRSD}(m)=\sum_{i=1}^N \mathrm{SD}\left( rv\left(m,i\right)\right) $$
(4)

Here, N is the length of the matrix, and rv(m,i) is defined as

$$ rv\left(m,i\right)=\left\{ \mathrm{ssm}\left(i,m\right), \mathrm{ssm}\left(i,m+1\right),\cdots, \mathrm{ssm}\left(i,m+L-1\right)\right\} $$
(5)

where L denotes the window length, and we set it to 20 to cover sufficient transitions between the segments.
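A compact rendering of Eqs. (3)–(5) in NumPy; the variable names are ours, and the window length L = 20 follows the text.

```python
import numpy as np

def srsd(ssm, L=20):
    """Sum of row-wise standard deviations (Eq. (4)) at each window position m."""
    N = ssm.shape[0]
    values = np.zeros(N - L + 1)
    for m in range(N - L + 1):
        rows = ssm[:, m:m + L]                     # rv(m, i) for every row i, Eq. (5)
        values[m] = np.sum(np.std(rows, axis=1))   # per-row SD (Eq. (3)), then summed
    return values
```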

4.2 Segmentation

For music segmentation, we need to detect the boundaries between different musical features. To detect such boundaries, we first locate the local peaks in the SRSD. Since we are interested in conspicuous checkerboard patterns in the self-similarity matrix, we consider only local peaks whose height exceeds a threshold, defined as the average of the SRSD within the music clip. In addition, small, short peaks are removed by applying a moving average filter of length eight. This procedure is illustrated in Fig. 5.

Fig. 5 Music segmentation
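The boundary detection described above could be sketched as follows; the length-eight moving average and the mean-valued threshold follow the text, while the use of SciPy's peak finder and the exact ordering of the smoothing and thresholding steps are our assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def detect_boundaries(srsd_curve, smooth_len=8):
    """Locate segment boundaries as prominent local peaks of the SRSD curve."""
    # Remove small, short peaks with a moving average filter of length eight.
    kernel = np.ones(smooth_len) / smooth_len
    smoothed = np.convolve(srsd_curve, kernel, mode="same")
    # Keep only local peaks whose height exceeds the clip-wide SRSD average.
    threshold = np.mean(smoothed)
    peaks, _ = find_peaks(smoothed, height=threshold)
    return peaks                                   # window indices of the boundaries
```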

5 Segment categorization

In this section, we describe the two-stage categorization method for feature-based music segment classification. In the first stage, the segments with similar timbre are grouped together using the self-similarity matrix. In the second stage, the segments in each group are subgrouped according to their chromagram sequence. The overall steps for categorization are illustrated in Fig. 6.

Fig. 6 Two-stage categorization method

5.1 Categorization using the self-similarity matrix

After segmentation, the music clip is divided into a few segments. However, such segmentation does not carry any structural information. Music structure, such as repeating segments or neighboring similar segments, can be identified by comparing the segments through the self-similarity matrix. Figure 7 shows an example of such a comparison. To compare the 3rd and 5th segments, the average of the similarities in the intersection area in the vertical and horizontal directions is calculated. When the average similarity is high, we assume that the two segments are very similar and belong to the same state. On the basis of this intuition, we perform categorization starting from the segment pair with the highest similarity and proceeding toward the pair with the lowest similarity; each segment is assigned to the category of its highest-similarity pair until all the segments are categorized. As a result, a music clip typically has three to five categories, as shown in Fig. 8.

Fig. 7 Similarity measurement among the segments

Fig. 8 Categorization using the self-similarity matrix
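A simplified sketch of this first categorization stage. The pairwise segment similarity is taken as the average of the self-similarity values over the block spanned by the two segments, and the greedy merging below, including its stopping threshold, is an illustrative rendering rather than the exact procedure of the paper.

```python
import numpy as np

def segment_similarity(ssm, seg_a, seg_b):
    """Average similarity between two segments given as (start, end) window indices."""
    (a0, a1), (b0, b1) = seg_a, seg_b
    return np.mean(ssm[a0:a1, b0:b1])

def categorize(ssm, segments, threshold=0.8):
    """Merge segment pairs from the highest to the lowest similarity (illustrative threshold)."""
    labels = list(range(len(segments)))            # every segment starts in its own category
    pairs = [(segment_similarity(ssm, segments[i], segments[j]), i, j)
             for i in range(len(segments)) for j in range(i + 1, len(segments))]
    for sim, i, j in sorted(pairs, reverse=True):
        if sim < threshold:
            break
        old, new = labels[j], labels[i]
        labels = [new if lab == old else lab for lab in labels]   # merge j's group into i's
    return labels
```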

5.2 Categorization using chromagram

Categorization on the basis of the self-similarity matrix considers only the timbre feature for segment grouping; as a result, segments with similar timbre are categorized together. In this work, for a more delicate categorization, we additionally compare the chromagram sequences of the segments using the cross-correlation function. Figure 9 shows the detailed categorization steps, and Fig. 10 shows an example of the chromagram sequence comparison.

Fig. 9 Categorization using chromagram

Fig. 10 Example of chromagram comparison
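A sketch of how the chromagram sequences of two segments might be compared with the cross-correlation function; zero-meaning each pitch class and normalizing by the overall energy are our assumptions, since the paper does not fix these details.

```python
import numpy as np

def chroma_similarity(chroma_a, chroma_b):
    """Cross-correlation based similarity between two chromagram sequences.

    chroma_a, chroma_b: arrays of shape (12, n_frames); the lag giving the
    largest correlation summed over the 12 pitch classes is taken as the score.
    """
    # Zero-mean each pitch class so the score reflects the harmonic/melodic shape.
    a = chroma_a - chroma_a.mean(axis=1, keepdims=True)
    b = chroma_b - chroma_b.mean(axis=1, keepdims=True)
    corr = sum(np.correlate(a[k], b[k], mode="full") for k in range(12))
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.max(corr) / max(norm, 1e-12))
```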

6 Experimental results

6.1 Segment boundary hit accuracy

For music segmentation, we collected 50 popular songs covering various genres. After listening to each song several times, the subjects were asked to define the boundary set bH, which marks music structural changes, such as the transition from verse to chorus or bridge, as well as mood changes perceived by the subject. Another boundary set bA is produced by our proposed method. By comparing the two boundary sets bH and bA, we estimate the effectiveness of our method. Here, two boundaries are considered the same when they differ by less than 3 s. The average precision and recall are summarized in Table 1.

Table 1 Average segmentation accuracies for 50 songs
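The hit criterion and the resulting precision and recall can be sketched as follows; the 3 s tolerance follows the text, and the counting scheme is the standard one for boundary retrieval.

```python
def boundary_precision_recall(detected, reference, tolerance=3.0):
    """Precision and recall of detected boundaries (in seconds) against annotations."""
    hits = sum(any(abs(d - r) < tolerance for r in reference) for d in detected)
    recalled = sum(any(abs(r - d) < tolerance for d in detected) for r in reference)
    precision = hits / len(detected) if len(detected) else 0.0
    recall = recalled / len(reference) if len(reference) else 0.0
    return precision, recall

# Example for one clip; the boundary times in seconds are illustrative only.
print(boundary_precision_recall([12.1, 45.0, 80.3], [11.0, 46.5, 75.0, 100.0]))
```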

6.2 Comparison of the segmentation methods

In this experiment, we compare the proposed segmentation method (SRSD) with the CBK method. The proposed method uses the sum of the row-wise standard deviations of the self-similarity matrix, whereas the CBK method uses a checkerboard kernel applied to the self-similarity matrix for segmenting the music clips. Figure 11 compares the two methods on the 50 popular songs. As shown in the figure, the proposed method outperforms the CBK method in both accuracy and computation time.

Fig. 11 Comparison of the two segmentation methods

6.3 Comparison of the categorization methods

To evaluate the effectiveness of the two-stage categorization method, the 50 popular songs were categorized using (i) the self-similarity matrix only and (ii) the self-similarity matrix together with the chromagram, and the categorization results were presented to ten subjects for evaluation. The subjects listened to all the segments in each category of the 50 songs and judged whether those segments were really similar. Figure 12 shows the results of the user evaluation. As mentioned earlier, the songs cover seven genres, and we average the categorization accuracy by genre because the music structure differs depending on the genre. Overall, the average categorization accuracy is 74.86 % for single-stage and 80.4 % for two-stage categorization.

Fig. 12 Performance of the two categorization methods

7 Conclusion

In this paper, we proposed a method for music structure analysis on the basis of segmentation and categorization. Music segmentation was performed using a self-similarity matrix, which is constructed by calculating the similarities among the timbre features. For segment categorization, we presented a two-stage categorization method, which uses the timbre and chromagram features of the music clips. For evaluating the performance of our method, we carried out experiments for songs of seven genres. The experimental results show that our method achieves reasonable performance with regard to music segmentation and categorization.

A few important issues still need to be addressed. Our scheme basically belongs to the state-based approach. Even though this approach is computationally less expensive and more robust than the sequence-based approach, it is less suited to delicate structure analysis. Future work involves combining the sequence-based and state-based approaches to facilitate more delicate structure analysis at a lower computational cost. We also wish to further investigate how to apply our method to music retrieval and recommendation.