Keywords

1 Introduction

The contemporary personalized ways of music processing techniques are changing the nature of music dissemination allowing ardent music lovers to get access to large music collections widespread over genres and artists. Thus, various researchers have contributed in distinct ways for the growth of music dissemination leading to the invention of MIR systems. The MIR research is exploring new technique referred as QBH system for music database search using humming a short music excerpt.

So far, most of the QBH research efforts [17] have focused on building systems based on dynamic programming [7], hierarchical filtering [1], iterative deepening melody alignment [2, 6], time series alignment [3], dynamic time warping (DTW) [4], wavelet transforms, and envelope transforms [5] approaches. Although these techniques exhibit good experimental performance, there is a need of consistency estimation in understanding the nature of a music signal. Also to conquer the inadequacy of using big knowledge base, music indexing is being used.

The growing scientific and commercial interest in audio and music processing is seen recently in the works of various MIR applications. They are based on pitch [8, 9], rhythm [10], and note [5] estimation. In this paper, we are proposing new music melody estimation approach based on multiscale entropy (ME). Also we present, Kd tree-based music melody database indexing structure for efficient and reduced space searching.

This paper is organized as follows. In Sect. 2, comprehension of related contributions is presented. The brief view of the proposed QBH system is described in Sect. 3. The entropy estimation procedure, Kd tree-based indexing, query processing, and searching are presented in Sects. 4, 5, and 6, respectively. The results and discussions are portrayed in Sect. 7. Finally, we draw the conclusions in Sect. 8.

2 Comprehension of Related Contributions

The music signals often exhibit complex structure of regularity. Through the regularity estimation, it is possible to perceive the characteristics of the music signals. In the past, several applications are developed based on regularity estimation through entropy such as heart rate analysis [11], music beat tracking [12], traffic flow analysis [13], human gait dynamics [14], and measuring predictability in networks [15].

The effectiveness of these applications motivates us to exploit the potential of regularity estimation techniques for developing QBH systems. Several methods [12, 1619] on entropy estimation have been developed to detect specific pattern in music signals. The task of beat tracking is popularly used mechanism in music information retrieval (MIR). Davies and Plumbley [12] have explored a new method for the evaluation of beat tracking systems.

Another work [16] focuses on music retrieval based on emotion recognition. The system enhances maximum entropy with Gaussian and Laplace priors to model features for music emotion recognition. Authors were successful in getting better results for maximum entropy with priors. An interdisciplinary research [17] was proposed to explore the concept of entropy in jazz improvisation. In order to measure the improvisation, they adapted standard melodic, harmonic, and rhythmic entropy analysis techniques.

Subsequently, the research work [18] addresses matching of music performances of the same piece of music, making it suitable for monitoring applications. The entropy-based audio finger printing was employed to represent music in the form of symbolic strings. The system was able to correctly identify Pop music with DTW, edit distance (ED), and longest common subsequence (LCS) distance measures in less than a second per comparison.

An information-based approach [19] was proposed for meaningful musical events analysis and recognition. The entropy derived from a predictive model of music corresponds to important feature of music. Also, neural network model was employed to estimate instantaneous entropy for music. Even the relationship between listener’s subjective sense of tension and entropy is considered in the work.

The majority of the music processing systems have focused on developing holistic single sample entropy-based techniques for capturing regularity in the music signals. Nevertheless, they cannot capture the long-term structures and large amount of information present in the music signals. In regard to this disadvantage, we are proposing MME technique, which uses single sample entropies of a music signal at multiple scales to uncover several types of structure.

3 Proposed Query by Humming System

In order to illustrate the proposed system of QBH, let us consider distinguished training and testing phases. Each of these phases perform different operations on the input signal such as preprocessing, framing, entropy estimation, Kd tree-based indexing, and searching. Through the simplified block diagram of a general system as shown in Fig. 1, the proposed QBH system’s steps are discussed.

Fig. 1
figure 1

Block diagram of the proposed QBH system

3.1 Preprocessing

Preprocessing is applied on the MP3 songs database to remove nonessential information, for down sampling and converting to mono channel [20].

3.2 Framing

In framing, the music signal is split into several frames, such that each frame is analyzed in short time instead of analyzing the entire signal at once [20]. In our experiments, the frame length is set to 25 with 10 ms overlap between two adjacent frames.

4 Entropy Estimation

The music signal data points often exhibit complex fluctuations containing information about the underlying regularity or consistency. Hence an important problem in music processing is determining whether the music signal has a regularity component. In this section, we discuss the entropy estimation procedure which can be quite effectively used to quantify the amount of regularity.

4.1 Shannon Entropy

Shannon’s information theory defined a quantitative measure of information associated with an event A, which is a set of outcomes of some experiment. If P(A) is the probability of an event A, then information associated with A is given by

$$ i(A) = \log_{b} \frac{1}{P(A)} = - \log_{b} P(A) $$
(1)

So, the amount of information associated with low probability event is more and vice versa. If we have a set of independent events A i , which are sets of outcomes of experiment S, then the average information associated with the experiment is given by

$$ H (S )= \sum\limits_{i = 1}^{N} {P\left( {A_{i} } \right)i\left( {A_{i} } \right) = } - \sum\limits_{i = 1}^{N} {P\left( {A_{i} } \right)\log_{b} P\left( {A_{i} } \right)} $$
(2)

Here, the quantity H(S) is an information entropy or Shannon entropy of S. So entropy is defined as measure of mean information content or information consistency or information regularity [21]. The choice of the logarithm base in Eq. (2) determines the unit of information. In this work, we have used the bits units obtained by the use of logarithm base 2.

4.2 Multiscale Entropy

Conventional single scale entropy measures, purely quantify the regularity of data samples. Significant structural richness or complexity which assimilates correlations over multiple spatiotemporal scales can be derived through multiscale analysis [22]. So, ME is adopted as a new method for measuring the complexity of data samples.

The Fig. 2a describes the pictorial representation of coarse graining baseline procedure for scale 1, 2, and 3. For a given data set, \( X = \left\{ {x_{ 1} ,x_{ 2} ,x_{ 3} , \ldots ,x_{N} } \right\}, \) multiple coarse grained sequences are constructed by averaging the data points within nonoverlapping windows of increasing length. Each element of the coarse grained sequences, \( y_{j}^{\delta } , \) is generated according to the equation:

$$ y_{j}^{\delta } = \frac{1}{\delta }\sum\limits_{i = (j - 1)\delta + 1}^{j\delta } {x_{i} } $$
(3)

where δ represents scale level 1–l and 1 <= j < = N/δ. The length of the coarse grained sequence is M = N/δ. For scale level δ = 1, the coarse grained sequence is same as the original sequence x i where i = 1–N.

Fig. 2
figure 2

a Schematic illustration of the coarse graining procedure for scale 1, 2, and 3. b MME estimation procedure

We then render the entropy measurement e δ of each coarse grained sequence \( y_{j}^{\delta } \) of length M for scale level δ = 1–l as defined in the following equation:

$$ e_{\delta } = - \sum\limits_{j = 1}^{M} {P\left( {y_{j}^{\delta } } \right)} \log_{b} P\left( {y_{j}^{\delta } } \right) $$
(4)

where \( P\left(y_{j}^{\delta }\right) \) indicates probability of data point y j at scale level δ and b is logarithm base. In this work, we have used the bits units obtained by the use of logarithm base 10. Now, ME is defined as the feature vector of entropies e δ for scale levels δ = 1–l as shown in Eq. (5):

$$ {\text{ME(}}X ,\delta )= \mathop {II}\limits_{\delta = 1}^{t} e_{\delta } $$
(5)

where X is an original data set.

4.3 Multiscale Music Entropy

In this section, we reproduce the analysis of ME with reference to music signals. Given a music signal database

$$ D = \left\{ {S_{1} ,S_{2} , \ldots S_{R} } \right\} $$
(6)

and a query Q, the aim is to find all the music signals in D that contain the specified query Q as well as feature vector similar to that of Q. One dimensional music signal S is defined as a sequence of values:

$$ S = \left[ {s_{1} ,s_{2} , \ldots s_{N} } \right] $$
(7)

where N is the number of samples in S. Primary objective is to represent music melody in the form of ME.

Each music signal S is split into several frames, as defined in Eq. (8), so that it is possible to analyze each frame in short time.

$$ S = \left[ {f_{1} ,f_{2} , \ldots f_{k} } \right] $$
(8)

where \( f_{ 1} = \left[ {s_{ 1} ,s_{ 2} , \ldots s_{ 2 50} } \right],\;f_{ 2} = \left[ {s_{ 2 5 1} ,s_{ 2 5 2} , \ldots s_{ 500} } \right],\; \ldots f_{k} \). So, for every frame f, MEf i, feature vector of entropies e δ for scale levels δ = 1–l are constructed by following the procedure discussed in Sect. 4.2. Same is defined in Eq. (9):

$$ {\text{MEf}}_{i} = \mathop {II}\limits_{\delta = 1}^{t} e_{\delta } $$
(9)

where i = 1–k.

Finally feature vectors, MEf, of frames, f 1 − f k , are concatenated to form the feature vector of one music signal referred as MME, expressed through equation as:

$$ {\text{MME}} = \mathop {II}\limits_{i = 1}^{t} {\text{MEf}}_{i} $$
(10)

So, this process is repeated for all music signals and knowledge base of MME is built during training phase. MME construction system for database D is depicted in Fig. 2b and steps are shown in Algorithm 1.

5 Kd Tree-Based Indexing

The Kd tree is a binary tree, which iteratively splits the entire input space into partitions. Each internal node represents splitting condition and leaves of this node denote subsets of the partition. Thus, the root node represents the entire input space, while the leaf nodes explicitly record the data points of smallest partitions [23]. The tree construction process becomes accustomed to the local characteristics of input space and the sizes of partitions at the same phase are not necessarily equal.

Our implementation employs the multiscale music entropy-balanced (MMEB) Kd tree for music database indexing. The tree indexing utilizes the MME statistical properties of music data for efficient and accurate searches. The MMEB Kd tree is an extension of statistical Kd tree which optimizes the multidimensional decision tree based on balancing split decisions. This balancing policy is adopted during tree building process so that the average and worst case search depth is reduced.

To construct a tree from a MME knowledge base, top down recursive procedure is employed. The common variation of splitting a subset of partition is the median point µ of the dimension d. This method of splitting guarantees a balanced tree and leads to reduced search depth. The cost of building a tree from N data points is O (NlogN). The steps of building MMEB Kd tree are shown in Algorithm 2.

6 Query Processing and Searching

Query processing basically involves query preprocessing, framing, and entropy estimation as discussed in Sects. 3 and 4. The melody database is a set of MME feature vectors indexed by MMEB Kd tree. The user hums different proportions of the query melody which is segmented into frames of 25 ms. Subsequently, each frame is resolved into query multiscale music entropy (QMME) feature vector.

Query vector QMME is searched in MMEB Kd tree using K-Nearest Neighbor (K-NN) algorithm to find those music signals whose MME feature vectors are close to QMME. Tree is recursively searched from the root to leaves. In concurrence with neighborhood range, search is carried out to locate those branches, which are close to the branches where the query resides. The cost of searching MMEB Kd tree with N data points is O (logN). The steps of K-NN searching are shown in Algorithm 3.

7 Results and Discussions

In order to evaluate the proposed method, several experiments are conducted with datasets and interesting trends are observed in the performance. The performance evaluation is done through standard yardsticks such as mean reciprocal rank (MRR), mean of average (MoA), and Top X Hit Rate.

The training database chosen for the experiments consists of 1,000 Indian devotional monophonic songs. These are further segmented to 1,350 fragments. Then a subset of 100, 200, 500, 800, 1,000, and 1,350 fragments are employed during training phase. The query database consists of total 200 hums from ten participants. The humming query is searched in the music database and QBH system’s performance analysis is evaluated with different metrics such as MRR, MoA, and Top X Hit Rate [6, 7, 24].

7.1 Performance Analysis

MRR metric estimates any system which generates list of possible responses to a query. The probability of the target song coming in first position is estimated through MRR [20]. In the proposed strategy, MRR is obtained in the range 51.50–69.72 % for different scale levels. Also the performance of the system in terms of MRR goes on increasing with an increase in the scale level as depicted in Fig. 3. This is an indication that system’s potential ability of discriminating songs increases with higher scale levels of MME estimation.

Fig. 3
figure 3

QBH system’s performance analysis with MRR, MoA, and Top X Hit Rate

Similarly mean of accuracy (MoA) metric estimates the mean or average rank at which desired song is found for every query. The proposed strategy yields MoA in the range 59.23–45.32 % with scale levels one to ten. Experimental results demonstrate that MoA decreases with higher scale levels of MME estimation as depicted in Fig. 3. This is an indication that mean or average rank of the retrieved song decreases with increase in scale levels of MME estimation.

The third performance measure the Top X Hit Rate is employed to calculate approximately the percentage of successful queries. The different scale level of MME estimation procedure has significant effect on Top X Hit Rate and same is portrayed in Fig. 3. The Top X Hit Rate varies between 71.91 to 89.90 % for X value 10 and 1–10 scale levels of MME estimation. More humming queries return the desired song in the top 10 list when the scale level of MME estimation is increased from 1 to 10. The proposed QBH system based on MME exhibits significant performance in terms of accuracy, MRR and MoA metrics.

8 Conclusions

This paper is an attempt to exploit the advantages of Multiscale Entropy Analysis technique for estimating regularity and complex structure of music signals. Here, MME feature vectors are employed to represent music melody for music search through QBH system. Also, music melody is indexed through MMEB Kd tree to support fast and accurate search in the database to produce effective music retrievals. Proposed approach yields sensible performance by returning the desired song within the top 10 hits 89.90 % of the time and as the top hit 69.72 % of the time on a database with 1,000 songs and 1,350 fragments. Exhaustive exploration of the possibility of combining Multiscale Entropy Analysis techniques is to be considered in future work.