
1 Introduction

Content-based online music systems are being developed and revamped to keep up with user expectations for search and browse functionality. Collectively, these approaches constitute Music Information Retrieval (MIR) systems, which have been the subject of extensive research. The goal of MIR research is to develop new theory and techniques for processing and searching music databases by their content. Query by Humming (QBH) is a special branch of MIR and a popular content-based music retrieval method in which the user enters a search query by humming.

Most research on QBH [15] is based on music processing and focuses on components such as melody extraction, melody representation, similarity measurement, database size, and query and search algorithms. The literature strongly supports symbolic melody representations based on zero-crossing detection, energy, the Modified Discrete Cosine Transform (MDCT) [6], pitch contour [5], rhythm [7] and a quantized pitch change descriptor [3]. There is also a substantial body of research [8–10] on similarity measurement for music patterns.

Most of the approaches proposed in the literature are not suited to real-world music retrieval from a large music database. This is likely due either to excessive computational complexity, which leads to long response times, or to performance degradation, which leads to erroneous retrieval results. Striking a balance between computation and performance is the ultimate goal of such retrieval systems. As a result, a few speed-up mechanisms have been proposed for QBH [11–13].

Although extensive literature [1–10] is available on QBH systems, comparatively little work [14–18] addresses the design of filtering procedures. Jang and Lee [15] presented a mathematical analysis of a two-stage Query by Singing/Humming (QBSH) system, which is the first application of Progressive Filtering (PF) to QBSH. In another work, the authors of [19] proposed iterative deepening Dynamic Time Warping (DTW), a special form of PF for speeding up DTW. Multi-phase PF for QBSH, without much design analysis, is presented in [12,16]. The work in [19] also proposes a simplified version of PF with a constant computation time with respect to the survival rate of each comparison stage. However, most of the proposed methods still lack a thorough investigation of their efficiency and effectiveness.

Therefore, in this paper we propose to apply PF using a Multi-Resolution Histograms (MRH) approach to a QBH system in order to improve retrieval accuracy. In real-world music retrieval, the vast majority of songs in the database are irrelevant to any given query, which causes an input imbalance problem. We expect these two techniques, PF and MRH, to be well suited to mitigating the effect of this imbalance. Extensive experiments substantiate the potential of the proposed method for building an effective music retrieval system based on humming input. The rest of the paper is organized as follows. Section 2 gives a brief overview of PF as used for search space reduction. Section 3 discusses the MRH framework for pattern matching in music retrieval systems. Section 4 details the similarity measure strategy for QBH. Section 5 presents and discusses the experimental results, and the last section concludes the paper.

2 Progressive Filtering

The idea behind PF is to apply a series of comparisons, each of which selects a smaller set of songs that is likely to contain the target of the input query. The process is repeated until the final output is a list of songs of appropriate length, say 10 or 20. PF for QBH applies multiple stages of comparison between a query and the songs in the database, using increasingly complex recognition mechanisms on a decreasing candidate pool, so that the correct song remains in the final candidate pool with maximum probability. Intuitively, the first few stages are quick and coarse, so that the most unlikely songs in the database are eliminated, whereas the last few stages are more sophisticated and time-consuming, so that the most likely songs are identified [15].

After each stage of PF, the candidate pool shrinks and the recognition technique becomes more refined and effective. The final output is the set of candidate songs surviving the last stage. The multistage representation of PF is shown in Fig. 1, where the m stages correspond to comparison methods of varying complexity.
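A minimal Python sketch of the cascade described above, assuming each stage is a hypothetical (score function, survival rate) pair in which a larger score means a more likely candidate. `progressive_filter` and its parameters are illustrative names, not part of the original system; the toy stages compare integers rather than songs.

```python
import math
from typing import Callable, Sequence

# Hypothetical stage structure: (score_fn, survival_rate); larger score = more likely target.
Stage = tuple[Callable[[object, object], float], float]

def progressive_filter(query, songs: Sequence[object], stages: Sequence[Stage],
                       final_size: int = 10) -> list:
    """Apply increasingly refined stages to a shrinking candidate pool."""
    pool = list(songs)
    for score_fn, survival_rate in stages:
        # Stage i keeps ceil(n_{i-1} * s_i) candidates, but never fewer than final_size.
        keep = min(len(pool), max(final_size, math.ceil(len(pool) * survival_rate)))
        # Rank by score (descending); sorted() is stable, so ties keep their pool order.
        pool = sorted(pool, key=lambda song: score_fn(query, song), reverse=True)[:keep]
    return pool

# Toy usage: a cheap stage compares "decades", a finer stage compares exact values.
coarse = lambda q, s: -abs(q // 10 - s // 10)
fine = lambda q, s: -abs(q - s)
result = progressive_filter(42, range(100), [(coarse, 0.5), (fine, 0.2)])
print(result)
```

The coarse stage keeps 50 of 100 candidates, and the finer stage ranks only those 50 and keeps 10, mirroring the cheap-then-expensive ordering described above.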

Fig. 1 Multistage representation of progressive filtering

For stage \( i \), the input is the query and the \( n_{i-1} \) songs surviving from the previous stage. The output of stage \( i \) is a reduced set of \( n_{i} = n_{i-1} s_{i} \) candidate songs for the succeeding stage \( i + 1 \). In other words, each stage performs a filtering process that reduces the number of candidate songs by a factor of the survival rate \( s_{i} \). Each stage is characterized by its capability to select the most likely song candidates as the input to the succeeding stage. For a given stage, this capability can be represented by its recognition rate, defined as the probability that the target song of a given query is retained in the output song list of that stage. Intuitively, the recognition rate is a function of the survival rate.
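The pool-size recursion \( n_{i} = n_{i-1} s_{i} \) can be checked numerically. The Python sketch below also makes two assumptions that are not stated above: each stage's recognition rate \( r_{i} \) is taken as the probability of keeping the target given that it reached stage \( i \), so the overall recognition rate is their product; and stage \( i \) is assumed to cost a fixed, hypothetical amount \( c_{i} \) per candidate compared.

```python
import math

def pf_analysis(n0, survival_rates, recognition_rates, unit_costs):
    """Pool sizes, overall recognition rate and total cost of an m-stage PF.

    n_i = n_{i-1} * s_i. Assumptions: recognition rates are conditional
    probabilities (so they multiply), and stage i costs unit_costs[i] per
    candidate it compares (hypothetical cost model).
    """
    sizes = [float(n0)]
    for s in survival_rates:
        sizes.append(sizes[-1] * s)
    overall_recognition = math.prod(recognition_rates)
    # Stage i compares the n_{i-1} candidates that entered it.
    total_cost = sum(n * c for n, c in zip(sizes[:-1], unit_costs))
    return sizes, overall_recognition, total_cost

# Two stages over 1,000 songs: a cheap stage keeping 50 %, then a costly one keeping 10 %.
sizes, recog, cost = pf_analysis(1000, [0.5, 0.1], [0.95, 0.9], [1.0, 10.0])
print(sizes, recog, cost)
```

Under these illustrative numbers the cascade spends 6,000 cost units, compared with 10,000 for running the costly stage on all 1,000 songs, while retaining the target with probability of about 0.855.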

3 Multi-Resolution Histograms

3.1 Essence

Over the past few years, Multi-Resolution Analysis (MRA) has received major attention from researchers in computer graphics, geometric modeling, signal analysis and visualization. It is an important approach for efficiently representing signals at multiple levels of detail, with advantages such as compression, display at different levels of detail, and progressive transmission [20]. The term multi-resolution is used in diverse contexts, such as multi-resolution wavelets, subdivision, hierarchies and multi-grids.

Histograms are a very effective means of data reduction and capture many attributes of the data, such as location, spread and symmetry. It is also possible to decompose a music signal and build histograms on the underlying cumulative data distributions. Histograms approximate cumulative data distributions well while using little space. However, a histogram summarizes only the distribution of values and discards the order in which the values occur. The MRH representation is proposed to improve the discrimination of music data by also capturing positional detail, in support of an effective QBH system. The music signal is recursively decomposed and cumulative histograms are built; together, all these cumulative histograms of a music signal are referred to as its MRH. Precision increases with the number of levels \( l \). Early-phase cumulative histograms carry less music information than later phases, so they are used to provide quick approximate answers to music retrieval queries at the beginning of the search; later search phases with the next MRH levels give better estimates.
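As a sketch of the recursive decomposition, the Python code below assumes (the text does not specify this) that level \( l \) splits the signal into \( 2^{l-1} \) contiguous, equal-length segments and builds one cumulative histogram per segment over shared bin edges. `cumulative_histogram` and `multi_resolution_histograms` are hypothetical helper names. The example illustrates the point made above: a signal and its time-reversed copy have identical level-1 histograms but differ at level 2.

```python
from itertools import accumulate

def cumulative_histogram(values, lo, hi, t):
    """Cumulative counts over t equal-width bins spanning [lo, hi]."""
    width = (hi - lo) / t
    counts = [0] * t
    for v in values:
        k = int((v - lo) / width) if width > 0 else 0
        counts[min(max(k, 0), t - 1)] += 1   # v == hi and out-of-range values are clamped
    return list(accumulate(counts))

def multi_resolution_histograms(signal, levels, t, lo=None, hi=None):
    """Return mrh[l-1] = cumulative histograms of the segments at level l."""
    lo = min(signal) if lo is None else lo
    hi = max(signal) if hi is None else hi
    n = len(signal)
    mrh = []
    for level in range(1, levels + 1):
        # Assumption: level l splits the signal into 2**(l-1) equal-length segments.
        parts = 2 ** (level - 1)
        bounds = [k * n // parts for k in range(parts + 1)]
        mrh.append([cumulative_histogram(signal[a:b], lo, hi, t)
                    for a, b in zip(bounds, bounds[1:])])
    return mrh

signal = [0, 1, 2, 3, 4, 5, 6, 7]
forward = multi_resolution_histograms(signal, levels=2, t=4)
backward = multi_resolution_histograms(signal[::-1], levels=2, t=4)
print(forward)   # level 1: [[2, 4, 6, 8]]; level 2: [[2, 4, 4, 4], [0, 0, 2, 4]]
```

To make histograms of different songs comparable, `lo` and `hi` can be set to the database-wide range rather than each signal's own minimum and maximum.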

In this paper, an MRH-based representation is proposed to approximate a music signal in a way that is invariant to shifting and scaling. The MRH representation detects the existence of a pattern and also supports shape matching. In the early phases of the search, music signals containing the specified pattern are retrieved; the search then continues with shape matching, yielding the set of music signals of interest to the user. The hierarchical MRH framework is shown in Fig. 2, where \( HR_{i} \) denotes the histogram representation at level \( i \).

Fig. 2 Multi-resolution histogram representation

3.2 Mathematical Formulation

A histogram function \( h_{i} \) counts the number of samples that fall into each of a set of disjoint categories known as bins. Thus, if \( n \) is the total number of samples and \( t \) is the total number of bins, the histogram \( h_{i} \) satisfies:

$$ n = \sum\limits_{i = 1}^{t} {h_{i} } $$
(1)

A cumulative histogram function counts the cumulative number of samples in all of the bins up to the specified bin. In particular, the cumulative histogram \( hc_{i} \) of a histogram \( h_{j} \) is defined as:

$$ hc_{i} = \sum\limits_{j = 1}^{i} {h_{j} } $$
(2)
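Eqs. (1) and (2) can be illustrated with a few lines of Python. The sketch assumes half-open bins \( [e_{k}, e_{k+1}) \) with the last bin closed, and it ignores samples outside the edges, so Eq. (1) holds only when every sample lies within the binned range. The helper name `histogram` is hypothetical.

```python
import bisect
from itertools import accumulate

def histogram(samples, edges):
    """Bin counts h_1..h_t for bins [edges[k], edges[k+1]); the last bin is closed."""
    t = len(edges) - 1
    h = [0] * t
    for x in samples:
        if edges[0] <= x <= edges[-1]:   # samples outside the edges are ignored
            # bisect_right locates the bin; x == edges[-1] is clamped into the last bin.
            h[min(bisect.bisect_right(edges, x) - 1, t - 1)] += 1
    return h

samples = [0.2, 0.4, 1.1, 1.5, 1.9, 2.7, 3.0]
h = histogram(samples, [0.0, 1.0, 2.0, 3.0])   # [2, 3, 2]
hc = list(accumulate(h))                       # Eq. (2): [2, 5, 7]
assert sum(h) == len(samples)                  # Eq. (1): n equals the sum of the h_i
assert hc[-1] == len(samples)                  # the last cumulative bin also equals n
```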

Cumulative frequency distributions allow users to approximate frequencies aggregated over multiple bins. There is no standard number of bins, and different numbers of bins reveal different features of the samples. The bin width is chosen according to the data distribution and the objective of the analysis. The number of bins \( t \) is calculated from a recommended bin width \( w \) as:

$$ t = \left\lceil {\frac{\max(S) - \min(S)}{w}} \right\rceil $$
(3)

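where \( S \) denotes the samples to be histogrammed and the brackets denote the ceiling function (rounding up). Equal-sized bin widths are then obtained by dividing the range \( \max(S) - \min(S) \) by the number of bins \( t \).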
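A short Python sketch of Eq. (3) and the equal-width rule that follows it; the function name `bins_from_width` is hypothetical. Because \( t \) is rounded up, the actual bin width \( (\max(S) - \min(S))/t \) can be smaller than the recommended width \( w \).

```python
import math

def bins_from_width(samples, w):
    """Number of bins t (Eq. 3) and equal-width bin edges covering [min(S), max(S)]."""
    lo, hi = min(samples), max(samples)
    t = max(1, math.ceil((hi - lo) / w))      # ceiling, with at least one bin
    width = (hi - lo) / t                     # equal-sized bins spanning the exact range
    edges = [lo + k * width for k in range(t)] + [hi]
    return t, width, edges

t, width, edges = bins_from_width([0.0, 2.5, 10.0], w=3.0)
print(t, width, edges)   # 4 2.5 [0.0, 2.5, 5.0, 7.5, 10.0]
```

Here the recommended width of 3.0 gives \( \lceil 10/3 \rceil = 4 \) bins, so the bins actually used are 2.5 wide.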

The primary objective of our research is to develop search criteria for similarity-based queries over one-dimensional music signals. Such a music signal \( S \) is defined as a sequence of values:

$${S} = \left[ {s}_{1} ,{s}_{2} , \ldots {s}_{N} \right]$$
(4)

where \( N \) is the number of samples in \( S \) and \( s_{i} \) is a vector of values sampled at timestamp \( t_{i} \). Given a music signal database

$$ {D} = \left\{ {S}_{1} ,{S}_{2} , \ldots {S}_{M} \right\} $$
(5)

and a query \( Q \), the aim is to find all the music signals in \( D \) that contain the specified query \( Q \) and have a histogram shape similar to that of \( Q \). MRH are constructed by dividing the range \( [\min_{D}, \max_{D}] \) of the music database \( D \) into \( t \) non-overlapping, equal-sized sub-regions called histogram bins. The histogram \( H_{s} \) is computed by counting the number of data values \( h_{i} \) \( (1 \le i \le t) \) that fall in each histogram bin \( i \):

$${H}_{s} = \left[ {h}_{1} ,{h}_{2} , \ldots {h}_{t} \right]$$
(6)

A cumulative MRH is a mapping that counts the cumulative number of observations in all of the bins up to the specified bin. That is, the cumulative histogram HC s of a histogram H s is defined as:

$$ HC_{s} = \left[ {hc}_{1} ,{hc}_{2} , \ldots ,{hc}_{t} \right], \quad hc_{k} = \sum\limits_{i = 1}^{k} {h_{i} } $$
(7)

MRH at higher levels have greater discrimination power; however, computing the MRH Distance (MRHD) at higher levels is more expensive than at lower levels. Hence, the number of levels must be chosen to balance complexity and precision.

3.3 Proposed Strategy

The MRH construction for database \( D \) is depicted in Fig. 2, and its steps are given in Algorithm 1.

Table 1

4 Similarity Measure Stratagem for Query by Humming

4.1 Multi-Resolution Histograms Distance Measure

In order to recognize the query pattern in the music database, we have attempted to develop a similarity function that considers signal frequency and positional information separately. Given a song \( S \) of the music database \( D \) and a humming query \( Q \), the feature vectors \( H_{{S_{f} }} \) extracted from the song MRH are matched with the query MRH \( H_{{Q_{f} }} \) by means of the MRHD measure:

$$ MRHD\left( {H_{{S_{f} }} ,H_{{Q_{f} }} } \right)\; = \;\sum\limits_{i = 1}^{t} {min} \left( {H_{{S_{i} }} ,H_{{Q_{i} }} } \right)\; \times \;\frac{{\left( {\sqrt 2 - d\left( {H_{{S_{i} }} ,H_{{Q_{i} }} } \right)} \right)}}{\sqrt 2 } $$
(8)

where

$$ d\left( {H_{{S_{i} }} ,H_{{Q_{i} }} } \right)\; = \;\sqrt {\sum\limits_{i = 1}^{t} {\left( {h_{{s_{i} }} - h_{{q_{i} }} } \right)^{2} } } $$
(9)

is a Euclidean Distance function.
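One way to evaluate Eqs. (8) and (9) in Python, under two assumptions that the text does not state. First, the histograms are normalized to unit sum, which keeps \( d \le \sqrt{2} \) and the weight \( (\sqrt{2} - d)/\sqrt{2} \) in [0, 1]. Second, \( d \) is computed once over the whole feature vector, as Eq. (9) is written, so the weight factors out of the sum in Eq. (8). Under these assumptions the measure reduces to a histogram intersection scaled by that weight. The function names are hypothetical.

```python
import math

def normalize(h):
    """Scale a histogram to unit sum (assumption: Eq. 8 expects normalized inputs)."""
    total = sum(h)
    return [x / total for x in h] if total else list(h)

def mrhd(h_song, h_query):
    """Histogram intersection weighted by (sqrt(2) - d) / sqrt(2), per Eqs. (8)-(9)."""
    s, q = normalize(h_song), normalize(h_query)
    # Assumption: d is one Euclidean distance over the whole vector (Eq. 9).
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))
    intersection = sum(min(a, b) for a, b in zip(s, q))   # sum of bin-wise minima
    return intersection * (math.sqrt(2) - d) / math.sqrt(2)

print(mrhd([1, 1], [1, 0]))   # intersection 0.5, d = sqrt(0.5), weight 0.5
```

Identical histograms give a value of 1 and histograms with no overlapping bins give 0.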

4.2 Database Pruning Using Threshold

The MRHD is calculated for every song in the music database using Eqs. (8) and (9). The average MRHD is taken as the upper limit of the threshold and 0 as the lower limit, as shown in Eqs. (10) and (11):

$$ T_{upperlimit} = \frac{1}{M}\left( {\sum\limits_{i = 1}^{M} {MRHD\left( i \right)} } \right) $$
(10)

and

$$ T_{lowerlimit} = 0 $$
(11)

where \( M \) is the number of songs in the database. Unlikely songs are quickly eliminated by comparing the MRHD value of each database song with the threshold range: a song whose MRHD falls outside the range is removed from the pruned database. In other words, a song is purged if it does not satisfy the following condition:

$$ T_{lowerlimit} \le\,MRHD_{S} \le\,T_{upperlimit} $$
(12)

This procedure is carried out at different histogram resolution levels to form the PF. The database pruning rate analysis is depicted in Fig. 4.
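The threshold test of Eqs. (10)–(12), repeated over resolution levels, can be sketched in Python as follows. The text does not say whether the average is recomputed on the surviving songs at each level; this sketch assumes that it is. `score(song, level)` is a hypothetical callback returning a song's MRHD at a given level.

```python
def prune_level(scores):
    """Keep songs whose MRHD lies in [0, mean MRHD] (Eqs. 10-12)."""
    upper = sum(scores.values()) / len(scores)   # Eq. (10), M = len(scores)
    lower = 0.0                                  # Eq. (11)
    return [song for song, v in scores.items() if lower <= v <= upper]   # Eq. (12)

def progressive_prune(songs, levels, score):
    """Apply the threshold at each resolution level to the surviving songs."""
    survivors = list(songs)
    for level in levels:
        if not survivors:
            break
        # Assumption: the mean is recomputed over the current survivors at each level.
        survivors = prune_level({s: score(s, level) for s in survivors})
    return survivors

def pruning_rate(n_before, n_after):
    """Fraction of the database removed by pruning."""
    return 1 - n_after / n_before

toy = [0.25, 0.5, 0.75, 1.0]   # hypothetical MRHD values, identical at every level
kept = progressive_prune(range(4), [1, 2], lambda s, level: toy[s])
print(kept, pruning_rate(4, len(kept)))   # [0] 0.75
```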

5 Results and Discussions

This section evaluates the proposed QBH method, whose relative performance shows several interesting trends. The feasibility of the proposed criteria is substantiated through experiments. Three series of experiments were conducted with the corresponding target and query corpora, varying the number of histogram bins from 100 to 1,000 and the histogram resolution level from 1 to 5. Performance is discussed in terms of error rate, database pruning, Mean Reciprocal Rank (MRR), Mean of Accuracy (MoA) and Top X Hit Rate.

5.1 Target Corpus

The proposed QBH system is designed specifically for Indian songs, so the corpus chosen for this study consists of 1,000 Indian Kannada devotional monophonic MP3 songs. The collection contains songs by 39 singers, 22 male and 17 female. The corresponding training sets are subsets of 100, 200, 500 and 1,000 songs for different experiments. The MP3 songs contain complex melodic information as well as noise, so preprocessing is applied to the MP3 song database to extract the information needed by the system. In songs, the human vocal part, rather than the background music, plays the main role in carrying the melody; it is therefore desirable to separate the two [21].

5.2 Query Corpus

For system evaluation, we employ a monophonic query corpus containing a total of 200 sample queries from ten participants. Each participant was asked to hum the beginning of each target song two or three times. The participants were selected from a variety of musical backgrounds, both with and without considerable musical training, and were instructed to hum each query as naturally as possible using the lyrics of the target songs.

5.3 Error Rate Analysis

Using the query and target corpora described above, the error rate is computed for the QBH system implementation presented in Sects. 2–4. Figure 3 displays the error rate for five histogram resolution levels, with the number of histogram bins in the target database along the horizontal axis and the error rate along the vertical axis. As expected, a larger number of histogram bins yields better performance, and performance degrades as the number of bins decreases.

Fig. 3 Error rate analysis

We observed that a higher number of histogram bins gives a finer-grained approximation of the music signal and therefore better performance; conversely, the error rate increases as the number of histogram bins decreases.

5.4 Database Pruning Rate Analysis

Figure 4 displays the pruning rate of the QBH system across different target database sizes for five histogram resolution levels. The number of histogram bins in the target database is plotted along the horizontal axis and the pruning rate along the vertical axis. The pruning rates for histogram resolution levels 1, 2, 3, 4 and 5 are shown with a solid line, dashed line, small dashed line, dash-dot line and dash-dot-dot line, respectively.

Fig. 4 Database pruning rate analysis

As Fig. 4 shows, the pruning rate is approximately 55 % as the number of histogram bins and histogram resolution levels increase. The first histogram resolution level yields the most robust pruning performance, at around 55 %. As the number of histogram bins increases, the best pruning rate ranges from 39.35 % to 55.21 % across histogram resolution levels. That is, histogram representations with a higher number of bins yield good pruning rates, but they are computationally demanding.

5.5 Performance Analysis

Many different measures have been proposed for evaluating the performance of QBH systems [11,15,18]. These measures require a collection of training and test samples for each test scenario and parameter combination. The Mean Reciprocal Rank (MRR) is defined as:

$$ MRR = \frac{1}{n}\sum\limits_{i = 1}^{n} {\frac{1}{{rank\left( {t_{i} } \right)}}} $$
(13)

where \( n \) is the number of queries and \( rank(t_{i}) \) is the rank of the target song \( t_{i} \) of the \( i \)-th query in the returned list. MRR is a metric for evaluating any system that produces a list of possible responses to a query. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer, and the MRR is the average of the reciprocal ranks over a sample of queries. The reciprocal of the MRR is the harmonic mean of the ranks. In other words, MRR measures how often the system places the target among the first ranks [21]. We obtained MRR in the range 16.41–21.34 % for different histogram resolution levels. As portrayed in Fig. 5, the MRR increases with the histogram resolution level; that is, the frequency with which the target occupies the top five ranks increases as the histogram resolution level increases.
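Eq. (13) and the harmonic-mean relation stated above can be checked with a small Python sketch. The ranks are made-up illustrative values, not results from this study.

```python
import statistics

def mean_reciprocal_rank(ranks):
    """Eq. (13): ranks[i] is the 1-based rank of the target song for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

ranks = [1, 2, 4]   # hypothetical target ranks for three queries
mrr = mean_reciprocal_rank(ranks)
print(mrr)                                        # (1 + 1/2 + 1/4) / 3
print(1 / mrr, statistics.harmonic_mean(ranks))   # both give the harmonic mean of the ranks
```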

Fig. 5 Performance analysis

Similarly for each test scenario and parameter combination the Mean of Accuracy (MoA) is defined as:

$$ MoA = \frac{1}{n}\sum\limits_{i = 1}^{n} {\frac{{n - rank\left( {t_{i} } \right)}}{n - 1}} $$
(14)

The MoA reflects the average rank at which the target was found for each query. We obtained MoA in the range 68.84–83.21 % for histogram resolution levels one to five. Fig. 5 shows that the MoA decreases as the histogram resolution level increases, which indicates that, on average, the target song is ranked lower in the list (i.e., at a larger rank number) at higher histogram resolution levels.
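A Python sketch of Eq. (14). Eq. (14) uses \( n \) both for the number of queries and inside \( (n - rank)/(n - 1) \); this sketch assumes that the latter is the number of songs in the target database, so that a top-ranked target scores 1 and a last-ranked target scores 0. The parameter name `num_songs` is hypothetical.

```python
def mean_of_accuracy(ranks, num_songs):
    """Eq. (14), assuming the inner n is the database size (rank 1 -> 1, last rank -> 0)."""
    return sum((num_songs - r) / (num_songs - 1) for r in ranks) / len(ranks)

print(mean_of_accuracy([1, 6, 11], num_songs=11))   # (1.0 + 0.5 + 0.0) / 3 = 0.5
```

Because each term falls linearly with the rank, a lower MoA means a larger average rank number.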

The Top X Hit Rate is defined as the percentage of successful queries, i.e., queries whose target song is ranked within the top \( X \) results. It can be written as:

$$ Top(X) = \# \left\{ {rank(i):rank(i) \le X} \right\}/N $$
(15)

where \( X \) is the number of top-ranked songs considered and \( N \) is the total number of queries. The impact of the histogram resolution level on the Top X Hit Rate is portrayed in Fig. 5. The value \( X = 10 \) was found to be the best, and at this setting the system obtained a retrieval accuracy ranging from 65.78 % to 78.90 % as the histogram resolution level increases.
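Eq. (15) as a Python sketch, taking the denominator to be the number of queries, consistent with the definition as a percentage of successful queries. The ranks are illustrative values only.

```python
def top_x_hit_rate(ranks, x):
    """Eq. (15): fraction of queries whose target song is ranked within the top x."""
    # Denominator = number of queries, since the rate counts successful queries.
    return sum(1 for r in ranks if r <= x) / len(ranks)

ranks = [1, 3, 12, 25, 7]         # hypothetical target ranks for five queries
print(top_x_hit_rate(ranks, 10))  # 3 of 5 queries hit -> 0.6
```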

Comparing Figs.  3, 4 and 5, the MRH based representations empirically yield relatively better performance in terms of MRR, MoA and Top X Hit Rate.

6 Conclusion

In this work, we have attempted to exploit the advantages of the MRA technique to progressively reduce the search space for QBH applications. In such applications, the initial result set consists of songs that contain specific patterns; subsequent steps perform a relatively slow search in the reduced space to retrieve all songs whose histogram shape matches the query. MRH analysis is employed as a database filtering procedure that supports iterative search in the database to produce effective music retrieval. The results obtained from extensive experimentation are encouraging. Combining equal-area-bin histograms with MRA is left for further investigation.