1 Introduction

The field of Music Information Retrieval (MIR) has fascinated the research community for a long time, and one of its most promising applications is Automatic Music Transcription (AMT). AMT is the process of identifying the notes played by an Instrument from an audio clip. In a music piece, more than one Instrument is usually played at a time, and not all the Instruments are played through the entire length of the piece. It is therefore essential to identify the active regions of the Instruments in a piece before transcription. The challenge of identifying such Instruments increases further when a piece is accompanied by noise. Being able to identify musical Instruments in isolation from noisy clips is an important precursor to identifying them within a piece, and MISNA is a system proposed towards such a task. The main contributions of our work include the use of the proposed lower dimensional features (LPCC-S), derived from standard LPCC values, for minimizing computational overhead and overcoming the uneven dimensionality issue; the use of Extreme Learning Machine based classification, which is a faster alternative to standard neural network based classifiers; experimentation with various levels and types of noisy environments; and verification of the generalization capability of the proposed system for both Individual Instruments and Instrument families using clips of short durations.

2 Related works

Masood et al. [22] identified 5 different Instruments using MFCC and Timbral features with an accuracy of 89.17%. They used a neural network based classifier which was trained using Conjugate gradient back propagation and the Fletcher-Reeves update technique. Patil et al. [25] classified 15 Instruments with an accuracy of 86.04% using an SVM and concept analysis based technique. Eronen et al. [7] used features based on temporal and spectral properties of sound to classify 30 orchestral Instruments from the bass, string and woodwind families and obtained an accuracy of 94% in identification of the correct Family. A system to identify 7 different Instruments was presented by Sturm et al. [30] using multiscale MFCC based features. A highest accuracy of 84.69% was obtained for the system using an SVM based classifier. Martin et al. [21] used a statistical pattern recognition based approach to identify 15 different orchestral Instruments using acoustic features related to physical properties of source excitation and resonance structure. Accuracies of 90% and 70% were obtained for Instrument Family and Individual Instrument identification respectively using Gaussian models and Fisher multiple discriminant analysis. Takashi et al. [31] designed a system to identify 12 musical Instruments using zero crossing, pitch, brightness and spectral centroid based features. They obtained highest average accuracies of 82.1% and 56.2% for the University of Iowa musical Instrument database and the RWC music databases using Random Forest and Linear Discriminant Analysis techniques respectively. A system to classify 19 different musical Instruments was presented by Kitahara et al. [16] with the help of 18 dimensional features. The feature set was composed of F0 normalized covariance and mean, which produced an accuracy of 79.73%. Benetos et al. [2] used various classification techniques to distinguish 20 Instruments with the help of MPEG-7 audio descriptors as well as zero crossing, spectrum flatness, MFCC, auto correlation, spectrum roll off frequency, specific loudness sensation and total loudness, and produced accuracies ranging from 88.7% to 95.3%. Livshin et al. [19] presented a real time Instrument recognition technique from solos for 7 Instruments. After 62 dimensional feature extraction, a dimension reduction technique using Gradual Descriptor Elimination was applied to reduce the computational overhead. Accuracies of 88.13% and 85.24% were obtained for the non reduced and reduced sets respectively with the aid of KNN classification and an LDA transformed learning set. Kaminskyj and Czaszejko [15] classified 19 Instruments from 9 major and sub families. They extracted 6 features, namely cepstral coefficients, multidimensional scaling analysis trajectories, constant transform frequency spectrum, RMS amplitude envelope, presence of vibrato and spectral centroid. They obtained a highest accuracy of 97% for Family identification using the KNN classification technique. Lita et al. [17] presented a smart sound sensor based system for the identification of 3 musical Instruments in real time and obtained an average accuracy of 98.33%. Kaminskyj et al. [14] distinguished 4 different Instruments from 4 different families by employing various mechanisms in the pre processing stage, including short term RMS energy envelope computation, Principal Component Analysis and Ratio or Product transformations of the same. Artificial Neural Network and nearest neighbour based classifiers were applied and accuracies in the range of 93.8%-100% were obtained. Yu et al. [34] differentiated 14 Instruments from 4 Chinese folk Instrument families and obtained a highest accuracy of 89% by combining perceptron based features along with Mel Scale Cepstral Coefficients. Liu et al. [18] designed a system for identification of 4 Instrument families for both Chinese and Western Instruments. They experimented with various classifiers and features for both Chinese and Western genres and concluded that the Spectral Flatness Measure coupled with a KNN classifier produced the best result in the case of Chinese Instruments, while the same feature coupled with an SVM, or MFCC coupled with KNN, produced the highest accuracy for Western Instruments. They obtained a difference of 28% in accuracy between the best and worst classification schemes. Agostini et al. [1] presented a system for the identification of 30 musical Instruments from the McGill University Master Samples database using spectral features. Various classification techniques encompassing k-Nearest Neighbour, Canonical Discriminant Analysis, Quadratic Discriminant Analysis (QDA) and SVM were applied, out of which highest accuracies of 80.2%, 78.6% and 69.7% were obtained for 17, 20 and 27 instruments respectively using an SVM with an RBF kernel. They further obtained accuracies of 81% and 92.2% for the 27 instrument family and pizzicato-sustained discriminations respectively using QDA. They also reported accuracies of 89%, 94% and 96% using QDA for the rock strings, woodwind and brass families respectively. Livshin et al. [20] presented algorithms for outlier or bad sample detection to improve musical Instrument identification. A sliding window of 60 ms with a 66% overlap was used for feature calculation, which helped in successfully discarding 70.1% of the bad samples that generally degrade Instrument recognition performance. Fragoulis et al. [8] designed a system to recognize 2 different Instruments, namely guitar and piano, using tonal spectral content for clips of average length 1.8 sec. An accuracy of 100% was obtained for 926 isolated piano notes and 612 similar guitar notes. Röver et al. [29] presented a Hough transformation based approach to identify musical Instruments. They used a hybrid of Linear Discriminant Analysis and Quadratic Discriminant Analysis known as Regularised Discriminant Analysis to identify 25 Instruments and obtained a lowest misclassification rate of 26.1%. Donnelly et al. [6] used different Bayesian Networks to classify 24 different orchestral Instruments. Bayesian networks with conditional dependencies in the frequency and time domains produced accuracies of 98% and 97% for Individual Instrument and Instrument Family identification. Yu et al. [33] proposed an improved matching pursuit algorithm for the identification of musical Instruments. They extracted atomic parameters for Instruments from the algorithm and fed them to an SVM in order to differentiate 10 musical Instruments, obtaining an accuracy of 87.44% in only one third of the time required by the standard matching pursuit algorithm. Jadhav [12] obtained accuracies of 88%, 84% and 73.33% for 5, 10 and 15 different Instruments with the help of timbral audio descriptors and a Binary Tree classifier. Accuracies of 90%, 77% and 75.33% were obtained for the same sets using a KNN classifier along with MFCC features.

3 Dataset development

One of the most important facets of any experiment is data collection. The database for our experiment was put together with the aid of synthesized tones of 7 different Instruments, namely Flute, Grand Piano, Guitar, Saxophone, Harmonium, Violin and Santoor. The Instruments hailed from 3 families, namely Wind (Flute and Saxophone), Keyboard (Grand Piano and Harmonium) and String (Violin, Nylon String Guitar and Santoor). These Instruments were chosen to include both Indian and Western Instruments from the various families, which are some of the most essential ingredients of melody. All 22 natural notes in the scale of C from C2 to C5 were played 20 to 30 times for every Instrument in various playing styles including Fortississimo, Fortissimo, Mezzo forte, Forte, Marcato, Staccato, Legato, Pianissimo and Pianississimo. These clips were used to generate 2 datasets (D1 and D2) consisting of 2626 one-second and 1311 two-second clips respectively. The clips were stored in uncompressed .wav stereo format at a bitrate of 1411 kbps. The number of clips for both individual Instruments as well as Instrument Families is presented in Table 1. Each of the presented datasets was used for both the recognition of Individual Instruments as well as Instrument families.

Table 1 Individual Instrument level and Instrument family level details of datasets D1 and D2 along with total (T) clips

Data can be contaminated by various kinds of noise in real life scenarios. To test the performance of our proposed system, each of the datasets (D1 and D2) was contaminated with 4 types of noise sources, namely Rain, Traffic, Vacuum Cleaner and Fan, which produced 4 × 2 = 8 more datasets whose details are presented in Table 2.

Table 2 Details of the noisy datasets

The Instrument wise Signal to Noise Ratios (SNRs) for the 1 second long clips (D3-D6) and 2 second long clips (D7-D10) in the various noisy conditions, at both the Individual Instrument level and the Instrument Family level, are presented in Table 3.

Table 3 SNRs for individual instruments and instrument families
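The paper does not detail how the noise was overlaid, so the following is only a minimal sketch, under the assumption of simple additive mixing, of how a clean clip could be contaminated with a noise recording and the resulting SNR measured. The file names are hypothetical and the soundfile package is assumed for .wav I/O.

```python
import numpy as np
import soundfile as sf  # assumed available for .wav reading and writing

def mix_with_noise(clean_path, noise_path, out_path):
    """Overlay a noise recording on a clean clip and report the resulting SNR in dB."""
    clean, sr = sf.read(clean_path)
    noise, sr_noise = sf.read(noise_path)
    assert sr == sr_noise, "clip and noise must share a sampling rate"
    # Assumes the noise recording is at least as long as the clip
    # and has the same channel layout (stereo in our datasets).
    noise = noise[: len(clean)]
    noisy = clean + noise                       # simple additive contamination
    snr_db = 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))
    sf.write(out_path, noisy, sr)
    return snr_db

# Hypothetical usage:
# snr = mix_with_noise("flute_C4.wav", "rain.wav", "flute_C4_rain.wav")
```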

3.1 Instruments in the dataset

A brief description of the instruments which were selected in our experiments is presented in the following paragraphs.

Flute::

This is a wind instrument which is also known as Bansuri in India. Flutes can be either side blown or front blown. A flute is capable of producing sounds of different octaves with the same fingering position if only the blowing pressure is varied. Flutes are mostly made from bamboo; however, many musicians use metallic flutes as well.

Guitar::

It is a stringed instrument. Guitars are of various types like Nylon String, Steel String, Bass, etc. A musician needs to strike the strings either with fingers or with the help of a plectrum for producing sound.

Harmonium::

It can be considered as a keyboard instrument due to the presence of keys. It also has reeds which play a vital part in tone production and thus can be considered as a reed instrument as well. The player needs to push and pull the front lever for air circulation within the instrument and at the same time press the keys for producing sound.

Piano::

It is a keyboard instrument. There are various types of Pianos such as the acoustic grand piano and the modern day electric piano. Pianos have evolved into modern day synthesizers which come with various tonal capabilities and other features, making the task of music production a lot easier.

Santoor::

It is a stringed musical instrument which is trapezoidal in shape. The Santoor is played by striking the strings with two wooden mallets. It is sensitive to glides as well as light strokes. The instrument has tuning pegs mostly on the right for tuning the strings in order to produce sounds of different frequencies.

Saxophone::

It is a wind instrument which is mostly made of brass. A player needs to blow through the mouthpiece located at the top and close the holes in various combinations with the help of a key system to produce music. There are various kinds of saxophones such as alto, tenor, soprano, etc.

Violin::

It is a stringed fretless instrument which is played using a bow. A musician needs to bow with one hand and finger the fingerboard with the other to produce sound. Earlier violins were mostly acoustic, but with the advent of technology, electric violins are now also available and are mostly used in concerts and recordings.

4 Proposed methodology

The clips were first framed into short sections and then windowed as part of pre-processing. Next, standard LPCC features were extracted from the clips. In order to tackle the problem of uneven dimensionality, LPCC-S features were generated for the clips, which were then fed to an Extreme Learning Machine based classifier. The proposed system is graphically illustrated in Fig. 1 and its details are presented in the subsequent paragraphs.

Fig. 1 Graphical illustration of the proposed system

4.1 Pre-processing

4.1.1 Framing

The spectral properties of a sound signal vary considerably over its entire length, which makes analysis difficult. To cope with this problem, a clip is partitioned into small parts called frames. The spectral properties tend to be quasi stationary within such frames, thereby facilitating analysis. A signal can be framed in 2 ways, namely overlapped framing and non overlapped framing. In overlapped framing, a certain number of sample points towards the end of a frame intersect with the starting sample points of the next frame. This ensures continuity between the frames and a smoother transition between them. In our experiment, sound signals were framed in overlapping mode with a frame size of 256 sample points and an overlap of 100 sample points. 2 consecutive overlapping frames are graphically illustrated in Fig. 2. The number of obtained frames (m) of size F for a signal consisting of n sample points with O overlapping points can be calculated using (1).

$$ m=\left\lceil\frac{n-F}{O}+ 1\right\rceil $$
(1)
Fig. 2 Framing methodology

4.2 Windowing

After framing, discontinuities at the frame edges may interfere with the Fourier Transform of the frames in the form of spectral leakage. In order to minimize such problems, the frames are usually multiplied with a windowing function which approaches 0 towards its ends and reaches its peak in the middle. Among various such windowing functions, the Hamming Window is one of the most popularly used; its utility has been presented in [23, 24], which inspired us to use it in our experiment. The Hamming Window function is mathematically presented in (2) and graphically illustrated in Fig. 3.

$$ w(x)= 0.54-0.46 \cos \left( \frac{2 \pi x}{M-1}\right) $$
(2)

Here w(x) is the Hamming Window function, M is the frame size and x ranges from the start to the end of the frame (0 ≤ x ≤ M − 1).
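As an illustration of (1) and (2), a minimal sketch of the framing and windowing step is given below. We read (1) as implying that successive frames start O sample points apart, which reproduces the 440 frames per one-second clip quoted in Section 4.4; the function and variable names are ours.

```python
import numpy as np

FRAME = 256    # frame size F (sample points)
OVERLAP = 100  # overlapping points O

def frame_and_window(signal, frame=FRAME, overlap=OVERLAP):
    """Split a signal into overlapping frames per Eq. (1) and apply the Hamming window of Eq. (2)."""
    n = len(signal)
    m = int(np.ceil((n - frame) / overlap + 1))      # number of frames, Eq. (1)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame) / (frame - 1))  # Eq. (2)
    frames = np.zeros((m, frame))
    for i in range(m):
        chunk = signal[i * overlap : i * overlap + frame]
        frames[i, : len(chunk)] = chunk              # zero-pad the final partial frame
    return frames * window
```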

Fig. 3 Structure of the Hamming window

4.3 Feature extraction

Twelve standard Linear Prediction Cepstral Coefficients (LPCC) [5, 26] were obtained for every frame of every clip with the aid of Linear Predictive Analysis, which predicts the present sound sample as a linear combination of previous sound samples. The mathematical representation of the nth sample, estimated from the previous J samples, is presented in (3).

$$ s(n)\approx A_{1}s(n-1)+A_{2}s(n-2)+A_{3}s(n-3)+A_{4}s(n-4)+\cdots+A_{J}s(n-J) $$
(3)

Here, A1, A2, A3, A4, ..., AJ are assumed to be constants over an analysis frame. They are also known as predictor or linear predictive coefficients, and they aid in predicting the present sample. The difference between the actual (s(n)) and predicted (\(\hat {s}\)(n)) samples is known as the error e(n), which is presented in (4) in terms of the predictor coefficients (Ak).

$$ e(n)=s(n)-\hat{s}(n)=s(n)- \sum\limits_{k = 1}^{J}A_{k}s(n-k) $$
(4)

In order to obtain a unique set of predictor coefficients, error minimization on the sum of squared differences is performed in accordance with (5), where the sum over m runs over the samples of a frame.

$$ E_{n}= \sum\limits_{m}\left[s_{n}(m)- \sum\limits_{k = 1}^{J}A_{k}s_{n}(m-k)\right]^{2} $$
(5)

To solve (5) for the predictor coefficients, En is differentiated with respect to each of the Ak as shown in (6)

$$ \frac{\partial E_{n}}{\partial A_{k}}= 0, \hspace{1cm} \text{for}\hspace{0.5cm} k = 1,2,3, \ldots, J $$
(6)

Finally the Cepstral Coefficients are calculated with the recursive procedure as shown in (7).

$$ \left.\begin{array}{l} C_{0}=\log_{e}J \\ C_{m}=A_{m}+{\sum}_{k = 1}^{m-1}\frac{k}{m}C_{k}A_{m-k}, \quad \text{for}\hspace{0.2cm} 1 \leq m \leq J \\ C_{m}={\sum}_{k=m-J}^{m-1}\frac{k}{m}C_{k}A_{m-k}, \quad \text{for}\hspace{0.2cm} m>J \end{array}\right\} $$
(7)

4.4 LPCC-S generation

Since clips of disparate lengths yielded disparate numbers of frames, features of variable dimension were obtained. A clip of 1 second, sampled at 44100 Hz, produces 440 frames (256 points wide with a 100 point overlap) according to (1). Since 12 LPCC features were extracted for every frame, a total of 5280 (12 × 440) feature values was obtained. Clips of greater length produced features of even larger dimension, which placed a serious computational burden on the system.

In order to deal with these 2 issues, LPCC-S (LPCC-Statistical) is proposed, whose dimension does not vary with the length of a clip, thereby addressing the uneven dimensionality problem, and whose lower dimension spares the system computational overhead. Each of the 12 bands of the raw LPCC features of a clip was analysed; the mean of each of those bands was computed, followed by the Standard Deviation of the same. Finally, these values were concatenated to yield a 24 dimensional feature. The LPCC-S generation methodology from the LPCC representation of a clip is presented in Algorithm 1.

Algorithm 1 LPCC-S generation from the LPCC representation of a clip
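A minimal sketch of the band-wise statistics described above, assuming the per-clip LPCC values are arranged as a (number of frames × 12) matrix; the function name is ours. Because the statistics are taken over frames, the output length is 24 regardless of how many frames the clip produced.

```python
import numpy as np

def lpcc_s(lpcc_matrix):
    """Collapse a (num_frames x 12) LPCC matrix into the 24-dimensional LPCC-S feature."""
    lpcc_matrix = np.asarray(lpcc_matrix, dtype=float)
    means = lpcc_matrix.mean(axis=0)        # one mean per LPCC band
    stds = lpcc_matrix.std(axis=0)          # one standard deviation per band
    return np.concatenate([means, stds])    # 12 means followed by 12 SDs = 24 values
```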

A graphical representation of the features of the various Instruments for both 1 and 2 second long clips in the Noise Free condition is presented in Fig. 4. It can be observed from the Figure that the feature values of the instruments follow different trends, which aids classification.

Fig. 4 a Feature values for the Instruments for 1 second long clips in the Noise Free condition. b Feature values for the Instruments for 2 second long clips in the Noise Free condition

The feature graphs for the 1 and 2 second clips in Rain and Fan Noise were analysed as well and are presented in the Appendix. It can be observed from the Figures that the feature values of the various instruments appear very similar due to the effect of noise. Moreover, not much change can be seen between the values for the 1 and 2 second clips of a particular instrument.

The feature graphs for the 1 and 2 second clips in Traffic and Vacuum Cleaner Noise were also analysed and are presented in the Appendix. It can be observed from the Figures that the feature values of the various instruments show some deviations for different clip lengths in the Traffic Noise condition. Moreover, inter instrument differences are also visible for certain pairs. However, in the Vacuum Cleaner Noise condition the feature values appear very close to one another for the various instruments, and negligible changes are observed for different clip lengths.

4.5 Classification with extreme learning machine (ELM)

Traditional Neural Networks trained using the back propagation method have quite a few issues associated with them, including the large number of steps involved in the gradient descent search, local minima, slow convergence, etc. ELM, however, provides an efficient and unified learning framework by generalizing a feed forward neural network with only 1 hidden layer, and it requires minimal human intervention for tuning parameters such as the number of nodes and hidden layers [9, 10, 32]. ELMs are capable of solving an array of classification or regression problems by generating a random learning model, which makes them very fast. In our experiment the number of output neurons was equal to the number of classes of the various datasets. The number of hidden neurons was varied from 1 to 600 and was set to the value for which the highest accuracy was obtained.

The learning method of ELM involves 2 major steps

Feature mapping:

In this stage, the ELM maps the input data to the hidden layer. The output function of this stage is shown in (8).

$$ f(x)= \sum\limits_{i = 1}^{L} \beta_{i} h_{i} (x)=h(x) \beta $$
(8)

where β = [β1, ....., βL]T is the generated weight vector between the hidden layer consisting of L nodes and the output layer consisting of m ≥ 1 nodes. The vector corresponding to the output of the hidden layer is denoted by h(x)=[h1(x), ....., hL(x)]. The value of hi(x) can be calculated using (9).

$$ h_{i}(x)=G(a_{i},b_{i},x),\quad a_{i} \in R^{d},\ b_{i} \in R $$
(9)

where G(a,b,x) is a nonlinear piecewise continuous function and (ai,bi) are the parameters of the ith hidden node.

Among the various activation functions, the sigmoidal function was chosen based on trial runs as it outperformed the rest. The sigmoidal function is represented in (10).

$$ G(a,b,x)=\frac{1}{1+exp(-(a*x+b))} $$
(10)

Here, the parameters (a and b) of the output function G(a, b, x) are generated randomly from a continuous probability distribution. Thus, unlike feed forward neural networks where the hidden neurons require tuning, those of the ELM are randomly generated. The function h(x) maps the d-dimensional input data to the L-dimensional random hidden layer in which the parameters of the hidden nodes are generated randomly. Hence, this feature mapping (hG) is random in nature.

ELM learning:

In comparison to the various traditional learning techniques, the extreme learning technique requires no adjustment of the hidden neurons. The target is to simultaneously achieve the smallest training error and the smallest norm of the output weights.
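A minimal sketch of such a learner is given below, assuming one-hot class targets and a minimum-norm least-squares (pseudo-inverse) solution for β; the function names and the choice of a standard normal distribution for the random parameters are ours.

```python
import numpy as np

def train_elm(X, T, L=600, seed=0):
    """Random sigmoidal hidden layer (Eqs. (9)-(10)) plus least-squares output weights."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    a = rng.standard_normal((d, L))            # random input weights a_i
    b = rng.standard_normal(L)                 # random biases b_i
    H = 1.0 / (1.0 + np.exp(-(X @ a + b)))     # hidden layer output h(x) for every sample
    beta = np.linalg.pinv(H) @ T               # minimum-norm least-squares solution for beta
    return a, b, beta

def predict_elm(X, a, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ a + b)))
    return np.argmax(H @ beta, axis=1)         # class = output node with the largest response
```

Only β is computed from the data; the number of hidden neurons L is the single parameter that was tuned (from 1 to 600) in our experiments.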

The universal approximation capability of the ELM [11, 32] is expressed in (11), which holds with probability 1 for proper output weights (β). A 5 Fold cross validation technique was used in the current experiment for evaluating the system.

$$ \lim_{L \rightarrow \infty} \left\| \sum\limits_{i = 1}^{L} \beta_{i}h_{i}(x)-f(x) \right\| = 0 $$
(11)
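Tying the sketches above together, a hypothetical end-to-end pass over a single clip might look as follows; the file path is illustrative, the mono conversion is our assumption, and the functions are the ones sketched in the preceding subsections.

```python
import numpy as np
import soundfile as sf

signal, sr = sf.read("flute_C4.wav")                      # hypothetical clip path
if signal.ndim > 1:
    signal = signal.mean(axis=1)                          # collapse stereo to mono for analysis
frames = frame_and_window(signal)                         # framing + Hamming window (Sections 4.1-4.2)
lpcc = np.array([lpcc_from_frame(f) for f in frames])     # 12 LPCCs per frame (Section 4.3)
feature = lpcc_s(lpcc)                                    # 24-dimensional LPCC-S vector (Section 4.4)
# Stacking such vectors for many clips gives X; one-hot labels give T for train_elm(X, T).
```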

4.6 Statistical significance test

A Statistical Significance Test was performed with the robust non-parametric Friedman test [4] for the purpose of comparing various classifiers popular for Pattern Recognition problems, encompassing BayesNet [28], SVM [27], Naive Bayes [13] and RBF [3]. The number of datasets (N) and the number of classifiers (k) were fixed at 3 and 5 respectively, which implies that each dataset was split into 3 parts. Since the noisy datasets for both the 1 and 2 second clips at the Individual Instrument as well as the Instrument Family level were generated by subjecting the clean datasets to various kinds of noise, the tests were carried out on the 4 clean datasets (2 at the Individual Instrument level and 2 at the Instrument Family level), which are the base datasets of our experiment. The accuracy of each of the classifiers on each of the parts was recorded, followed by the assignment of a rank (\({R^{i}_{j}}\)) in descending order, where \({R^{i}_{j}}\) signifies the rank of the jth classifier on the ith part. The mean rank of a classifier over the 3 parts was then calculated with the aid of (12).

$$ R_{j} = \frac{1}{N} \sum\limits_{i} {R^{i}_{j}} $$
(12)
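For illustration, a minimal sketch of (12) with hypothetical rank values for the k = 5 classifiers on the N = 3 parts of one dataset:

```python
import numpy as np

# Hypothetical ranks R_j^i: one row per dataset part, one column per classifier
ranks = np.array([[1, 3, 2, 5, 4],
                  [1, 2, 3, 5, 4],
                  [1, 3, 2, 4, 5]], dtype=float)
mean_ranks = ranks.mean(axis=0)   # R_j of Eq. (12), one mean rank per classifier
```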

Table 4 presents the accuracies and rank distributions for the various parts of datasets D1 and D2 at the Individual Instrument level. It can be observed from the Table that the highest accuracies of 98.29% and 98.70% were obtained for the 1st and 3rd parts of D1 and D2 respectively using ELM. The lowest accuracies of 64.53% and 54.02% were obtained for the 1st and 2nd parts of the respective datasets using LibSVM.

Table 4 Rank Distribution (R) and accuracies (A) for the parts of D1 and D2 at Individual Instrument level

Table 5 presents the accuracies and rank distributions for the various parts of datasets D1 and D2 at the Instrument Family level. It can be observed from the Table that the highest accuracies of 99.77% and 100.00% were obtained for the 1st parts of D1 and D2 respectively using ELM. The lowest accuracies of 75.86% and 78.39% were obtained for the 2nd parts of the respective datasets using Naive Bayes based classification.

Table 5 Rank Distribution (R) and accuracies (A) for the parts of D1 and D2 at Instrument Family level

The Null Hypothesis states that all the classifiers are equivalent, i.e., the Rj are the same for all j. In order to verify this for our experiment, the Friedman Statistic (\({\chi ^{2}_{F}}\)) [4] was calculated with the help of (13). The table of critical values for \({\chi ^{2}_{F}}\) (distributed in accordance with k-1 degrees of freedom) shows that the critical values for 4 (k-1) degrees of freedom at significance levels (α) of 0.05 and 0.10 are 9.488 and 7.779 respectively. The calculated values of \({\chi ^{2}_{F}}\) for the sets are shown in Table 6; they exceed these critical values and thus the Null Hypothesis is rejected.

$$ {\chi^{2}_{F}}= \frac{12N}{k(k + 1)} \left[ \sum\limits_{j} {R^{2}_{j}}- \frac{k(k + 1)^{2}}{4} \right] $$
(13)
Table 6 Values of Friedman’s Statistic for the Datasets

As a post hoc test, Nemenyi’s test [4] was carried out for comparing each pair of classifiers. Any two classifiers can be regarded as significantly different performers if their average ranks differ by at least the critical difference (CD), which is calculated using (14). The values of q0.05 and q0.10 for 5 classifiers in the case of Nemenyi’s test are 2.728 and 2.459 respectively [4], which led to CDs of 3.52 and 3.17 respectively. It was found that similar CD values were obtained for both datasets at the Individual Instrument level, which are presented in Table 7 (upper diagonal) with the CD values of the significantly different pairs highlighted in green.

Table 7 Results of Nemenyi’s Test on D1 and D2 at Individual Instrument level and Instrument Family level for q0.05 and q0.10

In the case of the Instrument Family level, slightly different CD values were obtained for the classifier pairs for D1 and D2, which are shown in Table 7 (lower diagonal, in the order D1/D2) with the CD values of the significantly different pairs highlighted in blue.

$$ CD= q_{\alpha} \sqrt[]{ \frac{k(k + 1)}{6N} } $$
(14)
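As an arithmetic check of (13) and (14), a minimal sketch using the fixed N = 3 and k = 5 of this experiment; the mean ranks would be those computed as in the snippet after (12).

```python
import numpy as np

def friedman_statistic(mean_ranks, N, k):
    """Friedman statistic of Eq. (13) from the classifiers' mean ranks."""
    R = np.asarray(mean_ranks, dtype=float)
    return 12 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)

def critical_difference(q_alpha, N, k):
    """Critical difference of Eq. (14)."""
    return q_alpha * np.sqrt(k * (k + 1) / (6 * N))

# Nemenyi CDs for N = 3 dataset parts and k = 5 classifiers:
print(round(critical_difference(2.728, 3, 5), 2))  # alpha = 0.05 -> 3.52
print(round(critical_difference(2.459, 3, 5), 2))  # alpha = 0.10 -> 3.17
```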

The Bonferroni-Dunn test [4] was performed on the datasets to compare the performance of ELM (the control classifier) with the other classifiers. The computational and evaluation procedure of the Bonferroni-Dunn test is similar to that of Nemenyi’s test; only the values of q0.05 and q0.10 differ (2.498 and 2.241 respectively), leading to CDs of 3.22 and 2.89 for the respective significance levels [4]. The calculated values of CD for the classifier pairs for the sets are presented in Table 8 for significance levels of 0.05 and 0.10 respectively. The CDs of the significantly different pairs are highlighted in blue and green for the respective significance levels.

Table 8 Results of Bonferroni-Dunn’s Test for the Datasets at q0.05 and q0.10

5 Result and discussion

The experiments were performed on a desktop with 16 GB of RAM, an i7 processor and the Windows 10 operating system. For both types of datasets, the highest accuracies were obtained in the noise Free scenarios. The results for the various noisy scenarios are presented and analysed in detail in the subsequent paragraphs. The analysis has been done in 2 phases to present a clear picture of the outcome of the experiments. In the 1st phase, the results obtained for the various datasets at the Individual Instrument level are discussed. The 2nd phase casts light on the results obtained at the Instrument Family level.

5.1 Individual instrument level

The obtained accuracies for the various datasets along with the number of Hidden neurons are presented in Table 9. It can be observed from the Table that in the noise free scenario, the highest accuracy was obtained for D1, which is the overall highest among all the experiments. In the case of the various noisy scenarios, the accuracies improved significantly on doubling the length of the clips (from the 1 second sets to the 2 second sets). Among the noisy sets, the highest and lowest accuracies were obtained for the Fan noise and Vacuum Cleaner noise scenarios respectively. For the 1 second long clip datasets, the performance of the system on the Traffic noise dataset was better than on the Fan noise dataset, which flipped in the case of the 2 second long datasets. An increase in the overall accuracy for all the noisy sets was observed from the datasets of 1 second long clips to those of 2 second long clips. Analysis of the accuracies for those sets reveals that accuracy gains of 3.56%, 5.16%, 2.41% and 4.87% were obtained for the Fan noise, Rain noise, Traffic noise and Vacuum Cleaner noise sets respectively.

Table 9 Obtained Accuracies for various Datasets at Individual Instrument level as well as Instrument Family level using ELM along with number of neurons in the Hidden Layer

The Instrument wise accuracies for the various datasets encompassing both the 1 and 2 second long clips are presented in Table 10. It can be observed from the Table that, unlike the other instruments, a slightly better performance for Flute was obtained using 1 second long clips rather than 2 second long clips in the Noise free scenario. One reason for this may be the sensitivity of the instrument to blowing technique as well as ambient air pressure. In the case of the noisy sets, the best results for Santoor, Violin and Harmonium were obtained in the Fan Noise scenario, while Flute, Guitar and Piano were most successfully identified in the Rain Noise scenario. The best performance for Saxophone was obtained in the Traffic Noise scenario.

Table 10 Accuracy for the Individual Instruments and Instrument Families for 1 and 2 second long clips

A comparison of the confusions among the various Instrument pairs for the various datasets was performed for both 1 second (1s) and 2 second (2s) long clips. The confusion matrices are available in the Appendix. The Instruments - Flute, Saxophone, Guitar, Santoor, Violin, Harmonium and Piano - are numbered 1-7 respectively for easier accommodation in the Tables.

It can be observed from the Tables that the highest misclassification for the 1 second long clip datasets occurred in the Vacuum Cleaner noise scenario, where Violin was classified as Piano. In the case of the 2 second long clip sets, the highest misclassification was found in the Vacuum Cleaner and Rain noise scenarios, where Flute was classified as Piano. The highest Individual accuracy among the noisy sets was obtained for Guitar, in both the Fan and Rain noise scenarios for the 1 second clip sets and in the Rain noise scenario among the 2 second long clip sets. The lowest Individual accuracies were obtained for Santoor in the Rain noise scenario among the 1 second clip datasets and for Harmonium in the Vacuum Cleaner noise scenario among the 2 second clip sets.

A comparison of the performance of MISNA with some of the systems reported in the literature for the identification of Individual Instruments is presented in Fig. 5. Though the compared systems are heterogeneous in terms of datasets, they are still compared for the sake of a graphical representation of their relative accuracies. The compared works are discussed in Section 2.

Fig. 5 Comparison of MISNA with some of the existing systems based on Individual Instrument identification, with the highest accuracy highlighted in red

5.2 Instrument family level

The obtained accuracies for the various datasets along with the number of Hidden neurons are presented in Table 9. It can be observed from the Table that in the noise free scenario, the highest accuracy was obtained for D2. In the case of the various noisy scenarios, the accuracies improved significantly on doubling the length of the clips (from the 1 second sets to the 2 second sets). Among the noisy sets, the highest and lowest accuracies were obtained for the Traffic noise and Vacuum Cleaner noise scenarios respectively. An increase in the overall accuracy for all the noisy sets was observed from the datasets of 1 second long clips to those of 2 second long clips. Analysis of the accuracies for those sets reveals that accuracy gains of 1.23%, 4.32%, 3.40% and 4.31% were obtained for the Fan noise, Rain noise, Traffic noise and Vacuum Cleaner noise sets respectively.

The Instrument Family wise accuracies for the various datasets encompassing both the 1 and 2 second long clips are presented in Table 10. It can be observed from the Table that a fractionally higher accuracy was obtained with the 1 second long clips for the Keyboard family, in contrast to the other families, in the Noise free condition. A probable reason for this could be the effect of fade out and fade in of the notes. In the noisy scenarios, the best results for all 3 families were obtained for the Traffic Noise dataset.

A comparison of the confusions among the Instrument Families for the various datasets was also performed for both 1 and 2 second long clips. The confusion matrices are available in the Appendix. The Families - Wind, String and Keyboard - are numbered 1-3 respectively for easier accommodation in the Tables.

It can be observed from the Tables that the highest misclassification for both the 1 and 2 second long clip datasets occurred where the String Family was classified as the Wind Family, in the Rain noise and Fan noise scenarios respectively. The highest Individual accuracy for both types of sets was obtained for the Keyboard Family in the Traffic noise scenario. The lowest Individual accuracies were obtained for the String Family in the Rain noise scenario among the 1 second clip datasets and for the Keyboard Family in the Vacuum Cleaner noise scenario among the 2 second clip sets.

A comparison of the performance of MISNA with some of the systems reported in the literature for the identification of Instrument Families is presented in Fig. 6. Though the compared systems are heterogeneous in terms of datasets, they are still compared for the sake of a graphical representation of their relative accuracies. The compared works are discussed in Section 2.

Fig. 6 Comparison of MISNA with some of the existing systems based on Instrument Family identification, with the highest accuracy highlighted in red

6 Conclusion

MISNA is a system designed for the identification of Individual Instruments as well as Instrument Families from audio clips in both clean and noisy environments. The system has been tested in various Noisy scenarios with SNRs as low as -9.63 dB, and encouraging accuracies have been obtained for both types of identification. The system uses a new low dimensional feature, namely LPCC-S, which overcomes some of the shortcomings of standard LPCC features such as uneven as well as large dimensionality. Extreme Learning Machine based classification has also been used in the proposed work, which makes the system lightweight in terms of computation due to its ability to generate randomised models. In future, we plan to use various pre processing techniques before feature extraction to filter out noise from the clips as well as for instrument activity detection. Various Feature Dimensionality Reduction techniques will also be experimented with for further dimensionality reduction of the proposed feature. We also plan to use other features and classification techniques and to test our proposed system on a larger database comprising a larger number of Instruments.