
1 Introduction

There are millions of songs available to users in online databases. Very often, we would like to listen only to songs that belong to a specified music genre. It is practically impossible to assign these millions of songs to music genres manually; one option is to do it automatically, using machine learning methods. The obtained classification quality can be improved by pre-processing the dataset, by appropriate selection of classifier parameters, or by creating an appropriate classifier structure (e.g., the number of layers). However, for a given classifier, at some point, the practical ability to improve the results further is limited.

In the current research, we examine a comparatively simple method: collections of ensembles. Of course, it is possible to create an ensemble of quite complex classifiers, including convolutional neural networks with many layers [6]. However, the computational cost of using such an ensemble can be high, as deep neural networks require considerable training time. That is why, in the current research, we intend to examine the creation of wide ensembles of relatively simple classifiers. Wide ensembles are understood here as ensembles built from many (dozens of) base-level classifiers, in this case multiple copies of the same classifier. This way, we obtain another method of improving the final result of music genre classification.

The main contribution of the paper is a method of creating wide ensembles. For a neural network, it can be done almost instantly, by creating multiple copies of a previously prepared classifier. This paper checks how the classification quality changes for this type of structure. The second contribution is checking whether an additional, late input of raw data, connected directly to the concatenation layer that joins the individual classifiers, improves the classification quality. Additionally, the influence of classifier depth and of applying Principal Component Analysis on the final result is examined.

2 Related Work

The problem of music genre recognition (MGR) [1], as one of the sub-disciplines of music information retrieval, has become an increasingly explored issue in recent years. The article [12] can be considered the beginning of the popularity of the MGR topic [5]. The classification of musical songs can be executed using many machine learning methods. Not only classical classifiers [2] can be used, but also newer approaches from the deep learning domain [4], such as convolutional neural networks (CNN) [8] or convolutional recurrent neural networks (CRNN) [11]. Unfortunately, deep neural networks are more challenging to create, need more time for the learning process, and often give worse classification results (comparative studies are presented in Table 4), or at least there is no guarantee of obtaining better ones. There are also studies in which ensembles of various classifiers are used. An ensemble consists of a base level with a set of classifiers, as well as a meta-classifier [10] that tries to predict the final result based on the outcomes of the base-level classifiers.

3 Conditions of Experiments

The experiments were conducted on the small subset of the Free Music Archive (FMA) dataset [3]. For each excerpt, there are over 500 features in the FMA dataset. This dataset was split into three sets, training, validation, and testing, in the ratio 80:10:10.
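As an illustration, a minimal sketch of such a split could look as follows. It assumes the FMA features and genre labels have already been loaded into NumPy arrays X and y; the stratification and the random seed are our assumptions, not details given in the paper.

```python
# A minimal sketch of the 80:10:10 split, assuming features X and
# integer genre labels y are already loaded as NumPy arrays.
from sklearn.model_selection import train_test_split

# Split off 20% of the data first, then halve that part into the
# validation and test sets (10% + 10% of the original dataset).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)
```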

The ensemble in the conducted research is built from multiple copies of the same classifier. However, each of the classifiers is trained independently. Because each copy starts from a different set of random initial weights, the result of the learning process is also different for each of the classifiers. Consequently, they generate slightly different classification outcomes. The outputs of the individual classifiers of a given ensemble are fed to the input of the 'Concatenate' layer of an additional classifier (based on a dense neural network), which generates the final classification result. Additionally, this layer can have neurons for an extra input of raw numerical data.
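To make the structure concrete, below is a minimal Keras sketch of such an ensemble. The layer sizes, the numbers of epochs, and the assumption that the base-level classifiers are frozen while the merging head is trained are ours; the paper specifies only the general structure.

```python
# A sketch of the ensemble structure, assuming raw FMA feature
# vectors of length 518 and the eight genre classes of FMA small.
from tensorflow.keras import layers, models

NUM_FEATURES = 518
NUM_GENRES = 8

def build_base_classifier():
    """One copy of the simple dense base-level classifier."""
    inp = layers.Input(shape=(NUM_FEATURES,))
    x = layers.Dense(128, activation="relu")(inp)
    out = layers.Dense(NUM_GENRES, activation="softmax")(x)
    return models.Model(inp, out)

# Create n copies; their different random initial weights lead to
# slightly different trained classifiers.
n_copies = 2
base = [build_base_classifier() for _ in range(n_copies)]

# Train each copy independently (X_train, y_train as in the split
# sketch above), then freeze it.
for clf in base:
    clf.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    clf.fit(X_train, y_train, epochs=10, verbose=0)
    clf.trainable = False

# Concatenate the copies' outputs with the late raw numerical input
# and train only the dense head that produces the final result.
raw_input = layers.Input(shape=(NUM_FEATURES,))
merged = layers.Concatenate()(
    [clf(raw_input) for clf in base] + [raw_input])
head = layers.Dense(64, activation="relu")(merged)
final_out = layers.Dense(NUM_GENRES, activation="softmax")(head)
ensemble = models.Model(raw_input, final_out)
ensemble.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
ensemble.fit(X_train, y_train, validation_data=(X_val, y_val),
             epochs=20, verbose=0)
```

Training the copies independently and freezing them keeps the widening step cheap: only the small dense head on top is trained on the concatenated outputs.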

4 Experiments

4.1 Basic Dense Classifiers

At the beginning, three basic classifiers were created. Their quality (Table 1) can be treated as a benchmark for further tests.

The first one is a simple dense classifier (Fig. 1a) consisting of three layers: an input layer and two dense layers. The results can be found in Table 1 in the 'En0' row. An accuracy of 53% is far better than a blind guess (FMA small contains eight balanced genres, so a blind guess gives 12.5%) and already looks promising. Each created ensemble has a unique name (number) from En1 up to En14; the current classifier (En0) is the only structure which is not an ensemble.

Table 1. Different ensemble sizes, without and with raw data input.

The second structure, En1, is the simplest ensemble: it consists of only two classifiers, without numerical input, joined by a predicting layer that merges the classifier outputs. Those two classifiers are just copies of the simplest classifier.

The next structure, En2, a two-classifier ensemble with numerical input (Fig. 1b), is almost the same as the previous one. The only difference is that a raw numerical input is attached to the merging layer.

Comparing the obtained results (Table 1), it turns out that the best approach is the two-classifier ensemble with numerical input. The additional late raw numerical input (concatenated with the outcomes of the base-level classifiers) significantly increased the classification quality of the whole structure. As the results turned out to be quite promising, the additional numerical input will be included for the rest of the research.

Fig. 1. a) Simplest classifier b) Ensemble with numerical input of raw data

4.2 Principal Component Analysis Influence

In this part of the research, Principal Component Analysis (PCA) was introduced. The feature vector can contain not only useful data but also noise, which can worsen the results. One popular method of preventing such a problem is dimensionality reduction. Over five hundred features of raw data are currently fed to the classifiers; reducing that number might not only speed up processing but also increase the accuracy of the proposed network. Training on all input data without PCA took around 6.5 s, while with PCA reduction to 300 components it took 4 s, and with 20 components only 2.8 s. The best result was achieved (Table 2) for 300 components, and that quantity will be used for the rest of the research. The PCA-reduced set of features is transferred both to the classifiers' inputs and to the late raw numerical input.
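As a sketch, the PCA step can be implemented with scikit-learn as follows; the arrays come from the split in Sect. 3, and fitting the projection on the training set only is our assumption.

```python
# PCA reduction to 300 components; the projection is fitted on the
# training set and then applied to the validation and test sets.
from sklearn.decomposition import PCA

pca = PCA(n_components=300)
X_train_pca = pca.fit_transform(X_train)
X_val_pca = pca.transform(X_val)
X_test_pca = pca.transform(X_test)
```

With this preprocessing, the input size of the classifiers in the earlier sketch would drop from 518 to 300.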

Table 2. Accuracy achieved for Principal Component Analysis.

4.3 Influence of the Number of Classifiers

In this part of the research, the influence of the number of classifiers was tested. Ensembles of three (En3), five (En4), and fifty (En5) classifiers were taken into consideration. The achieved results are presented in Table 1.

It turns out that one additional classifier in the ensemble (En3) did not improve the results. Adding another two classifiers to the model (En4) improved them slightly. However, the best result was achieved for a wide ensemble of 50 base-level classifiers (En5), with an accuracy of about 62%. Nevertheless, the better results also came with around 20 times longer training time than in the case of the two-classifier network. Additionally, overfitting of the network can be seen. As a result, batch normalization and dropout will be introduced in the next tests.
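In terms of the sketch from Sect. 3, widening the ensemble only changes the number of copies, e.g.:

```python
# En5-style width: fifty independently trained base-level copies.
n_copies = 50
base = [build_base_classifier() for _ in range(n_copies)]
```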

Table 3. Different ensemble sizes and different classifier sizes.

4.4 Influence of the Architecture of Base-Level Classifiers

This time not only the number but also the size and structure of the base-level classifiers were examined. The results of using such ensembles are presented in Table 3.

The first structure (En6) goes back to two base-level classifiers, this time with an additional dense layer and batch normalization. Comparing the obtained outcomes to the ensemble without the additional layer and batch normalization (En2) shows a slight improvement in classification quality. In the next ensemble (En7), yet another dense layer and batch normalization were added. Apart from a longer training time, introducing the next dense layer brought no significant change in classification accuracy. The structure of the En8 ensemble is similar to En6, but this time with five base-level classifiers. Interestingly, its quantitative outcomes are slightly worse compared to both En6 and the earlier En4. However, the En9 ensemble, with another dense layer and batch normalization, easily beat all presented ensembles except En5 (with 50 base-level classifiers). The En10 wide ensemble is similar to En9 but consists of fifty classifiers; its results are the best in the presented studies. A test was also carried out using additional dropout, but the results obtained in this way (En11, En12, and En14) turned out to be worse than those of En10. The same conclusion holds for a wider ensemble (En13) with 200 classifiers but without dropout.
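For illustration, a deeper base-level classifier in the spirit of En9/En10, with an optional dropout variant as in En11-En14, could look like the sketch below; the layer widths and the dropout rate are assumptions, not values taken from the paper.

```python
# A deeper base-level classifier with batch normalization and an
# optional dropout layer; input size matches the 300 PCA components.
from tensorflow.keras import layers, models

def build_deep_base_classifier(use_dropout=False):
    inp = layers.Input(shape=(300,))
    x = layers.Dense(128, activation="relu")(inp)
    x = layers.BatchNormalization()(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    if use_dropout:
        x = layers.Dropout(0.3)(x)  # En11/En12/En14-style variant
    out = layers.Dense(8, activation="softmax")(x)
    return models.Model(inp, out)
```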

5 Comparison of the Outcomes

Basic classifiers and raw data input. As can be seen in Table 1, even the basic ensemble (En1) improves the results slightly in comparison to the simple classifier (En0). A much more significant improvement is obtained with the late raw numerical data input (En2).

Principal Component Analysis. The introduction of Principal Component Analysis (Table 2) was not strictly related to the model development but to the preprocessing of the data the model operated on. The FMA dataset offers a total of 518 features (for each music track), and the data can be preprocessed by a dimensionality reduction method. Here, only PCA was exploited, and the best result was achieved for 300 components.

Width of ensemble. Additional base-level classifiers brought a significant increase in quantitative results (Table 1); the actual cost of this increase was only computation time. Increasing the ensemble width by adding classifiers improved the results significantly, up to fifty base-level classifiers. Expanding the ensemble further did not bring any advance; on the contrary, the outcomes already started to worsen.

Architecture of base-level classifier. The next way of improving the accuracy of the model was a change in the structure of the base-level classifier (Table 3). Increasing the depth of the base-level classifier resulted in higher accuracy values. At the same time, with more layers, the overfitting of the model became more noticeable. To reduce the spiking of training accuracy, batch normalization and dropout were introduced. Dropout did decrease overfitting but did not help with model accuracy.

Comparison of the quantitative outcomes with the state-of-the-art. To compare the best result achieved in this research (the wide ensemble En10 with fifty base-level classifiers, for which the accuracy was 0.658) with other state-of-the-art works [7, 9, 11, 13, 14, 15], Table 4 was created.

Table 4. Comparison of different models classifying FMA small dataset with the proposed wide ensemble En10 (all values are in %).

6 Conclusions

The results of the work are promising. The ensemble provided satisfying results, with the best model reaching almost 66% accuracy. It is worth mentioning that this value is in the range of state-of-the-art techniques. However, obtaining such a result is only a side effect; the main aim of the research was to show how to improve an originally obtained classification result relatively easily. This goal has been achieved. By implementing wide ensembles, we obtain another method of improving the final classification result without much design or programming effort. A certain limitation may be the increased computation time and the increased demand for computer resources. However, there are no inherent restrictions as to the field, dataset, or nature of the research in which this method can be applied.

Summarising, if the main goal is classification quality, the methods and structures presented in this article are definitely worth considering.