1 Introduction

With the rapid development of multimedia and communication technology, digital music is ubiquitous in everyday life. As the volume of music resources grows, retrieving them manually has become laborious, so music information retrieval (MIR) has become a challenging problem [3, 17]. An algorithm that automatically divides music into genres according to its content plays an essential role in MIR; therefore, an accurate and effective music genre classification (MGC) algorithm is indispensable [18-20]. Traditional MGC methods consist of two parts: (1) feature extraction and (2) a classifier model. Feature extraction expresses the inherent properties of music as feature vectors, and the classifier maps these feature vectors to genres. Baniya and Lee [21] represented music with two types of features, tone texture and rhythm content, and used an Extreme Learning Machine (ELM) [7] with bagging as the classifier. Arabi [22] proposed capturing high-level concepts of music, combining harmony, pitch, and rhythmic content features with low-level features and using Support Vector Machines (SVM) [23] as the classifier. Sarkar [24] used Empirical Mode Decomposition (EMD) to capture tonal characteristics in the mid-frequency range and a multilayer perceptron (MLP) [6] as the classifier. Although these methods have contributed much to MGC, they all rely on hand-crafted features, which requires researchers with professional music knowledge to design effective features.

Deep learning has made breakthroughs in Natural Language Processing (NLP) and Computer Vision (CV) in recent years [8, 10, 25, 26]. The advantage of deep learning is that it provides an end-to-end learning mode, so features do not need to be designed separately. Therefore, the works of [15, 27-29] apply Convolutional Neural Networks (CNN) to audio classification. It is worth noting that low-level audio features [4] such as the short-time Fourier transform (STFT) spectrogram [30] and the Mel-spectrogram are particularly widely used. In MGC, Zhang [31] proposed a method based on a Convolutional Neural Network combined with pooling and shortcut connections [26]. To capture the temporal dependence of audio, Choi [32] proposed a Convolutional Recurrent Neural Network (CRNN) for music classification. Yu [14] found that spectra at different temporal steps have different importance; they therefore proposed a model that incorporates an attention mechanism into a Bidirectional Recurrent Neural Network [33] and discussed the influence of serial and parallel attention. These CNN- and attention-based methods consider factors such as the temporal dependence of audio and the importance of the spectrum at different time steps.

However, there is a strong correlation between the temporal frames sounding at any moment and the sound frequencies of all vibrations in the Mel-spectrogram. This can easily be seen by observing the Mel-spectrogram (Fig. 1): choose a temporal frame at random in the time domain, and there are vibrating sound frequencies along the vertical direction of that frame; similarly, choose a sound frequency at random in the frequency domain, and there are sounding temporal frames along the horizontal direction of that frequency. Therefore, we propose parallel channel attention (PCA) to build global time–frequency correlations. Specifically, PCA constructs a weight matrix to obtain global feature correlation, weights and sums the time–frequency information in the Mel-spectrogram, and generates new features that build global time–frequency dependencies. Observed horizontally, the Mel-spectrogram is a time domain representation; observed vertically, it is a frequency domain representation. Therefore, we discuss the influence of weighting based on the time domain, the frequency domain, and the time–frequency domain on building global time–frequency dependencies for the attention mechanism. In addition, when a CNN extracts features, the time–frequency information in each channel contributes differently to the feature map. We design dual parallel attention (DPA), composed of PCA and SE Attention [34], which focuses on global time–frequency dependencies in the song and adaptively distinguishes the importance of different channels. The main contributions of this paper are summarized as follows:

  1. We propose parallel channel attention, which builds global time–frequency dependencies in the song by representing the correlation between the temporal frames sounding at any moment and the sound frequencies of all vibrations.

  2. We discuss the influence of weighting based on the time domain, the frequency domain, and the time–frequency domain on building global time–frequency dependencies for the attention mechanism.

  3. We design dual parallel attention, which focuses on global time–frequency dependencies in the song and adaptively calibrates the contribution of different channels to the feature map.

Fig. 1

Mel-spectrograms of four music genres, Pop, Rock, Jazz, and Classical, in the GTZAN dataset, where all temporal frames from left to right constitute the time domain, and all sound frequencies from top to bottom constitute the frequency domain

The rest of this paper is organized as follows: Sect. 2 reviews related work on music genre classification and attention mechanisms. Section 3 describes the proposed parallel attention model applied in CNN-5, including CNN-5, parallel channel attention, SE Attention, and dual parallel attention. Section 4 presents the dataset and experimental setup. Section 5 analyzes the experimental results on the GTZAN dataset. Finally, Sect. 6 concludes the paper.

2 Related work

Deep learning is widely used in audio and signal classification. Yang [35] proposed duplicating convolutional layers whose outputs are fed to different pooling layers and concatenating the features after each pooling layer, providing more statistics for classification. Chang [36] learned 2D representations from 1D raw waveform signals as input features; they also proposed a new network architecture, MS-SincResNet, which learns 1D and 2D convolutional kernels jointly. Choi [37] proposed a transfer learning method for MGC: features from a convolutional network pre-trained for music tagging are transferred [13] to music-related classification tasks. Cai [2] proposed a novel music classification framework that incorporates auditory image features with traditional acoustic and spectral features. Srinivasu [38] used a deep learning model based on a fine-tuned AlexNet to classify signals associated with glucose levels in the human body. Scalvenzi [11] proposed a multiresolution analysis based on the discrete wavelet-packet transform (DWPT) combined with a support vector machine (SVM) to classify music signals, such as major and minor chords.

To capture global dependencies between input and output across distances, Vaswani [25] proposed the Transformer, a model architecture that relies entirely on an attention mechanism. Inspired by the classical non-local means method in CV, Wang [39] proposed the non-local neural network (Non-local), which computes the weighted sum of the features at all positions as the response at a position. Wang [40] proposed parallel temporal-spectral attention based on the time–frequency domain properties of the spectrogram, which enhances temporal and spectral features by capturing the importance of different time frames and frequency bands. Huang [41] proposed an end-to-end attention-based deep feature fusion (ADFF) approach for music emotion recognition to learn affect-salient features. Dosovitskiy [42] argued that the Transformer structure itself is decisive for the superior performance of the attention mechanism and proposed the Vision Transformer (ViT), which achieved a new breakthrough in CV. Gong [43] applied the Transformer structure to audio classification and proposed the Audio Spectrogram Transformer (AST).

Recently, the attention mechanism has been applied to MGC. Yang [44] studied the global dependencies of long audio sequences, employing a parallel structure instead of a recurrent architecture, multi-head attention as the feature extractor, and an SVM as the classifier; their model shows considerable generalization ability. In MGC, few researchers have focused on the time–frequency correlation in songs. We propose parallel channel attention, which builds global time–frequency dependencies in the song, and merge it with SE Attention to form dual parallel attention.

3 Proposed method

In this section, we first introduce the backbone network CNN-5; we then describe parallel channel attention, which builds global time–frequency dependencies; next, we review SE Attention; finally, we present dual parallel attention, which fuses parallel channel attention with SE Attention.

3.1 Overview

A song is generally represented as a waveform signal, which is converted into a Mel-spectrogram through the short-time Fourier transform and a Mel filter bank. In the Mel-spectrogram, there is a strong correlation between the temporal frames sounding at any moment and the sound frequencies of all vibrations. Moreover, the time–frequency information carried by each channel is not equally important. Therefore, we propose dual parallel attention, which focuses on global time–frequency dependencies in the song and adaptively calibrates the contribution of different channels.

The overall architecture of the proposed DPA approach is shown in Fig. 2. First, the input of the model is the Mel-spectrogram. Second, the backbone network of the model is CNN-5, and the proposed DPA is applied in the Attention modules of the backbone: CNN-5 captures local features in the Mel-spectrogram, and DPA builds global feature dependencies. As shown in Fig. 3, DPA consists of two parts: the upper part is parallel channel attention, which builds global time–frequency dependencies, and the lower part is SE Attention, which builds global channel dependencies. Finally, the features captured by the backbone network are sent to fully connected layers, which map them to genre classes as the output.

Fig. 2

The overall architecture of applying the attention mechanism in CNN-5

Fig. 3

Dual parallel attention, where the top part is PCA, and the bottom part is SE Attention. Here, \(\sigma \left(\cdot \right)\) denotes sigmoid function, \({\text{Conv}}\) denotes convolutional layer, \(\otimes\) denotes matrix multiplication, \(\oplus\) denotes matrix sum and \(\odot\) denotes element-wise product

3.2 CNN-5

This section introduces CNN-5, which consists mainly of five convolutional layers. The early layers, with few channels, capture low-level features such as texture and contour; as the layers deepen, the number of channels increases to gather deeper semantic features. The architecture parameters are as follows:

  • Layer 1: The first convolutional layer consists of 64 kernels with a 7 × 7 receptive field and stride (1, 1); the large receptive field aggregates more spectrogram features. Next, max-pooling with stride (2, 2) is used for down-sampling, capturing the critical information in each pooling block.

  • Layer 2: The second convolutional layer consists of 128 kernels with a 3 × 3 receptive field and stride (1, 1). Down-sampling is performed by average pooling with stride (2, 2), which retains all information in each pooling block.

  • Layer 3: The third convolutional layer consists of 256 kernels with a 3 × 3 receptive field and stride (1, 1). Down-sampling is performed by average pooling with stride (2, 2).

  • Layer 4: The fourth convolutional layer consists of 256 kernels with a 3 × 3 receptive field and stride (1, 1). Down-sampling is performed by average pooling with stride (2, 2).

  • Layer 5: The fifth convolutional layer consists of 256 kernels with a 3 × 3 receptive field and stride (1, 1). Finally, global average pooling [45] is used for down-sampling.

Each convolutional layer is followed by Batch Normalization (BN) [46] to speed up training, and the activation function is the Rectified Linear Unit (ReLU) [47]. Detailed parameters are shown in Table 1.

Table 1 CNN-5 network structure
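
For concreteness, the following is a minimal PyTorch sketch of the CNN-5 backbone described above. The pooling kernel sizes, padding choices, and the final fully connected classifier are assumptions, since Table 1 is not reproduced here.

```python
import torch
import torch.nn as nn

class CNN5(nn.Module):
    """Sketch of the five-layer CNN backbone; kernel sizes and channel
    counts follow the description above, padding is assumed."""
    def __init__(self, n_classes: int = 10):
        super().__init__()
        def block(c_in, c_out, k):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=k, stride=1, padding=k // 2),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )
        # pooling kernel 2 x 2 with stride (2, 2) is an assumption
        self.layer1 = nn.Sequential(block(1, 64, 7), nn.MaxPool2d(2, stride=2))
        self.layer2 = nn.Sequential(block(64, 128, 3), nn.AvgPool2d(2, stride=2))
        self.layer3 = nn.Sequential(block(128, 256, 3), nn.AvgPool2d(2, stride=2))
        self.layer4 = nn.Sequential(block(256, 256, 3), nn.AvgPool2d(2, stride=2))
        self.layer5 = block(256, 256, 3)
        self.gap = nn.AdaptiveAvgPool2d(1)        # global average pooling
        self.fc = nn.Linear(256, n_classes)       # fully connected classifier (assumed)

    def forward(self, x):                         # x: (batch, 1, n_mels, frames)
        for layer in (self.layer1, self.layer2, self.layer3,
                      self.layer4, self.layer5):
            x = layer(x)
        x = self.gap(x).flatten(1)
        return self.fc(x)
```

With a 128 × 313 input clip and the assumed padding, the four 2 × 2 poolings reduce the feature map to roughly 8 × 19 before global average pooling.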

3.3 Parallel channel attention

There is a strong correlation between the temporal frames sounding at any moment and the sound frequencies of all vibrations. We propose parallel channel attention (Fig. 4) to build global time–frequency dependencies. PCA constructs a weight matrix for each channel in the feature map to obtain global feature correlation, weights and sums the time–frequency information of each channel in parallel, and generates new features that build global time–frequency dependencies. Given an input feature map \(\mathbf{X}\in {\mathbb{R}}^{c\times f\times t}\), \(c\) denotes the number of channels, \(f\) denotes the number of Mel filter banks in the frequency domain, and \(t\) denotes the number of temporal frames in the time domain (in the feature map, \(f\) and \(t\) correspond to height and width). The feature map \(\mathbf{X}\) has \(c\) channels, denoted \(\mathbf{X}=\left({\mathbf{x}}_{1},{\mathbf{x}}_{2},\dots ,{\mathbf{x}}_{c}\right)\), where \({\mathbf{x}}_{i}\in {\mathbb{R}}^{f\times t}\), \(i\in \left\{1,2,\dots ,c\right\}\). Each \({\mathbf{x}}_{i}\) represents a channel and is also a feature set of the Mel-spectrogram, composed of two dimensions: the time domain and the frequency domain. This section describes parallel channel attention based on the time domain, the frequency domain, and the time–frequency domain.

Fig. 4

Parallel channel attention, which constructs a weight matrix to obtain global feature correlation, weights and sums time–frequency information and generates new features to build global time–frequency dependencies

Parallel channel attention based on time domain weighting is shown in Fig. 5: first, a time domain weight matrix is built; then, all sounding temporal frames at each sound frequency of the Mel-spectrogram are weighted and summed in parallel; finally, time domain features are aggregated across distances to build global time–frequency dependencies, as follows:

$${\mathrm{F}}_{PCA\_T}\left(\mathbf{X}\right)=g\left({\left(\sigma \left({\left(\phi \left(\mathbf{X}\right)\right)}^{\mathrm{\rm T}}\psi \left(\mathbf{X}\right)\right){\left(\theta \left(\mathbf{X}\right)\right)}^{\mathrm{\rm T}}\right)}^{\mathrm{\rm T}}\right)+\mathbf{X}$$
(1)

where \(\theta\), \(\phi\), and \(\psi\) denote three 1 \(\times\) 1 2D convolutions, \(g\) denotes a 3 \(\times\) 3 2D convolution, and \(\sigma\) denotes the Sigmoid function. \(\phi\) and \(\psi\) reduce the number of channels in the feature map and hence the number of parameters during the operation. \(\phi \left(\mathbf{X}\right)\) is transposed to obtain \({\left(\phi \left(\mathbf{X}\right)\right)}^{\mathrm{T}}\) and multiplied with \(\psi \left(\mathbf{X}\right)\) to obtain the time domain weight matrix. We select a single channel \({\mathbf{x}}_{i}\) of \(\psi \left(\mathbf{X}\right)\) and \({\left(\phi \left(\mathbf{X}\right)\right)}^{\mathrm{T}}\) to express this process, as follows:

$$\psi \left({\mathbf{x}}_{i}\right)=\left({\mathbf{w}}_{{\psi }_{1}}{\mathbf{y}}_{1},{\mathbf{w}}_{{\psi }_{2}}{\mathbf{y}}_{2},\dots ,{\mathbf{w}}_{{\psi }_{t}}{\mathbf{y}}_{t}\right)$$
(2)
$${\left(\phi \left({\mathbf{x}}_{i}\right)\right)}^{\mathrm{T}}={\left({\mathbf{w}}_{{\phi }_{1}}{\mathbf{y}}_{1},{\mathbf{w}}_{{\phi }_{2}}{\mathbf{y}}_{2},\dots ,{\mathbf{w}}_{{\phi }_{t}}{\mathbf{y}}_{t}\right)}^{\mathrm{T}}$$
(3)

where \({\mathbf{w}}_{{\psi }_{j}}{\mathbf{y}}_{j}\in \psi \left({\mathbf{x}}_{i}\right)\), \({\mathbf{w}}_{{\phi }_{j}}{\mathbf{y}}_{j}\in {\left(\phi \left({\mathbf{x}}_{i}\right)\right)}^{\mathrm{T}}\), \(j\in \left\{1,2,\dots ,t\right\}\), \(\mathbf{w}\) denotes a 1D convolution kernel, \({\mathbf{x}}_{i}\) denotes a channel in the feature map \(\mathbf{X}\), and \({\mathbf{y}}_{j}\) denotes a temporal frame on channel \({\mathbf{x}}_{i}\).

Fig. 5

PCA based on time domain. Sigmoid denotes activation for each element in the weight matrix. The blue circular box denotes 2D convolution, reducing the number of channels in the feature map

The weight matrix obtained by multiplying \(\psi \left({\mathbf{x}}_{i}\right)\) and \({\left(\phi \left({\mathbf{x}}_{i}\right)\right)}^{\mathrm{T}}\) denotes the correlation between any two temporal frames on channel \({\mathbf{x}}_{i}\) as follows:

$${\left(\phi \left({\mathbf{x}}_{i}\right)\right)}^{\mathrm{T}}\psi \left({\mathbf{x}}_{i}\right)=\left(\begin{array}{ccc}{\mathbf{w}}_{{\phi }_{1}}{\mathbf{y}}_{1}{\mathbf{w}}_{{\psi }_{1}}{\mathbf{y}}_{1}& \dots & {\mathbf{w}}_{{\phi }_{1}}{\mathbf{y}}_{1}{\mathbf{w}}_{{\psi }_{t}}{\mathbf{y}}_{t}\\ \dots & {\mathbf{w}}_{{\phi }_{j}}{\mathbf{y}}_{j}{\mathbf{w}}_{{\psi }_{j}}{\mathbf{y}}_{j}& \dots \\ {\mathbf{w}}_{{\phi }_{t}}{\mathbf{y}}_{t}{\mathbf{w}}_{{\psi }_{1}}{\mathbf{y}}_{1}& \dots & {\mathbf{w}}_{{\phi }_{t}}{\mathbf{y}}_{t}{\mathbf{w}}_{{\psi }_{t}}{\mathbf{y}}_{t}\end{array}\right)$$
(4)

The authors of [16] proposed employing the Sigmoid as the scaling function for audio signals, which avoids concentrating the attention on only a few temporal frames. Therefore, the Sigmoid is used as the scaling function in this paper, as follows:

$$\sigma \left({\left(\phi \left({\mathbf{x}}_{i}\right)\right)}^{\mathrm{T}}\psi \left({\mathbf{x}}_{i}\right)\right)=\sigma \left(\begin{array}{ccc}{\mathbf{w}}_{{\phi }_{1}}{\mathbf{y}}_{1}{\mathbf{w}}_{{\psi }_{1}}{\mathbf{y}}_{1}& \dots & {\mathbf{w}}_{{\phi }_{1}}{\mathbf{y}}_{1}{\mathbf{w}}_{{\psi }_{t}}{\mathbf{y}}_{t}\\ \dots & {\mathbf{w}}_{{\phi }_{j}}{\mathbf{y}}_{j}{\mathbf{w}}_{{\psi }_{j}}{\mathbf{y}}_{j}& \dots \\ {\mathbf{w}}_{{\phi }_{t}}{\mathbf{y}}_{t}{\mathbf{w}}_{{\psi }_{1}}{\mathbf{y}}_{1}& \dots & {\mathbf{w}}_{{\phi }_{t}}{\mathbf{y}}_{t}{\mathbf{w}}_{{\psi }_{t}}{\mathbf{y}}_{t}\end{array}\right)$$
(5)

The feature map \(\mathbf{X}\) also has its number of channels reduced through \(\theta\), and again only the corresponding channel, \(\theta \left({\mathbf{x}}_{i}\right)\), is selected, as follows:

$$\theta \left({\mathbf{x}}_{i}\right)=\left({\mathbf{w}}_{{\theta }_{1}}{\mathbf{y}}_{1},{\mathbf{w}}_{{\theta }_{2}}{\mathbf{y}}_{2},\dots ,{\mathbf{w}}_{{\theta }_{t}}{\mathbf{y}}_{t}\right)$$
(6)

where \({\mathbf{w}}_{{\theta }_{j}}{\mathbf{y}}_{j}\in \theta \left({\mathbf{x}}_{i}\right)\). \(\theta \left({\mathbf{x}}_{i}\right)\) is transposed and multiplied with \(\sigma \left({\left(\phi \left({\mathbf{x}}_{i}\right)\right)}^{\mathrm{T}}\psi \left({\mathbf{x}}_{i}\right)\right)\), so that all temporal frames in \(\theta \left({\mathbf{x}}_{i}\right)\) are multiplied and summed with the corresponding row of the weight matrix and aggregated into new features. This completes the operation of capturing time domain features across distances and builds global time–frequency dependencies, as follows:

$$\sigma \left({\left(\phi \left({\mathbf{x}}_{i}\right)\right)}^{\mathrm{T}}\psi \left({\mathbf{x}}_{i}\right)\right){\left(\theta \left({\mathbf{x}}_{i}\right)\right)}^{\mathrm{T}}=\left({\mathbf{y}}_{1},{\mathbf{y}}_{2},\dots ,{\mathbf{y}}_{t}\right)$$
(7)

\(\sigma \left({\left(\phi \left({\mathbf{x}}_{i}\right)\right)}^{\mathrm{T}}\psi \left({\mathbf{x}}_{i}\right)\right){\left(\theta \left({\mathbf{x}}_{i}\right)\right)}^{\mathrm{T}}\in {\mathbb{R}}^{c}\). Then, the original channel shape is restored through \(g\), keeping the input and output shapes consistent. Finally, a shortcut connection is added to the attention module to avoid losing the original information.

Parallel channel attention based on frequency domain weighting works analogously: first, a frequency domain weight matrix is built; then, all sound frequencies in each temporal frame are weighted and summed in parallel; finally, frequency domain features are aggregated across distances to build global time–frequency dependencies, as follows:

$${\mathrm{F}}_{PCA\_F}\left(\mathbf{X}\right)=g\left(\sigma \left(\phi \left(\mathbf{X}\right){\left(\psi \left(\mathbf{X}\right)\right)}^{\mathrm{T}}\right)\theta \left(\mathbf{X}\right)\right)+\mathbf{X}$$
(8)

Parallel channel attention based on the time–frequency domain fuses the features produced by time domain weighting and frequency domain weighting, as follows:

$${\mathrm{F}}_{PCA\_TF}\left(\mathbf{X}\right)=g\left(\sigma \left(\phi \left(\mathbf{X}\right){\left(\psi \left(\mathbf{X}\right)\right)}^{\mathrm{T}}\right)\theta \left(\mathbf{X}\right)+{\left(\sigma \left({\left(\phi \left(\mathbf{X}\right)\right)}^{\mathrm{\rm T}}\psi \left(\mathbf{X}\right)\right){\left(\theta \left(\mathbf{X}\right)\right)}^{\mathrm{\rm T}}\right)}^{\mathrm{\rm T}}\right)+\mathbf{X}$$
(9)
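
To make the computation in Eqs. (1)–(7) concrete, the following is a minimal PyTorch sketch of time-domain PCA under the assumptions above (1 × 1 convolutions for \(\phi\), \(\psi\), \(\theta\), a 3 × 3 convolution for \(g\)); the channel-reduction ratio r is an assumed hyper-parameter, and the frequency domain variant of Eq. (8) is obtained by swapping the transposes so that the weight matrix has shape f × f.

```python
import torch
import torch.nn as nn

class PCATime(nn.Module):
    """Sketch of parallel channel attention with time-domain weighting (Eq. (1))."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        c_r = channels // r
        self.phi = nn.Conv2d(channels, c_r, kernel_size=1)    # 1x1 conv, reduces channels
        self.psi = nn.Conv2d(channels, c_r, kernel_size=1)
        self.theta = nn.Conv2d(channels, c_r, kernel_size=1)
        self.g = nn.Conv2d(c_r, channels, kernel_size=3, padding=1)  # restores channel count

    def forward(self, x):                                      # x: (batch, c, f, t)
        phi, psi, theta = self.phi(x), self.psi(x), self.theta(x)
        # t x t weight matrix per reduced channel: correlation between any two temporal frames
        attn = torch.sigmoid(phi.transpose(-2, -1) @ psi)      # (batch, c_r, t, t)
        # weighted sum over temporal frames, then transpose back to (f, t)
        out = (attn @ theta.transpose(-2, -1)).transpose(-2, -1)  # (batch, c_r, f, t)
        return self.g(out) + x                                 # shortcut keeps original info
```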

3.4 Squeeze-and-excitation attention

A CNN extracts features by fusing information within a local receptive field, and each convolutional kernel completes this fusion independently. However, the time–frequency information of each channel in the feature map is not equally important. Hu et al. [34] proposed SENet (a general attention module, Fig. 6), which adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. SE Attention is the core module of SENet and consists of two parts, squeeze and excitation, as follows:

$${\mathrm{F}}_{SEA}={\mathrm{F}}_{\text{scale}}\left(\sigma \left({\mathrm{W}}_{2}\delta \left({\mathrm{W}}_{1}{\mathrm{F}}_{sq}\left(\mathbf{X}\right)\right)\right)\right)\odot \mathbf{X}$$
(10)

where \({\mathrm{F}}_{sq}\) denotes the squeeze function, \({\mathrm{W}}_{1}\in {\mathbb{R}}^{\frac{m}{r}\times m}\) and \({\mathrm{W}}_{2}\in {\mathbb{R}}^{m\times \frac{m}{r}}\) denote two fully connected layers (\(m\) is the number of channels and \(r\) the reduction ratio), \(\delta\) denotes the ReLU function, and \({\mathrm{F}}_{\text{scale}}\) denotes the scaling function.

Fig. 6

Squeeze-and-Excitation Attention

The squeeze operation first uses global average pooling to squeeze the features of each channel in the feature map into a single element. We select one channel \({\mathbf{x}}_{i}\) to describe the squeeze operation, as follows:

$${\mathrm{F}}_{sq}\left({\mathbf{x}}_{i}\right)=\frac{1}{f\times t}\sum_{p=1}^{f}\sum_{q=1}^{t}{\mathbf{x}}_{i}\left(p,q\right)$$
(11)

Next is the excitation operation: the squeezed features are passed through \({\mathrm{W}}_{1}\) and \({\mathrm{W}}_{2}\), two fully connected layers that learn the relationships between channels, each followed by a different activation function. Finally, \({\mathrm{F}}_{\text{scale}}\) broadcasts the output so that its shape matches the feature map \(\mathbf{X}\) and multiplies it with \(\mathbf{X}\) channel by channel to adaptively calibrate the relationships between channels. This paper adopts this general attention module, Squeeze-and-Excitation Attention (SE Attention), from SENet.
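
As a reference, the following is a minimal PyTorch sketch of SE Attention as described by Eqs. (10)–(11); the reduction ratio r = 16 is the usual SENet default and an assumption here.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Sketch of the squeeze-and-excitation block (Eqs. (10)-(11))."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # F_sq: global average pooling
        self.excite = nn.Sequential(                    # W1 -> ReLU -> W2 -> Sigmoid
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                               # x: (batch, c, f, t)
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c))     # per-channel weights in (0, 1)
        return x * w.view(b, c, 1, 1)                   # F_scale: channel-wise recalibration
```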

3.5 Dual parallel attention

The limited receptive field of a CNN cannot capture the correlation between the temporal frames sounding at any moment and the sound frequencies of all vibrations. At the same time, when the CNN extracts features from the Mel-spectrogram, the convolutional kernels capture time–frequency information at different levels and fuse it into channels; however, the time–frequency information in each channel is not equally important. We therefore design dual parallel attention (Fig. 3), which builds global time–frequency dependencies in the song and distinguishes the contribution of each channel to the feature map. Dual parallel attention fuses parallel channel attention and SE Attention, as follows:

$${F}_{DPA}\left(\mathbf{X}\right)={F}_{PCA}\left(\mathbf{X}\right)+{F}_{SEA}\left(\mathbf{X}\right)+\mathbf{X}$$
(12)

The feature map \(\mathbf{X}\) is first weighted by parallel channel attention and SE Attention; element-wise summation then completes the feature fusion, and a shortcut connection is added to avoid losing the original information.
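
Combining the two sketches above, Eq. (12) can be written as a small module; PCATime and SEAttention refer to the illustrative classes defined earlier, not to released code.

```python
import torch.nn as nn

class DPA(nn.Module):
    """Dual parallel attention (Eq. (12)): element-wise fusion of the two
    parallel branches plus a shortcut connection to the input."""
    def __init__(self, channels: int):
        super().__init__()
        self.pca = PCATime(channels)      # global time-frequency dependencies
        self.sea = SEAttention(channels)  # global channel dependencies

    def forward(self, x):                 # x: (batch, c, f, t)
        return self.pca(x) + self.sea(x) + x
```

In the backbone, such a module would be inserted after a convolutional layer, for example after layers two to five of CNN-5 as studied in Sect. 5.3.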

4 Dataset and experimental setup

4.1 Dataset and preprocessing

The dataset used in this paper is GTZAN, collected by Tzanetakis [12] and widely used in MGC. It includes 1000 songs evenly distributed over ten genres, 100 songs per genre: Blues, Classical, Country, Disco, Hip-hop, Jazz, Metal, Pop, Reggae, and Rock. Each song excerpt is about 30 s long, stored at 22,050 Hz with 16 bits. To avoid repetitive information across channels, we down-sample each song to 16,000 Hz and convert it to a single channel.

We transform each song into a Mel-spectrogram as the input feature. The FFT window length is 512, the hop length is 256, and the number of Mel bins is 128. We slice the songs [14, 31, 35]: each song is divided into 11 clips of 5 s each, adjacent clips overlap by 50%, and each clip has shape 128 × 313 [5]. In the experiments, the train, validation, and test sets are split in the ratio 8:1:1, and the balance between genres is maintained. In addition, because the results of a single experiment on GTZAN fluctuate significantly, we use ten-fold cross-validation to ensure the stability of the results; all reported test results are averages over the ten runs.
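
A minimal preprocessing sketch with librosa, assuming the parameters above (16 kHz mono audio, FFT window 512, hop 256, 128 Mel bins, 5-s clips of 313 frames with 50% overlap); log scaling of the Mel-spectrogram is an assumption, as the paper does not state it explicitly.

```python
import librosa
import numpy as np

def song_to_clips(path, sr=16000, n_fft=512, hop=256, n_mels=128, clip_frames=313):
    """Load a song, compute a (log) Mel-spectrogram with the settings above,
    and slice it into 5-s clips with 50% overlap (illustrative sketch)."""
    y, _ = librosa.load(path, sr=sr, mono=True)           # down-sample to 16 kHz, mono
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    mel = librosa.power_to_db(mel)                        # log scaling (assumed)
    step = clip_frames // 2                               # 50% overlap between clips
    clips = [mel[:, s:s + clip_frames]
             for s in range(0, mel.shape[1] - clip_frames + 1, step)]
    return np.stack(clips)                                # (n_clips, 128, 313)
```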

4.2 Experimental setup

In this paper, PyTorch is used as the deep learning platform, the GPU is an RTX 3090, the optimizer is Adam [48], the batch size is 22, and the loss function is Cross-Entropy. Each fold is trained for 50 epochs, so the ten-fold cross-validation takes 500 epochs in total. The initial learning rate is 0.0001 and decays to one-tenth every 20 epochs. Convolutional kernels are initialized with Xavier Normal, and Batch Normalization parameters are initialized with constants. During training, we treat all song clips as independent samples. During validation and testing, however, we use a voting mechanism that selects the genre with the highest probability over all clips of the same song as the final output. For example, if a song is divided into m clips and there are k genres, the prediction result of the song is as follows:

$$Y=\left(\begin{array}{cccc}{y}_{11}& {y}_{12}& \dots & {y}_{1k}\\ \dots & \dots & {y}_{ij}& \dots \\ {y}_{m1}& {y}_{m2}& \dots & {y}_{mk}\end{array}\right)$$
(13)

where \(i\in \left\{1,2,\dots ,m\right\}\), \(j\in \left\{1,2,\dots ,k\right\}\), and \({y}_{ij}\) denotes the probability of genre \(j\) for the \(i\)-th song clip. The average probability of genre \(j\) over all clips is calculated as follows:

$${y}_{j}=\frac{\sum_{i=1}^{m}{y}_{ij}}{m}$$
(14)

Then, the genre with the highest average probability is selected as the final output, as follows:

$${y}_{label}=\mathrm{max}\left({y}_{1},\dots ,{y}_{j},\dots ,{y}_{k}\right)$$
(15)
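
Eqs. (13)–(15) amount to averaging the clip-level probabilities of a song and taking the genre with the highest average; a short sketch:

```python
import numpy as np

def vote(clip_probs: np.ndarray) -> int:
    """Ensemble the clip-level predictions of one song.
    clip_probs: (m, k) array of per-clip genre probabilities (Eq. (13))."""
    avg = clip_probs.mean(axis=0)    # Eq. (14): average probability per genre
    return int(np.argmax(avg))       # Eq. (15): genre with the highest average probability
```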

5 Experimental results and analysis

5.1 Experimental results on the GTZAN dataset

In this section, we compare the proposed method with many existing methods; the results are summarized in Table 2. KCNN (k = 5) + SVM, nnet2, and nnet1 are redesigned network structures based on convolutional neural networks, shortcut connections, pooling, and other operations. Although the classification accuracy of the nnet1 network is higher than that of the backbone CNN-5, nnet1 cannot incorporate the attention mechanism, so this paper does not build on it; moreover, its final accuracy is lower than that of the method proposed in this paper. The accuracy of the transfer learning method based on hybrid features and transfer-learning training, and of the hybrid model based on a two-stage hybrid classifier, is also lower than that of the proposed methods. BRNN + PCNA and MhaNN-SVM incorporate the attention mechanism: BRNN + PCNA is a model based on a bidirectional recurrent neural network and parallel attention, and MhaNN-SVM uses multi-head attention as a feature extractor and an SVM as the classifier to recognize all classes. Neither method considers applying the attention mechanism to time–frequency dependencies. MS-SincResNet innovatively extracts features directly from the waveform signal for music genre classification and then feeds the features into the deep neural network ResNet for classification.

Table 2 Comparison results of CNN-5 + DPA and existing methods on GTZAN dataset

MS-SincResNet is slightly better than the proposed CNN-5 + DPA in classification accuracy. Therefore, we compare CNN-5 + DPA with MS-SincResNet in detail in terms of parameter size, training time, and accuracy. Since parameter size and training time are not reported in the MS-SincResNet paper, we re-ran the experiment with the code released by its authors. As shown in Table 3, 91.49% (Chang) is the accuracy reported in [36], and 90.20% (ours) is the accuracy of our re-run. The parameter size of MS-SincResNet is 43 MB, while that of CNN-5 + DPA is only about a quarter of it, 10 MB. Training MS-SincResNet takes 37 h, while CNN-5 + DPA takes about 4 h, roughly one-ninth of the time. Although the accuracy of CNN-5 + DPA is 0.09% lower than that of MS-SincResNet, its parameter size and training speed are much better. Therefore, CNN-5 + DPA is quite competitive among the methods listed in Table 2.

Table 3 Detailed comparison results between MS-SincResNet and CNN-5 + DPA

The test-set results of the ten-fold cross-validation are shown in Fig. 7. The red line represents the proposed CNN-5 + DPA model, and the blue line represents the backbone CNN-5. CNN-5 + DPA outperforms CNN-5 on most folds; CNN-5 is slightly better only on the eighth and tenth folds. Overall, the proposed DPA improves model performance steadily and effectively.

Fig. 7

Ten-fold cross-validation test results

As shown in Fig. 8, taking the fourth fold of the ten-fold cross-validation as an example, we present the details of model training in terms of loss and accuracy. Consider the loss plot first. In the first 20 epochs, both training and validation loss trend downward: the training loss decreases steadily as the epochs increase, while the validation loss fluctuates significantly, with a maximum exceeding 6 and a minimum of only 1. After 20 epochs, the learning rate decreases to 0.00001. Between the 20th and 40th epoch, training and validation loss remain roughly constant, with the training loss stable at around 0.03 and the validation loss at about 0.5. After 40 epochs, the learning rate decreases to 0.000001; there is no significant change in training or validation loss, which remain around 0.03 and 0.5, respectively, indicating convergence. Next, consider the accuracy plot. In the first 20 epochs, the training accuracy increases steadily with the epochs, while the validation accuracy fluctuates irregularly. After the learning rate decays at epoch 20, the training accuracy reaches 100% between the 20th and 40th epoch, and the validation accuracy fluctuates around 90%. After 40 epochs, the learning rate decreases again; the training accuracy remains at 100%, and the validation accuracy stays around 90%. After the first decay of the learning rate, the loss of the proposed model remains stable, indicating fast convergence. After the second decay, the validation accuracy is essentially stable, and together with the loss plot this shows that the model has converged.

Fig. 8

Loss and accuracy of training and validation of CNN-5 + DPA model

As illustrated in Fig. 9, the confusion matrix compares the predicted and actual results of CNN-5 with DPA on the GTZAN dataset. The higher the diagonal value and the darker the color, the higher the recognition rate of the genre. The classification accuracy of Classical and Blues is relatively high, reaching 99% and 97%, while that of Pop and Rock is relatively low, only 84% and 81%. From the perspective of musical style and of our method: Classical and Blues styles are relatively stable, and the melody and beat within a song change little. This paper adopts song slicing and a voting mechanism (Sect. 4.2), combining the prediction probabilities of all slices to form the final prediction. Therefore, songs with a consistent melody and style tend to be recognized more easily by the model. It is worth noting that the precision of Classical in Table 4 is not the highest. Precision measures the proportion of songs predicted as Classical whose actual genre is Classical. Although Classical has a high recognition rate, Jazz, which is similar in style, is easily misclassified as Classical, whereas fewer songs are misclassified as Blues; therefore, Blues has the highest precision.

Fig. 9

Confusion matrix of CNN-5 with DPA on GTZAN dataset

Table 4 Precision, Recall, and F-score of each genre obtained on the GTZAN dataset

Pop and Rock are generally marked by an inconstant rhythmic element, varied styles, and a complex structure. Take Rock.23 of the Rock genre in the GTZAN dataset as an example: Rock.23 is a 30-s clip of the ballad part of Bohemian Rhapsody. Promane [9] argues that the Bohemian Rhapsody style fuses elements of glam and progressive rock with those found in musical theatre, opera buffa, and vaudeville. Throughout the Rock.23 clip, vocal elements account for a prominent proportion, accompanied by a piano solo. Although the genre of Rock.23 is labeled Rock, the rock characteristics of the song are not obvious, and its content does not lean toward rock music. Under song slicing with a voting mechanism, the more complex the structure of a song, the greater the differences among the predictions of its clips, and the voting mechanism is then less suitable for ensembling the final result. Therefore, the varied styles of such songs, together with the method used in this paper, are the fundamental reasons for the low recognition rate of Rock and Pop.

5.2 Ablation study for attention

To explore the effect of weighting based on the time domain, the frequency domain, and the time–frequency domain on building global time–frequency dependencies in the spectrogram, we conducted the experiments with different settings shown in Table 5. Similarly, to verify the function of the two parts of DPA, we conducted the experiments with different settings shown in Table 6.

Table 5 Experimental results of constructing global time–frequency dependencies by PCA based on time domain, frequency domain and time–frequency domain
Table 6 Ablation experimental results of PCA and SE Attention in DPA

As shown in Table 5, comparing the baseline CNN-5 with CNN-5 + PCA, applying PCA in CNN-5 brings a remarkable improvement in accuracy. Specifically, applying PCA based on the time domain, the frequency domain, and the time–frequency domain in CNN-5 improves accuracy by 1.9%, 1.6%, and 1.7%, respectively, over the baseline. Music is a 1D time-series signal, and time-domain PCA is the most effective way to build time–frequency dependencies. Similarly, music can be expressed as frequency signals after the Fourier transform, so frequency-domain PCA also works. However, the classification accuracy of time–frequency domain PCA is not outstanding. We argue that building time–frequency dependencies by fusing time domain and frequency domain weighting mixes time-series and frequency information in the same features; such mixed features do not represent the audio signal well, so the classification accuracy does not improve further.

Table 6 indicates that applying PCA and SE Attention in CNN-5 improves accuracy by 1.9% and 1.2%, respectively, over the baseline. Applying PCA in CNN-5 improves model performance more than SE Attention does. These results suggest that when the Mel-spectrogram is the input feature, the channel feature information is highly similar, so even with SE Attention applied in CNN-5 the performance improvement is limited. In contrast, the fixed receptive field of a traditional CNN cannot capture global time–frequency information in the song, so applying PCA in CNN-5 works excellently. Finally, applying DPA in CNN-5 improves accuracy by 2.1% over the baseline, outperforming both SE Attention and PCA (here PCA uses time domain weighting). This result verifies that DPA focuses on global time–frequency dependencies in the song and adaptively calibrates the contribution of different channels to the feature map. In addition, with CNN-5 alone the parameter size is 6.0 MB; when SE Attention and PCA (time domain) are applied separately, the parameter sizes are 8.0 MB and 8.4 MB, increases of 2.0 MB and 2.4 MB, respectively. Finally, when DPA is applied, the parameter size is 10.4 MB, showing that when the two attention mechanisms are applied in parallel, their parameter counts simply add up.

5.3 Attention applied in CNN-5

In this section, we study the effect of the number and position of DPA modules applied in CNN-5 on performance. Specifically, we first experimented with different numbers of DPA modules in CNN-5; then, with the number fixed, we experimented with applying DPA at different positions in CNN-5. As shown in Fig. 10, applying DPA in the second, third, fourth, and fifth layers of CNN-5 improves performance the most, 2.1% above the baseline. Applying DPA in layers two and three, or in layers three and five, brings the least improvement, only 1.0% above the baseline. In Fig. 11, we found that when two DPA modules are applied in CNN-5, performance is sensitive to their positions: the gap between Max and Min reaches 1.0%, and the average value is closer to Min. When three DPA modules are applied, performance depends only weakly on position: the gap between Max and Min is only 0.2%, and the average lies between the extremes. (In this paper, attention is applied to the second, third, fourth, and fifth layers of CNN-5.)

Fig. 10

The accuracy of different numbers and positions of attention applied in CNN-5. The horizontal axis denotes the position and number of attention mechanisms applied in CNN-5, and the vertical axis denotes the accuracy. For example, "245" means that attention is applied in the second, fourth, and fifth layers of CNN-5

Fig. 11

The maximum (Max), minimum (Min), and average over all positions (Average) of the accuracy obtained when the same number of attention modules is applied at different positions in CNN-5

5.4 Contrast study for multiple attention mechanism

In this section, we compare DPA with Non-local, dual attention networks (DANet) [51], frame-level attention (FLA), and parallel time–frequency attention (PTS-A). Non-local and DANet are similar to DPA in architecture: they build long-distance feature dependencies by weighted summation of features through a weight matrix. FLA and PTS-A are similar to DPA in function: they incorporate the time-series or frequency characteristics of the audio signal.

As shown in Table 7, the proposed DPA achieves the highest accuracy. The results show that the accuracy of applying Non-local or DANet in CNN-5 is 3% lower than the baseline. We argue that their approach, squeezing each channel into a row, multiplying the features of all channels together to obtain a weight matrix that represents the dependency between any two features in the feature map, and finally weighting and summing the original features to build global dependencies, is not suitable for a spectrogram. The spectrogram consists of temporal frames in the time domain, or sound frequencies in the frequency domain, arranged in parallel; squeezing the spectrogram into a row connects the temporal frames or sound frequencies end to end, violating the characteristics of the spectrogram, so the accuracy decreases. In addition, the accuracy of applying FLA or PTS-A in CNN-5 is also lower than that of CNN-5 with DPA.

Table 7 Comparison results of applying multiple attentions in CNN-5

6 Conclusion

Automatic music genre classification is a research topic that classifies music (songs) into different genres according to their content. It can replace tedious manual labeling and provide a basis for commercial applications of music genre classification. In this paper, we proposed applying dual parallel attention (DPA) in CNN-5 for music genre classification. DPA is composed of parallel channel attention (PCA) and SE Attention. PCA uses a weight matrix to construct new features by the weighted summation of time–frequency information, building global time–frequency dependencies in the song. This paper also studied the effect of weighting based on the time domain, the frequency domain, and the time–frequency domain on building global time–frequency dependencies; among these, the time domain method is the most effective. In addition, we analyzed the effect of the number and position of DPA modules applied in CNN-5. The experimental results demonstrate that applying DPA in the second, third, fourth, and fifth layers of CNN-5 performs best. Moreover, when two DPA modules are applied, performance is sensitive to their positions, whereas with three DPA modules it depends only weakly on position. Compared with attention mechanisms such as Non-local and DANet, DPA achieves the highest classification accuracy in its optimal application positions.

For commercial applications, this method can provide users with acceptable classification accuracy when retrieving music of different genres. More importantly, we propose building global dependencies of the song, which is worth deeper study and can provide new ideas for music genre classification. There is still room for improvement: when DPA is applied in the backbone network CNN-5, it adds parameters and computational overhead to the model, while the gain in classification accuracy is modest. In future work, we will continue to focus on reducing the computational complexity of DPA. At the same time, recognizing music genres with rich styles on the basis of hand-crafted features is also worth researching.