1 Introduction

Interactions between machines and humans have grown in many forms over the last decade, including smart home appliances, smart stores, the automotive industry, security systems and mobile phones. Human interaction happens primarily through audio-visual signals and gestures. Because of recent technological advancements in the audio and music field, an enormous amount of data is available locally or over networks. Content search and information retrieval over this data is a demanding task in a variety of applications such as music mood classification, genre classification, melody identification, acoustic scene classification and cover song identification [66].

Audio analysis tasks such as audio surveillance, music genre recognition, sound event classification and acoustic scene classification require robust and highly discriminating features. Conventional audio feature extraction methods are categorized as (a) time-domain features and (b) frequency-domain features [66]. Zero-crossing rate (ZCR), signal energy, maximum amplitude and auto-correlation based features are a few examples of time-domain features. Frequency-domain features include fundamental frequency, spectral centroid, spectral flux, spectral density, spectral roll-off, chroma features, Mel-frequency cepstral coefficients (MFCC) and linear predictive coding (LPC). These features are often combined to enhance algorithm performance in various applications.

Recently, a new audio feature extraction approach based on time-frequency texture images has been developed and employed for different applications. In this technique, the input audio signal is first converted into a time-frequency image (such as a spectrogram, MFCC or cochleagram image) and textural features are then extracted from this visual representation. The distinctive two-dimensional time-frequency visualization can produce better features for audio detection and classification tasks, and such texture features are expected to complement conventional features in constructing a robust audio classification system. Because of the non-uniformity of the textures in the visual image, local feature extraction is usually considered during the feature construction phase.

Researchers initiated early efforts to understand and analyze visual information in the form of spectrogram images in the 1970s. Visual information from spectrogram images was employed in the 1970s and 1980s for identifying phonetic content [86], continuous speech recognition [38], multi-speaker continuous speech recognition [22] and identifying stop-consonants in continuous speech [87]. This manual analysis was limited to the spectrogram image and was difficult because of the complex structure of speech. More distinctive textural variations are observed in the time-frequency representation of short-duration audio samples, and different textural descriptors have effectively captured these variations for a variety of audio classification tasks in the recent past.

A comprehensive survey of time-frequency image texture feature extraction algorithms in audio applications is presented in this article. To the best of our knowledge, this is the first attempt to survey different image texture feature extraction techniques for speech, music, audio and environmental sound classification. A total of 77 papers from top-tier journals and conferences published in the last twelve years (2009 to 2020) are collected, and the year-wise number of papers is illustrated in Fig. 1. The literature has been categorized based on (a) audio signal, (b) time-frequency representation and (c) application.

Fig. 1 A graph depicting the number of papers published during 2009-2020

Based on the audio signal type, the articles are divided into (a) speech, (b) music, (c) environment/acoustic sound and (d) other applications. Figure 2 depicts the percentage of papers for each audio type. The other category includes audio signals such as bird sounds, baby cries and bird vocalizations. From Fig. 2 it is clear that most algorithms focus on music and environment/acoustic sound analysis and classification. The second categorization is based on the time-frequency visual representation used.

Fig. 2 Classification of various articles according to the audio signal type

Different types of time-frequency images are used in the literature for texture feature extraction, such as spectrogram, cochleagram, constant-Q transform (CQT) and MFCC images. Figure 3 illustrates the classification of the algorithms according to the time-frequency image. Spectrogram images are the representation most commonly employed for feature extraction, since the spectrogram provides more distinctive patterns for classification or identification tasks compared to other image representations.

Fig. 3 Classification of various algorithms according to the time-frequency image

The last categorization is based on the application of the time-frequency texture image. We have classified the proposed approaches into five broad application areas: (1) music genre identification, (2) acoustic scene classification, (3) bird and animal species classification, (4) sound event classification and (5) other applications. Figure 4 shows the classification and the percentage of articles in each application area. Time-frequency texture image features were first introduced for music genre classification and then extended to other application areas. Table 1 lists the abbreviations used in the article.

Fig. 4 Classification of various algorithms according to broad application areas

Table 1 List of abbreviations

Generalized architecture of time-frequency texture feature extraction approaches in audio classification algorithms: The different audio classification approaches in the literature based on time-frequency texture images encompass time-frequency image generation, texture feature extraction and a classification model. We have developed the generalized architecture of time-frequency texture feature extraction algorithms for audio classification tasks illustrated in Fig. 5.

Fig. 5 Generalized architecture of time-frequency texture feature extraction approaches in audio classification algorithms

The first step is to generate a visual representation, such as a spectrogram, cochleagram or CQT image, from the input audio signal. In the second step, textural descriptors are extracted from the audio image; LBP, LTP, LPQ and RLBP descriptors are commonly used for this task. Due to the non-uniformity of textures in the time-frequency image, local feature extraction is considered by employing zoning during the feature construction stage.

Textural descriptors such as LBP, LPQ and LTP extracted from the time-frequency texture image produce large-dimensional feature vectors. To speed up computation and reduce classification complexity, feature selection is often employed before the classifier stage. The feature selection stage removes redundant and less critical features, leaving only relevant descriptors and hence a smaller final feature set. Finally, classification is implemented using a support vector machine, k-nearest neighbor or neural network classifier. It is observed from the literature that the SVM is the most popular classification approach because of its excellent performance even in noisy conditions. A minimal sketch of this pipeline is given below.
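
The following sketch illustrates the generalized architecture of Fig. 5 under one set of assumed choices (a log spectrogram front-end, uniform LBP with frequency-band zoning, and an RBF-kernel SVM); the window sizes, zone count and LBP parameters are illustrative, not taken from any specific surveyed system.

```python
# Minimal sketch of the generalized pipeline in Fig. 5 (illustrative choices):
# spectrogram image -> zoned LBP histograms -> SVM classifier.
import numpy as np
from scipy.signal import spectrogram
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def audio_to_texture_features(x, fs, n_zones=4, P=8, R=1):
    # 1. Time-frequency image: log-magnitude spectrogram scaled to 8-bit grayscale.
    _, _, S = spectrogram(x, fs=fs, nperseg=512, noverlap=256)
    img = np.log(S + 1e-10)
    img = np.uint8(255 * (img - img.min()) / (img.max() - img.min() + 1e-10))

    # 2. Texture descriptor: uniform LBP computed per frequency zone (zoning),
    #    histograms concatenated to preserve local, non-uniform texture detail.
    lbp = local_binary_pattern(img, P=P, R=R, method="uniform")
    n_bins = P + 2                                    # number of uniform LBP codes
    feats = []
    for zone in np.array_split(lbp, n_zones, axis=0):  # split along frequency
        hist, _ = np.histogram(zone, bins=n_bins, range=(0, n_bins), density=True)
        feats.append(hist)
    return np.concatenate(feats)

# 3. Classification: an SVM trained on the concatenated zone histograms.
#    X_train (list of audio clips) and y_train (labels) are placeholders.
# features = np.vstack([audio_to_texture_features(x, fs) for x in X_train])
# clf = SVC(kernel="rbf").fit(features, y_train)
```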

This article illustrates a systematic methodology to compose a comprehensive record of the state-of-the-art algorithms focusing on time-frequency image texture features employed in audio applications. The major contributions of the survey can be summarized as:

  • A comprehensive survey of the state-of-the-art algorithms focusing on time-frequency image texture features is demonstrated.

  • Generalized architecture of time-frequency texture feature extraction approaches in audio classification algorithms is presented.

  • Presents a critical review of the different time-frequency representations employed in audio classification tasks, with their features, advantages and limitations.

  • Furnishes a brief review of various textural descriptors with their advantages and disadvantages utilized for feature extraction.

  • Presents limitations and challenges of existing techniques.

The article is organized as follows. Firstly, the published state-of-the-art algorithms are categorized into three different classes. Section 2 describes time-frequency image representations in detail along with their applications. Different textural features and classification algorithms are discussed in Sections 3 and 4, respectively. Section 5 outlines the challenges, advantages and limitations of implementing various audio applications using time-frequency texture images. Finally, Section 6 concludes the article.

2 Time-frequency visual representation

In the time-frequency texture image based audio feature extraction technique, the input audio signal is first converted into a time-frequency image such as a spectrogram, MFCC or cochleagram image, and textural features are then extracted from this visual representation. This section describes the different time-frequency visualizations employed for feature extraction and compares their key aspects.

2.1 Spectrogram

A spectrogram is a two-dimensional visual representation of signal strength at different frequencies as it varies with time. Since the spectrogram provides profound attributes, it is popularly employed in a variety of speech and music processing applications. In a spectrogram, the vertical axis represents frequency, the horizontal axis represents time, and the energy content is depicted as a grayscale level. Recently, spectrogram image texture has been characterized in various applications in order to capture the relevant details. Spectrogram types can be categorized as the log-mel spectrogram [53], IIR-CQT spectrogram [14] and linear spectrogram [8, 16, 18, 19, 30]. Moreover, spectrograms are also classified as narrowband or wideband based on the analysis window utilized.

To generate a spectrogram time-frequency image, the input speech sample x(i) is first segmented into frames of length N. These frames are then transformed into the frequency domain by applying a windowed Fourier transform,

$$ X_{t}(k)= \sum\limits_{i=0}^{N-1} x(i)\,\omega(i)\, e^{-j\frac{2\pi}{N}ki}, \quad k=0,\dots,N-1 $$
(1)

where ω(i) is a Hamming window, k is the frequency bin index, f(k) = kFs/N, and Fs is the sampling frequency. Finally, the linear or log power is used to create the spectrogram as,

$$ S_{\text{Lin}}(k,t)=\left| X_{t}(k)\right| $$
$$ S_{\text{Log}}(k,t)= \log S_{\text{Lin}}(k,t) $$
(2)

Some methods normalize the grayscale spectrogram intensity image before extracting textural features [30].
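
A direct numerical sketch of Eqs. (1)-(2) is given below; the frame length N, hop size and the small constant added before the logarithm are illustrative assumptions.

```python
# Sketch of Eqs. (1)-(2): windowed DFT per frame, then the log spectrogram.
import numpy as np

def log_spectrogram(x, N=512, hop=256):
    w = np.hamming(N)                                  # omega(i) in Eq. (1)
    frames = [x[t:t + N] * w
              for t in range(0, len(x) - N + 1, hop)]
    X = np.fft.rfft(np.asarray(frames), n=N, axis=1)   # X_t(k) per frame
    S_lin = np.abs(X).T                                # S_Lin(k, t) in Eq. (2)
    return np.log(S_lin + 1e-10)                       # S_Log(k, t)

# Normalizing the result to [0, 255] yields the grayscale intensity image
# used before texture extraction, as in [30].
```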

An early attempt at time-frequency texture image feature extraction is presented in [81]. The approach classifies different musical instruments using minimum-block matching of energy coefficients extracted from the spectrogram image as features, with an 85% accuracy rate. A music genre classification approach based on scale-invariant feature transform (SIFT) keypoint features extracted from the spectrogram time-frequency image is illustrated in [31, 43]. Local dynamic details are effectively represented using SIFT keypoint descriptors and classified using a support vector machine (SVM) classifier, attaining an 82.7% classification rate. A method based on central moment features extracted from the spectrogram image and a one-against-one (OAO) multi-class SVM classifier for sound event classification under mismatched conditions is illustrated in [30].

Music genre classification performance is enhanced by fusing acoustic and visual features (central moments) extracted from the spectrogram image in [74]. The combined feature set is classified using an SVM, resulting in an 86.1% average accuracy rate. Grey-level co-occurrence matrix (GLCM) features and a classifier voting mechanism are introduced for music genre classification with 67.2% average accuracy [26]. 28-D GLCM and 59-D local binary pattern (LBP) textural descriptors are extracted from each spectrogram image zone for music genre classification in [25, 27]. Additionally, the effect of assigning an individual classifier to each Mel-scale zone and of combining different classifiers is investigated in [23]. SVM classifier fusion rules such as the min, max and sum rules are employed across the different zones to increase the classification accuracy.

The performance of a sound event classification scheme under a noisy, mismatched environment is enhanced in [29]. The method uses sub-band power distribution and spectrogram image features classified using an SVM, resulting in over 96% accuracy. Local phase quantization (LPQ) and Gabor filter features are extracted from a spectrogram image for music genre classification and classified using an SVM with 80.78% accuracy [24]. It was observed that LPQ outperforms Gabor features when obtained globally from the spectrogram. Combinations of nonlinear classifiers in addition to Gabor and LBP features are explored for music genre classification with an 84.9% accuracy rate in [72, 73]. Each music genre has a unique spectrogram signature, as illustrated in Fig. 6. A method to classify a music signal into instrument and song is designed using spectrogram visual intensity co-occurrence descriptors and a random sample consensus (RANSAC) classification model [34].

Fig. 6 Sample music clips shown using spectrograms: (a) classical and (b) disco [72]

Audio surveillance in a noisy environment is analyzed using MFCC and central moment features extracted from the spectrogram image and a multi-class SVM classifier [60]. It was observed that linear grayscale descriptors are more robust than log-grayscale features in a noisy environment. Music genre classification using ten different descriptors and three different spectrogram representations (linear, global and Mel-scale zoning) is evaluated in [48]. An average accuracy of 86.1% is achieved using 45 SVMs, one trained for each texture feature, combined using the sum rule for the final decision.

Spectrogram image local statistics and an SVM are utilized for environmental sound classification in [39], with an impressive accuracy rate of 98.62%. Besides, the L2-Hellinger based feature normalization approach has proved to enhance robustness and add discriminating power. In [9], a codebook is created using k-means clustering of the LBP feature map from the spectrogram image and classified using an SVM for acoustic context identification. The bag-of-features (BoF) technique utilized reduces the computational complexity of the algorithm. Two LBP variants, RIC-LBP and μLBP, in addition to LBP and an SVM classifier, are employed for music genre classification with 84% accuracy [5].

The amplitude histogram of each frequency band is extracted as subband power distribution (SPD) features, together with histogram of gradients (HoG) features, for acoustic scene classification in [17]. Moreover, the earth mover's distance is employed to compare histograms, and it was found that the Sinkhorn kernel improved the classification performance. Prosodic cues in language are effectively modeled using LPQ descriptors extracted from the spectrogram image of a language utterance and employed for language identification [45].

The mean and standard deviation of central moments are computed from the linear grayscale spectrogram image and classified using a one-against-all (OAA) SVM classifier with an improved classification rate of 98.16% [62, 64]. However, the OAA approach requires higher training time compared to other multi-class SVM techniques. In another approach, the authors used GLCM features and SVM classifier fusion to obtain an accuracy of 90.20% [63]. The sub-band frequency analysis produced a higher accuracy rate but generates a large-dimensional feature vector. A method based on entropy, third-order moments and directionality features with an SVM classifier is developed for the identification of ground moving targets [67]. Multilevel feature extraction from the spectrogram time-frequency visual representation is introduced for music genre classification in [75]. A late classifier fusion of acoustic and visual descriptors is suggested, with 88.60% classification accuracy.

Sound event identification in noisy conditions using LBP and HOG descriptors from spectrogram images is presented in [42]. Moreover, the global characteristics are exploited using bag-of-audio-words and classified using an SVM, attaining 69.28% average accuracy. A music genre classification approach using a spectrogram-based gradient directional pattern is formulated with an SVM classifier, with 84.5% accuracy, in [6]. A bird species identification algorithm using different spectrogram-based texture features such as local ternary pattern (LTP) quantization, auto-similarities and LBP variants is implemented in [52]. Combining textural features with acoustic features improved the classification rate up to 94.5%.

A music genre classification algorithm combining different descriptors extracted from the Mel-scaled spectrogram image with a fusion of heterogeneous classifiers is presented in [50]. The technique is evaluated over the LMD, ISMIR 2004 and GTZAN databases with a highest classification accuracy of 84.9%. Robotic-hearing sound event classification in noisy conditions using multi-channel, band-independent LBP textural descriptors is evaluated on the RWCP and NTU-SEC databases in [57, 70]. The study revealed that the Gammatone spectrogram in the logarithmic domain is more appropriate for textural analysis of sound. A combination of LBP, RLBP and LPQ textural features is constructed from the spectrogram image representation for acoustic scene classification, resulting in an 80.17% accuracy rate evaluated on the DCASE 2016 database [33]. Additionally, combining the left and right audio channels for feature extraction increases the classification performance.

Spectrogram texture descriptors using GLCM and an SVM classification scheme are utilized for discriminating laryngeal mechanisms with an average accuracy rate of 86.16% [40]. A set of texture features is extracted from the spectrogram, rhythm image and gammatonegram images after dividing them into sub-windows and used to train an SVM classifier in [49]. The method is evaluated on different databases such as GTZAN, ISMIR 2004 and LMD. An automatic method for bird and whale species identification using three different spectrograms and multiple texture descriptors is presented in [47]. In addition to visual features, acoustic features are combined to enhance the identification rate measured using OAA-SVM. The class imbalance issue in music genre classification is addressed by applying oversampling and undersampling in [69]. LBP features are extracted after vertical splicing of the spectrogram image and classified using several classifiers.

Bird species identification is investigated using three different textural descriptors and SVM classification, attaining a 71% accuracy rate, in [85]. The dissimilarity approach employed in the algorithm performs better even with a large number of input classes. A Chinese regional folk-song recognition approach using auditory features and visual textural descriptors is formulated in [78]; ensemble SVM classification evaluated on three different Chinese folk-song databases achieved 89.29% accuracy. A speech emotion recognition algorithm using LBP texture features from spectrogram visual images and SVM classification is constructed in [54], with a highest identification rate of 84.5% achieved on the EMO-DB database. In [82], acoustic events are first represented using a Gaussian mixture model (GMM) energy detection approach, and acoustic and visual features are extracted for bird species identification. The Relief feature selection algorithm and an SVM classifier applied on a real-world bird species database resulted in 96.7% classification accuracy.

A method to discriminate snore sounds is designed based on HOG and LBP features from the spectrogram visualization and SVM classification, resulting in 72.6% accuracy, in [28]. Figure 7 depicts snore sound spectrogram images related to different vibration points, namely the velum, oropharyngeal lateral walls, tongue and epiglottis. A speech-music classification algorithm is developed using major spectral-peak locations, the identification of these peak sequences, and three different classifiers (SVM, GMM and random forest), with a 98% accuracy rate [10]. The periodicity, average frequency and statistics of these peak sequences are finally used as features; the speech and music differences are clearly visible in the spectrograms shown in Fig. 8. An animal sound recognition approach using double spectrogram features, (a) projection features and (b) LBP variance features, with a random forest classifier is formulated in [41]. The combined feature set greatly enhances the classification performance, attaining 98.02% accuracy.

Fig. 7 Snore sound spectrogram images related to the vibration point: (a) velum, (b) oropharyngeal lateral walls, (c) tongue and (d) epiglottis [28]

Fig. 8 Spectrogram representations of (a) a speech sample and (b) a music sample [10]

A speech emotion recognition technique using bag-of-visual-words extracted from the spectrogram image is developed in [68]. The visual vocabulary is constructed and classified using SVM classification evaluated on four different datasets. Figure 11 illustrates spectrogram images of various emotions created using the EMO-DB database with and without noise. A chatter detection method using the spectrogram time-frequency image and GLCM features is proposed in [20]. The machine condition is assessed through vibration signal analysis by identifying high-energy dominant frequency bands in the spectrogram image; the vibration signal spectrograms differ between stable and unstable conditions, as shown in Fig. 9. A method to identify the motivation of an infant's cry, i.e. whether it is due to feeling or pain, is presented in [32] using textural features extracted from the spectrogram image. Experiments are carried out under different noise conditions and classifier fusion strategies.

Fig. 9 Spectrogram images of vibration signals under (a) stable and (b) unstable conditions [20]

Speech spoofing detection using spectrogram image LBP texture features and SVM classification is introduced in [55], with a 71.67% average accuracy rate. Generalized Gaussian distribution (GGD) parameters are extracted from non-subsampled Contourlet transform (NSCT) sub-bands for speech and music discrimination in [14]. The spectrogram image is decomposed using the NSCT, and the estimated parameters are classified using an extreme learning machine (ELM) classifier. Higher-order statistics are encoded using Fisher vectors from the spectrogram monochrome image and classified using an SVM classifier, resulting in a highest accuracy of 92.27% [46]. A spectrogram texture descriptor based Indian language identification technique is developed in [21, 35]. Each language has a different spectrogram visualization, as depicted in Fig. 10. CLBP, LBPHF and DWT texture features are extracted, and an artificial neural network (ANN) classifier is used, attaining a 96.96% average identification rate (Fig. 11).

Fig. 10 Spectrogram visualizations of various Indian language speech samples [21]

Fig. 11 Spectrogram images of various emotions created using the EMO-DB database. The first row depicts original audio samples, whereas the second row shows samples after noise addition [68]

An acoustic scene classification method using GLCM features extracted from the log-mel spectrogram and an SVM classifier is described in [53]. The dimensionality of the feature vector is reduced using principal component analysis, and the method achieves an 83.2% classification rate evaluated on the DCASE 2016 database. In [83], a speech resampling manipulation detection algorithm based on spectrogram LBP features is presented; the forensic trace of the resampling operation is detected using an SVM classifier. Tables 2 and 3 summarize the different techniques proposed in the literature based on the spectrogram time-frequency visualization. Recently, robust acoustic event recognition using grayscale spectrograms was presented in [84].

Table 2 Summary of different algorithms based on spectrogram time-frequency representation
Table 3 Summary of different algorithms based on spectrogram time-frequency representation

2.2 Cochleagram

The cochleagram, also known as the gammatonegram, imitates the outer and middle components of the human ear. It relies upon the gammatone filter, which fits empirical observations of frequency selectivity in the mammalian cochlea, with an impulse response g(t) given by

$$ g(t)= a t^{P-1} \cos(2\pi f_{c} t + \phi)\, e^{-2\pi b t} $$
(3)

where t is time, a is the amplitude, P represents the filter order, ϕ is the phase shift, fc is the center frequency (in kHz), and b determines the filter bandwidth.
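
The short sketch below evaluates the impulse response of Eq. (3) for one filter; the ERB-based bandwidth rule (Glasberg-Moore) and the duration are assumptions commonly paired with the gammatone filter, not part of the equation itself.

```python
# Sketch of the gammatone impulse response in Eq. (3). The ERB-scale bandwidth
# b(fc) below is an assumed, commonly used choice (Glasberg-Moore), with fc in Hz.
import numpy as np

def gammatone_ir(fc, fs, duration=0.05, a=1.0, P=4, phi=0.0):
    t = np.arange(0, duration, 1.0 / fs)
    b = 1.019 * (24.7 + 0.108 * fc)        # assumed ERB-based bandwidth in Hz
    return a * t**(P - 1) * np.cos(2 * np.pi * fc * t + phi) * np.exp(-2 * np.pi * b * t)

# A cochleagram image is obtained by filtering the signal with a bank of such
# filters (center frequencies spaced on the ERB scale) and stacking the smoothed,
# downsampled channel envelopes as the image rows.
```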

In [61], a sound signal time-frequency representation based on the cochleagram, which uses a gammatone filter, was found more effectual than the spectrogram image. Comprehensive classification performance is shown using all three equivalent rectangular bandwidth (ERB) filter models. It is also observed that cochleagram image features give better results at low signal-to-noise ratios (SNRs). For feature extraction, the work presented in [65] utilizes pseudo-color cochleagram images of sound signals for robust acoustic event recognition, as illustrated in Fig. 12. To improve characterization against environmental noise, the authors mapped the grayscale cochleagram image to a higher-dimensional color space; the result shows notable improvement at low signal-to-noise ratios.

Fig. 12 Different color map representations of a pseudo-color cochleagram of a sample sound [65]

An automated cough sound analysis method to properly diagnose croup is presented in [59]. In this article, the authors used the cochleagram visual representation, whose frequency components are based on the selectivity property of the human cochlea, as shown in Fig. 13. The algorithm results in a sensitivity and specificity of 92.31% and 85.29%, respectively, for croup and non-croup patient classification. In [44], deep neural network back-end classifiers are explored using three different 2-D time-frequency features for audio event classification. Along with the cochleagram, the authors utilized spectrogram and CQT based images. Significant improvements in the results are achieved, showing that cochleagram image features perform well in the extreme noise cases of -5 dB and -10 dB SNR. Indian language identification using cochleagram image texture descriptors and an ANN classifier, with 95.36% average accuracy, is illustrated in [37]. Cochleagram image based algorithms and applications are summarized in Table 4.

Fig. 13 (a) Time-domain normal cough, (b) time-domain croupy cough, (c) spectrogram of normal cough, (d) spectrogram of croupy cough, (e) cochleagram of normal cough and (f) cochleagram of croupy cough [59]

Table 4 Summary of different algorithms based on Cochleagram time-frequency visual representation

2.3 Chromagram

For a music signal, shifting a time window across the signal results in a sequence of chroma features, with the pitch content within each window spread over 12 chroma bands [71]. This time-frequency representation is known as the chromagram. A chroma feature vector is also known as a pitch class profile (PCP). It is a well-established tool for analyzing music whose tuning is close to the equal-tempered scale and whose pitches can be meaningfully categorized. An important property of chroma features is that they capture the harmonic and melodic characteristics of music while being robust to changes in timbre and instrumentation. The chromagram feature vector consists of the 12-dimensional short-time energy distribution of a music signal. These 12 PCPs achieve a frame-wise mapping of the spectral energy onto bins corresponding to the twelve semitones of the chromatic scale for each analysis frame.

The chroma vector relies on the octave invariance principle, which states that there is no functional difference between musical notes separated by a doubling of frequency. It is computed by grouping the discrete Fourier transform (DFT) coefficients of a short-term window into 12 bins, each representing one of the 12 equal-tempered pitch classes of Western music (semitone spacing) [58].
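
A minimal sketch of this bin-folding step is shown below; the A440 reference tuning is an assumption, and practical chroma front-ends add tuning estimation, weighting and temporal smoothing.

```python
# Minimal chroma (PCP) sketch: fold DFT magnitude bins of one frame into the
# 12 pitch classes, exploiting octave invariance. A440 tuning is assumed.
import numpy as np

def chroma_frame(spectrum, fs, n_fft, f_ref=440.0):
    freqs = np.arange(1, len(spectrum)) * fs / n_fft            # skip the DC bin
    # Map each bin to a semitone index relative to the reference, modulo one octave.
    pitch_class = np.mod(np.round(12 * np.log2(freqs / f_ref)), 12).astype(int)
    chroma = np.zeros(12)
    np.add.at(chroma, pitch_class, np.abs(spectrum[1:]) ** 2)   # energy per class
    return chroma / (chroma.sum() + 1e-10)                      # normalized PCP
```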

A music-specific chromagram representation and ULBP textural features are utilized for speech/music signal classification in [15]. The chromagram-based visual and spectral features efficiently extract the melodic and harmonic details of a music signal that are otherwise absent in speech, as depicted in Fig. 14. In this study, eigenvector centrality feature selection is used, which enhances the detection performance. It was observed that a 24-bin chromagram representation is sufficient to exploit music tonality features for speech/music classification.

Fig. 14 Chromagram visual representations obtained using 12 bins for music and speech signals from the Scheirer and Slaney database [15]

2.4 Constant-Q Transform (CQT)

The CQT provides a time-domain to frequency-domain signal transformation producing a log-scale frequency resolution similar to auditory perception, delivering fine resolution at low frequencies [1]. In the constant-Q transform (CQT), the coefficient c(k,t) over k frequency bins of a time-domain signal s(t) is defined as

$$ c(k,t)=\sum\limits_{n=t-w_{s}/2}^{t+w_{s}/2} s(n)\, a_{k}^{*}(n-t+w_{s}/2) $$
(4)

where a_k^*(n) is the complex conjugate of the time-frequency atom a_k(n), defined using a window function w(n) of length w_s. The major difference between the CQT and a spectrogram is that w_s is not a constant but varies with the frequency bin, with longer windows at lower frequencies.
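
As a hedged illustration, a CQT time-frequency image can be generated with librosa as sketched below; the file name, hop length, number of bins and minimum frequency are illustrative choices rather than settings from any cited work.

```python
# Sketch: CQT time-frequency image via librosa (parameters illustrative).
# Per Eq. (4), the analysis window length varies with the frequency bin,
# giving log-spaced frequency resolution.
import numpy as np
import librosa

y, sr = librosa.load("scene.wav", sr=None)          # "scene.wav" is a placeholder
C = librosa.cqt(y, sr=sr, hop_length=512,
                fmin=librosa.note_to_hz("C1"),
                n_bins=84, bins_per_octave=12)
cqt_img = librosa.amplitude_to_db(np.abs(C), ref=np.max)   # dB-scaled image
# cqt_img can then be zoned and passed to LBP/HOG texture descriptors (Section 3).
```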

The algorithm demonstrated in [80] employs distinct models for sound textures and events in acoustic scenes. The framework achieved superior results in evaluation on real data; on the Rouen dataset, the algorithm performed better than other existing approaches. Novel features are obtained in [56] by the constant-Q transform followed by appropriate pooling. It is proved experimentally that HOGs computed from the constant-Q transform are useful for capturing specific patterns present in the time-frequency (TF) representation, and this HOG based feature proved globally efficient. A novel zoning approach, along with the time-frequency representation (TFR), to improve acoustic scene classification performance is described in [1]; the technique achieved accuracy up to 95.2%. Abidin et al. [2] presented an algorithm which fuses spectral and temporal features for acoustic scene classification. For the generation of the T-F representation, a variable Q-transform is used, which improved the classification rate by 5.2%. Figure 15 clearly depicts the difference between beach and cafe scenes using the CQT visual representation.

Fig. 15 Constant-Q transform representations for beach sound (left) and cafe sound (right) scenes [2]

In [4], the audio signal is first converted to CQT representations and LBP textural features are then extracted from the CQT T-F representations. The system achieved an accuracy of 85% on the DCASE 2016 dataset. Joint T-F image-based feature representations are found effective in [3]; these joint features produce better results across a wide range of low and middle frequencies in the audio signal, attaining a classification accuracy of 83.4%. The TFR with a zoning technique in combination with image-based features is productive and computationally efficient for ASC. In [76], acoustic and visual features are fused with a varied set of features for acoustic scene classification. The ReliefF algorithm, correlation-based feature selection (CFS) and principal component analysis (PCA) techniques are used for feature selection. The use of feature selection improved the algorithm performance and reduced the feature vector dimensionality. Table 5 shows a summary of different techniques based on CQT image feature extraction.

Table 5 Summary of different algorithms based on Constant-Q transform time-frequency visual representation

2.5 Other time-frequency representations

In this subsection, the remaining time-frequency representation methods are presented. Mel-frequency cepstral coefficients (MFCCs) are among the most popular feature extraction schemes for audio analysis; the MFCC filter banks mimic the human auditory system, producing discriminating features in speech processing applications. In [77], the temporal dynamics present in the audio sample are extracted using a subband MFCC time-frequency image and LBP texture features for acoustic sound classification. The work explores three frequency bands spanning 0 to 11 kHz with a 23 ms time window for each frame. The method achieved an improvement of 8% using a D3C ensemble classifier.

Harmonic and percussive images are produced using the harmonic-percussive source separation (HPSS) algorithm, and various texture descriptors extracted from these images are employed for music genre classification in [51]. Median filtering applied across the frequency axis highlights percussive occurrences, whereas median filtering applied across the time bins enhances the harmonic regions (a sketch of this idea is given below). In the same work, the authors presented scattergram image-based textural features for music genre classification, with the ScatNet scattering framework employed to generate the scattergram. The speed of audio can be measured by the tempo in beats per minute; a tempogram-based feature set is extracted using the novelty curve of the input audio signal for speech and non-speech signal classification in [79]. The highest classification rate of 99.20% is achieved by this approach using a multi-layer perceptron classifier and correlation-based feature selection.
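
The following is a hedged sketch of the median-filtering intuition behind HPSS (Fitzgerald-style masking), not the exact configuration of [51]; the kernel size and the hard masks are illustrative choices.

```python
# Sketch of median-filtering HPSS on a magnitude spectrogram S_mag (freq x time).
# Filtering along time keeps horizontal ridges -> harmonic image;
# filtering along frequency keeps vertical ridges -> percussive image.
import numpy as np
from scipy.ndimage import median_filter

def hpss_images(S_mag, kernel=17):
    H = median_filter(S_mag, size=(1, kernel))    # smooth along time
    P = median_filter(S_mag, size=(kernel, 1))    # smooth along frequency
    harmonic = S_mag * (H > P)                    # hard masks; soft masks also common
    percussive = S_mag * (P >= H)
    return harmonic, percussive
```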

A two-dimensional neurogram is generated using a physiological computational model of the auditory periphery for phoneme classification and voice activity detection. DCT coefficients from the neurogram image are extracted as features and classified using a multi-layer perceptron for voice activity detection in [36]. A new phoneme classification technique based on discrete Radon transform features is illustrated in [7]. The method exhibited better performance under noisy conditions compared to other conventional techniques (Table 6).

Table 6 Summary of different algorithms based on MFCC, neurogram, scattergram, and tempogram time-frequency visual representation

3 Texture descriptors

Textural descriptors capture the prominent visual content present in a time-frequency image. Texture analysis is the process of distinguishing different textures into separate classes by identifying key features, and discerning an effective texture feature is a crucial step in enhancing algorithm performance. Several texture descriptors such as LBP, LPQ, GLCM, HOG, Gabor filters, central moments and other LBP variants are used in the literature for audio classification tasks. This section briefly summarizes the most widely used descriptors.

3.1 Local Binary Pattern (LBP) and LBP variants

The local binary pattern (LBP) is the most widely used texture encoding scheme in the literature. LBP delivers remarkable performance across audio applications, including music genre recognition, bird species classification and acoustic scene classification [51, 77]. LBP operates on the local neighbourhood of a central pixel to produce a local binary pattern; this matters because of the non-uniformity of the textures in a visual image, so local feature extraction is usually considered during the feature construction phase. The feature vector describing the textural content of the image is the histogram of the local binary patterns found at all pixels of the image. Two parameters are important during LBP feature extraction: the number of neighbouring pixels taken into account around the central pixel (P), and the distance between the central pixel and its neighbours (R) [2, 76].
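
A hedged sketch of LBP histogram extraction at two (P, R) settings is given below; the specific neighbourhood sizes and the uniform mapping are illustrative choices, not tied to any particular surveyed paper.

```python
# Sketch: uniform LBP histograms at two (P, R) scales, concatenated into one
# texture vector; gray_img is a 2-D grayscale time-frequency image.
import numpy as np
from skimage.feature import local_binary_pattern

def multi_scale_lbp(gray_img, settings=((8, 1), (16, 2))):
    feats = []
    for P, R in settings:
        codes = local_binary_pattern(gray_img, P=P, R=R, method="uniform")
        # The "uniform" mapping yields P + 2 distinct codes.
        hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
        feats.append(hist)
    return np.concatenate(feats)
```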

The LBP has been used in [1, 2, 5, 23, 25, 27, 28, 32, 33, 42, 47,48,49, 51, 52, 55, 57, 69, 73, 76,77,78, 80, 83, 85], attaining strong performance in various applications. Different LBP variants are also used, such as RICLBP, CoALBP and NTLBP [47,48,49,50,51], RLBP [32, 33, 47, 49, 51, 85], ULP [82], LBPHF [21, 82], ECLBP [3], CLBP [21, 37, 51], and μLBP and RILBP [5, 51].

3.2 Grey-Level Co-occurrence Matrix (GLCM)

The grey-level co-occurrence matrix (GLCM) textural descriptor examines the spatial relationship among local pixels and is also known as the gray-level spatial dependence matrix. The GLCM characterizes the texture of an image by counting how often pairs of pixel values occur in a specified spatial relationship. Common statistical measures derived from it include contrast, correlation, energy and homogeneity. After LBP and HOG, GLCM is the most widely used feature extraction scheme in different algorithms [20, 26, 27, 34, 37, 40, 53, 54, 63, 79, 82].
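
A hedged sketch of GLCM statistics computed from an 8-bit time-frequency image is shown below; the pixel distances and angles are illustrative assumptions.

```python
# Sketch: GLCM statistics (contrast, correlation, energy, homogeneity) from an
# 8-bit grayscale image; distances and angles are illustrative choices.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_img_uint8, distances=(1, 2), angles=(0, np.pi / 2)):
    glcm = graycomatrix(gray_img_uint8, distances=distances, angles=angles,
                        levels=256, symmetric=True, normed=True)
    props = ("contrast", "correlation", "energy", "homogeneity")
    return np.concatenate([graycoprops(glcm, p).ravel() for p in props])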

3.3 Histogram of Oriented Gradients (HOG)

The histogram of oriented gradients (HOG) effectively captures the shape and appearance of an image using edge directions or the intensity gradient distribution. Similar to LBP, HOG is obtained by dividing the input image into small regions and concatenating the per-region histograms. HOG descriptors are the most popular after LBP features; algorithms in which HOG descriptors are employed include [1, 4, 17, 28, 42, 48, 50, 54, 56, 76, 80].
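
For completeness, a hedged sketch of a HOG descriptor from a spectrogram image follows; the orientation count and the cell/block sizes are illustrative, not taken from any cited paper.

```python
# Sketch: HOG descriptor from a grayscale time-frequency image via skimage;
# cell and block parameters are illustrative.
from skimage.feature import hog

def hog_features(gray_img):
    return hog(gray_img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)
```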

3.4 Local Phase Quantization (LPQ)

Local phase quantization (LPQ) is a descriptor robust to centrally symmetric blur, obtained by quantizing the phase angles of the local Fourier transform at different low frequencies [24, 45]. The LPQ feature extraction approach effectively characterizes the underlying textural variations in an image. In the literature, techniques based on LPQ descriptors include [24, 32, 33, 37, 45, 47, 49, 51, 85].

3.5 Other texture features

In addition to the textural features described above, a few other descriptors are employed by different researchers. These features are mostly combined with LBP, LPQ, GLCM and HOG before classification, and fusing these textural descriptors yields enhanced algorithm performance. These features are: block energy [43], central moments [30, 59,60,61,62, 67, 76], Gabor features [47, 49, 73,74,75, 82], LTP and HASC [52], and WLD [50].

4 Classifiers

The suitable choice of classifier is one of the dominant factors in classification. The methodologies most widely used for classification in the surveyed works are the support vector machine (SVM), linear discriminant analysis (LDA), artificial neural network (ANN), k-nearest neighbor (KNN) and random forest (RF). This section briefly summarizes the different classification algorithms used.

The support vector machine (SVM) is one of the most widely used supervised machine learning algorithms for audio classification. The SVM is the most common among linear separation algorithms since it is virtually parameter free and has been shown to match or exceed the performance of more complex algorithms. A variety of kernel functions are employed in SVMs, such as linear, polynomial, Gaussian and RBF kernels. SVM implementation using LibSVM is a popular choice of various researchers [32, 83], while the LibLinear package is also employed in some studies, as illustrated in [28]. Over 70% of the surveyed algorithms employ the SVM as the classification algorithm. For real-world recognition tasks, SVM based multi-class classification methods appear to be very appropriate. Works that employ the SVM include [5, 9, 17, 20, 24,25,26,27,28,29,30, 32, 39,40,41,42,43, 45, 47, 48, 52, 53, 55, 57, 59, 61,62,63, 65, 67, 68, 74, 75, 78, 82, 83, 85].
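
A hedged sketch of the two multi-class SVM strategies mentioned in the surveyed work (one-against-one and one-against-all) using scikit-learn is given below; the kernel and regularization values are illustrative.

```python
# Sketch: OAO vs. OAA multi-class SVMs with scikit-learn (parameters illustrative).
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

oao_svm = SVC(kernel="rbf", C=10)                       # libsvm-style one-against-one
oaa_svm = OneVsRestClassifier(SVC(kernel="rbf", C=10))  # one-against-all wrapper

# Either model is trained on the texture descriptors from Section 3:
# model.fit(train_features, train_labels); predictions = model.predict(test_features)
```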

K-nearest neighbors (KNN) is a simple non-parametric algorithm used in pattern recognition. In the KNN algorithm, the training instances of the dataset are represented as data points in the feature space and divided into several separate classes. To predict the class of a new instance, it is placed in the same feature space and assigned the class most common among its nearest neighbors. KNN is utilized by different researchers in [60, 62, 63, 65, 69, 76, 81].

Artificial neural networks (ANN) are motivated by the functioning of excitatory and inhibitory neuron connections in the human brain. The multilayer perceptron (MLP) is a feed-forward network with an input layer, an output layer, and one or more hidden layers. Usually, this network is trained with the backpropagation technique, where the prediction error is propagated from the output layer back to the input layer, modifying the interconnection weights. ANN classifiers are used in [21, 36, 37, 69, 76, 79]. A few authors also presented experimental evaluation using extreme learning machine (ELM) classification [14, 76].

Random forest (RF) is a fast, highly precise, noise-resistant ensemble classification algorithm. Random forest classification is utilized in [4, 10, 41, 69, 76]. Additionally, authors have used the Gaussian mixture model (GMM) [10], linear discriminant analysis (LDA) [73] and RANSAC [34] for audio classification tasks.

5 Discussions

The survey presents a systematic methodology to compose a comprehensive record of current research dynamics and algorithms focusing on time-frequency image texture features employed in audio applications. Initial attempts in the field of time-frequency texture image algorithm development were primarily focused on spectrogram based music genre classification [5, 23,24,25,26,27, 43, 73, 74] and acoustic event classification [9, 29, 30, 39]. Later, this trend was extended to the development of bird and animal sound detection techniques [41, 47, 52, 82, 85]. In addition, the application area spans audio applications such as language identification [21, 45], Chinese folk song recognition [78], speech emotion recognition [54], snore sound discrimination [28], speech-music classification [10, 14] and identification of infants' cries [32]. Overall, the time-frequency visualization and texture feature approach is found suitable and efficient in speech, music and audio applications.

Spectrogram image textures are the most widely used (in more than 70% of the works) for algorithm development (Fig. 3). The spectrogram representation efficiently characterizes and provides profound attributes of an audio sample, and spectrogram image texture is exploited in various applications in order to capture the relevant details. Apart from the spectrogram, CQT and cochleagram visualizations are also widely utilized. It is also found that the CQT representation is better suited for acoustic scene classification, as evident from Table 5; this might be because the CQT image captures the texture of acoustic sounds more efficiently than other representations. Chromagram image texture descriptors, in contrast, are effective in music applications [15, 58].

Texture features play an important role in the classification. From the literature, it is found that the local binary pattern is the most widely used descriptor for feature extraction from the time-frequency image. It is also evident from the studies that combining different textural descriptors attains superior classification performance compared to individual features, for example GLCM and LBP [27]; Gabor and LBP [73]; HOG, LPQ, LBP, HARA, LCP, DENSE, WLD, RICLBP, CoALBP and NTLBP [48]; LBP, μLBP and RILBP [5]; LBP and HoG [42]; LBP, LTP and HASC [52]; LBP variants, LCP, DENSE, HOG and WLD [50]; LBP, LPQ, RICLBP, LBPHF, MLPQ, HASC, ELHF and GABOR [49]; and LBP, RLBP and LPQ [85].

The choice of classification algorithm is dominated by the support vector machine across the different audio application areas (Section 4), although some works present evaluations using artificial neural network, random forest and k-nearest neighbors classifiers [11,12,13]. Ensemble classifiers are also explored in [23, 77]. The SVM shows good potential even in noisy environments because of its robustness to such conditions.

It is important to note that the local feature extraction approach attains exceptional performance compared to the global extraction scheme. To accomplish this, a zoning technique is employed in which the time-frequency image is divided into different regions and descriptors are then extracted from each region [25, 47,48,49,50].

The major issue with all the algorithms is the feature vector dimensionality. Texture descriptors such as LBP, LPQ and their variants generate large-dimensional vectors, which strongly impacts the training and testing times of the classifier in addition to the classification performance. However, feature selection is addressed only rarely in the literature, through methods such as PCA [46, 57], coefficient of variance [78], ReliefF [82], the chaotic crow search algorithm [14] and GWO [21]. In future, it is worth exploring the effect of feature selection when a combination of textural descriptors is employed in application development; a sketch of such a dimensionality-reduction stage is given below.
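
The following hedged sketch shows two generic ways of shrinking a large concatenated texture vector before the SVM stage, using PCA or a simple univariate selector; the component and feature counts are illustrative assumptions, not values from the surveyed papers.

```python
# Sketch: reducing the dimensionality of concatenated texture descriptors
# (e.g., LBP + LPQ + GLCM) before classification; counts are illustrative.
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

pca_pipeline = make_pipeline(PCA(n_components=100), SVC(kernel="rbf"))
select_pipeline = make_pipeline(SelectKBest(f_classif, k=200), SVC(kernel="rbf"))

# Either pipeline is fit on the combined descriptors:
# pca_pipeline.fit(train_features, train_labels)
```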

6 Conclusion

In this survey, a comprehensive overview of state-of-the-art research on time-frequency texture image features in audio classification algorithms is presented. Firstly, we identified salient characteristics from the existing literature and presented a generalized architecture of the time-frequency texture feature extraction approach in audio classification algorithms, which we believe will help new researchers in this area comprehend the overall composition. Next, key characteristics and categories of time-frequency visual representations are identified along with the dominant texture feature extraction algorithms. Various time-frequency visualization algorithms in diverse audio applications are categorized and compared using their key aspects. A brief discussion of the feature selection approaches utilized in several applications is also given. Finally, some open research challenges and future trends in these fields are outlined.