1 Introduction

When examining the speech production process in females and males, it can be noticed that female formants are higher in frequency than those of their male counterparts, and that the spectrum of female voiced sounds is lower in amplitude than that of male sounds, since spectral amplitude usually decreases with increasing frequency. Because these acoustic effects originate in the speech production process itself, gender-specific features can be found in acoustic speech signals [1].

With the rise of transgender identities in recent years, and with vocal cord surgeries performed to adapt the voice to the new gender, it has become essential for security and identification purposes to determine a person's gender regardless of the newly adopted one.

Acoustic voiced sounds are generated through the vibration of the vocal cords, which gives voice its periodic behavior. The frequency of this oscillation is called the pitch. The pitch and the other frequency-related features used in this work have a clear relationship with gender, and the low frequencies in particular, which carry the most important speech properties, are useful in automatic speech recognition (ASR). It is important to note that male speech exhibits lower-frequency behavior than female speech. The pitch depends in particular on the physical characteristics of the glottis, namely its mass and elasticity [2, 3]. Speech is represented as a discrete signal x(n); the pitch corresponds to the fundamental frequency (f0) at which the signal repeats, and its inverse is the fundamental period (T0). Statistically, there are characteristic frequency intervals for each language: for example, the pitch of Spanish men lies in the interval 50–300 Hz, while Spanish women and children can reach 500 Hz [4].
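As a rough illustration of the pitch discussion above, the following sketch estimates f0 of a voiced frame by autocorrelation and searches only the 50–500 Hz range cited for Spanish speakers; the estimator and all names are ours, not part of the original work.

```python
# Minimal sketch (assumption: autocorrelation-based pitch estimation, not
# necessarily the estimator used in this paper).
import numpy as np

def estimate_f0(frame, fs, fmin=50.0, fmax=500.0):
    """Return an f0 estimate in Hz for one voiced frame."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # non-negative lags
    lag_min = int(fs / fmax)                       # shortest candidate period T0
    lag_max = min(int(fs / fmin), len(ac) - 1)     # longest candidate period T0
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return fs / lag                                # f0 = 1 / T0

# Example: a synthetic 120 Hz tone sampled at 16 kHz
fs = 16000
t = np.arange(0, 0.05, 1 / fs)
print(round(estimate_f0(np.sin(2 * np.pi * 120 * t), fs)))  # ~120
```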

Many studies have been published in the field of gender recognition, but none has examined the effect of the third gender on voice. In this study, we try to analyze that effect; to our knowledge, no previous work has been published under this problem statement, so it opens a new line of research. Through our work, we were able to recognize third gender voices, but, as the results and discussion section shows, we did not achieve high classification accuracies. One limitation of our work lies in the features extracted: more features, and features more closely related to the third gender, need to be extracted. The other limitation is the need for a dataset of third gender voice samples recorded from speakers who have undergone vocal tract transformation surgery to change the voice from one gender to the other, so that the original voice properties that distinguish the third gender from the other two genders can be studied.

2 Related Work

Many studies have confirmed the existence of unique acoustic and physiological features in female and male voices that can be used to distinguish male from female voices, but the required classification accuracy has not yet been reached [5, 6]. In the past few years, gender recognition has gained the interest of many researchers.

In 2015, Faek [7] used the first four formants and twelve MFCCs as features and an SVM as the classifier. A dedicated feature selection step was applied based on the frequency range that each feature represents. A total of 114 speech samples uttered in the Kurdish language were used in this work. The two-class (adult male and adult female) gender recognition model reached 96% recognition accuracy.

In 2019, Shaqra et al. [8] designed four emotion recognition models to capture the relationship between gender/age and emotion. The results showed that using the same classifier separately on each gender and on different age ranges yielded higher accuracy than applying it to all samples without categorization. This indicates that gender and age affect emotion recognition accuracy through their direct effect on voice.

In 2019, Alkhawaldeh et al. [9] analyzed gender-related features in speech and experimented with three feature selection algorithms to find the best features. They then studied machine learning (ML) models with different theoretical backgrounds to identify the best gender-related ML models. The best result obtained was 99.7% classification accuracy.

In 2019, Abdulsatar [10] proposed a two-part system. The first part, pre-processing and feature extraction, selected the best features for recognizing gender, namely the first four formant frequencies and twelve MFCCs, and K-NN was used for classification. The highest classification accuracy obtained was 66%.

In 2021, Kwasny [11] applied d-vector and x-vector deep neural network (DNN)-based embedder architectures to gender classification experiments, together with a transfer learning-based training scheme in which the embedder network is pre-trained. The best overall gender recognition performance achieved was 99.60%.

Several challenges were faced. One was the psychological state of the speaker and its effect on voice: when a person is psychologically unstable, vowels are pronounced differently than when the person is psychologically stable [8, 12]. Another challenge was the similarity among children's voices between the ages of 3 and 7 years; children were therefore excluded from the experiments conducted with the method proposed in this work.

A limitation of the state-of-the-art works is that they avoid the third gender (neither female nor male). In this work, many experiments were conducted on the third gender, making it, to our knowledge, the first paper that studies the differences among the three genders (female, male and other).

3 Materials and Methodology

  • Dataset

    The dataset utilized in this work is the Common Voice dataset, version en_2637h_2021-07-21. The Common Voice dataset covers 60 different languages, recorded in different proportions and accents: 23% US English, 8% England English, 7% India and South Asia (India, Pakistan, Sri Lanka), 3% Australian English, 3% Canadian English, 2% Scottish English, 1% Irish English, 1% Southern African (South Africa, Zimbabwe, Namibia), 1% New Zealand English and many other languages.

    The Common Voice dataset was recorded by 75,879 different individual voices across different age ranges. Six percent of the individuals were younger than 19 years old, meaning that roughly 4552 speakers under the legal age participated in recording this public online dataset, while 94% of the dataset was recorded by individuals older than 18 years. All voices belonging to speakers under the legal age (<18) were excluded from our experiments, and only adult voices were included.

    The Common Voice dataset contains adult voices of three genders: 45% male, 15% female and 40% third gender [13]. The number of samples tested in the experiments of this work was 71,327, of which 32,098 (45%) were male voices, 10,699 (15%) were female voices and 28,530 (40%) were voices of other genders.
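    As an illustrative sketch only (not the authors' exact pipeline), the gender split above can be reproduced from the Common Voice metadata by dropping under-age speakers and counting the gender labels; the file path and the choice of treating the "teens" label as under-age are assumptions.

```python
# Hypothetical filtering of the Common Voice metadata (column names "age" and
# "gender" follow the Common Voice TSV format; the path is illustrative).
import pandas as pd

meta = pd.read_csv("cv-corpus/en/validated.tsv", sep="\t")

# Common Voice encodes age as decade labels; "teens" is treated here as under-age.
adults = meta[meta["age"].notna() & (meta["age"] != "teens")]

# Shares of male / female / other among the remaining adult recordings
print(adults["gender"].value_counts(normalize=True))
```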

4 Framework

  • The block diagram of the proposed method is shown in Fig. 1. In the following three sections, the main steps of the proposed method are illustrated and discussed in detail.

    Fig. 1 Block diagram of the method proposed in this work

  • Pre-processing

    All samples were segmented into windows whose length is a ratio of 0.05 of the original signal, with an overlap ratio of 0.025, which provides 50% overlap; the total number of segments generated is therefore (2n − 1), where n is the number of original non-overlapping segments.
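    A minimal sketch of this segmentation step is given below, assuming windows of 0.05 of the signal length with a hop of 0.025 (50% overlap); the names are illustrative.

```python
# Frame the signal with 50% overlap; n non-overlapping windows yield 2n - 1 frames.
import numpy as np

def segment(x, win_ratio=0.05, hop_ratio=0.025):
    win = max(1, int(len(x) * win_ratio))   # window length in samples
    hop = max(1, int(len(x) * hop_ratio))   # hop length = half a window
    return np.array([x[s:s + win] for s in range(0, len(x) - win + 1, hop)])

x = np.random.randn(16000)                  # 1 s of audio at 16 kHz, toy example
frames = segment(x)
print(frames.shape)                         # (39, 800): 2 * 20 - 1 frames of 800 samples
```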

  • Feature Extraction

    A feature is a measurable property established from the material being observed [14]. The most important aspect of feature extraction is extracting the features most relevant to the problem statement. In this work, the extracted features were smoothness, pitch, the first two formants and spectral centroid variability (SCV), which are strongly related to gender and were selected according to the experiments conducted.

    Smoothness is defined in terms of the transmission of speech through air: the smoother the speech, the slower its transmission, while rougher speech is transmitted faster [6]. Smoothness was calculated in two domains, the time domain and the spectral domain, through Eqs. (1) and (2) [15].

    $$ {\text{GV}}_t = \frac{1}{P}\sqrt {\mathop \sum \nolimits_{j = 1}^P \left( {{\text{var}}_t \left( j \right)} \right)^2 } $$
    (1)
    $$ {\text{GV}}_s = \frac{1}{N}\sqrt {\mathop \sum \nolimits_{i = 1}^N \left( {{\text{var}}_s \left( i \right)} \right)^2 } $$
    (2)

    where \({\text{var}}_t\) and \({\text{var}}_s\) represent the variances of the spectral feature in the time and spectral domains, respectively, P is the dimension of the feature and N is its length in the time domain.
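    Assuming the spectral feature is stored as a matrix with one row per frame and one column per coefficient, Eqs. (1) and (2) can be sketched as follows (variable names are ours):

```python
# Smoothness (global variance) in the time and spectral domains, after Eqs. (1)-(2).
import numpy as np

def smoothness(F):
    """F has shape (N frames, P coefficients)."""
    N, P = F.shape
    var_t = np.var(F, axis=0)                 # variance over time of each coefficient
    var_s = np.var(F, axis=1)                 # variance over coefficients in each frame
    gv_t = np.sqrt(np.sum(var_t ** 2)) / P    # Eq. (1)
    gv_s = np.sqrt(np.sum(var_s ** 2)) / N    # Eq. (2)
    return gv_t, gv_s

print(smoothness(np.random.randn(200, 13)))   # e.g. 200 frames of a 13-dimensional feature
```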

  • Feature Selection

    In this work, the ANOVA feature selection algorithm was used to filter the features ahead of constructing the decision tree, removing all irrelevant features and passing only the selected ones to the decision tree.

    The decision tree is used for classification, as an embedded-type feature selection algorithm, and more generally in data mining and machine learning [16, 17].

    Most decision tree implementations in previous state-of-the-art works, such as ID3 [18], C4.5 [19] and CART [20], did not measure the importance of each feature with respect to the other classes and the final classification results. Therefore, in this work, the importance of each feature is determined with respect to each class. Each feature is weighted, and the weight is used both in feature selection and, finally, in decision tree construction. Feature weights are calculated in turn; the feature with the highest weight is selected as the root feature of the next layer, and so on, until the decision tree is constructed.
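    A highly simplified sketch of this weighting idea follows; it is not the authors' exact algorithm, and the choice of a one-vs-rest ANOVA F-score as the per-class weight is our assumption.

```python
# Per-class feature weights; the highest-weighted feature would seed the next tree layer.
import numpy as np
from sklearn.feature_selection import f_classif

def per_class_weights(X, y):
    weights = {}
    for c in np.unique(y):
        f_scores, _ = f_classif(X, (y == c).astype(int))   # one-vs-rest weight per feature
        weights[c] = f_scores
    return weights

X = np.random.randn(300, 5)                         # 5 candidate features, toy data
y = np.random.choice(["female", "male", "other"], size=300)
w = per_class_weights(X, y)
root_feature = int(np.argmax(sum(w.values())))      # highest combined weight -> root feature
print(root_feature)
```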

    A filter ranking method was used in this work for three reasons: first, to filter out the less relevant variables; second, to benefit from the variable selection criteria offered by variable ranking techniques; and third, because of their simplicity and the good results reported for practical applications. A ranking criterion is used to score each variable, then a threshold, fixed through experiment, is used to remove the variables that score below it [14].

    Feature selection methods that are applied before classification are considered filter methods, which is why ranking methods are regarded as filter methods. The main principle of feature selection is to select unique features that carry useful information about the different classes in the dataset, using a basic property of each feature. This property, called feature relevance, measures the power of the feature in separating the different classes [14, 21, 22]. The chi-square and analysis of variance (ANOVA) statistical feature selection methods were used in this work to measure the independence of the selected features.
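    The filter ranking just described can be sketched as follows, with scikit-learn's ANOVA (f_classif) and chi-square (chi2) scoring functions standing in for the paper's implementation; the threshold value is arbitrary.

```python
# Score every feature, then keep only those above an experimentally fixed threshold.
import numpy as np
from sklearn.feature_selection import f_classif, chi2

X = np.abs(np.random.randn(300, 6))          # chi2 requires non-negative feature values
y = np.random.choice([0, 1, 2], size=300)    # the three gender classes

anova_scores, _ = f_classif(X, y)
chi2_scores, _ = chi2(X, y)

threshold = 1.0                              # fixed through experiment in the paper
selected = np.where(anova_scores > threshold)[0]
print(anova_scores, chi2_scores, selected)
```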

    The chi-square feature selection score was calculated through Eq. (3) [23].

    $$ \chi_c^2 = \sum_i \frac{\left( O_i - E_i \right)^2}{E_i} $$
    (3)

    where c is the number of degrees of freedom, \(O_i\) are the observed values and \(E_i\) are the expected values.
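    For a toy set of observed and expected counts, Eq. (3) works out as follows (the numbers are illustrative only):

```python
# Direct implementation of Eq. (3).
import numpy as np

observed = np.array([18, 30, 12])
expected = np.array([20, 25, 15])

chi_square = np.sum((observed - expected) ** 2 / expected)
print(round(chi_square, 3))   # 0.2 + 1.0 + 0.6 = 1.8
```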

    ANOVA, developed by the statistician Ronald Fisher [24], is a set of statistical models and associated estimation procedures, such as the variation among and between groups of features, used to analyze differences between means. ANOVA is based on the law of total variance, whereby the variance observed in a specific variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether the means of two or more groups of features are equal, unlike the t-test, which involves only two means.

    The proportion of variance in ANOVA represented by a feature or groups of features can be found through Eq. (4).

    $$ \text{Variance} = \frac{\text{SST}}{\text{Total SS}} $$
    (4)

    where SST is the treatment (between-group) sum of squares and Total SS is the total sum of squares. The higher this ratio, the better the group of features represents the data; in other words, the groups of features with the highest proportion should be selected [25].
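    A worked example of Eq. (4) on toy per-class feature values (illustrative numbers only):

```python
# Proportion of variance explained by the grouping: SST / Total SS.
import numpy as np

groups = [np.array([1.0, 2.0, 3.0]),     # e.g. one group of feature values per gender class
          np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0])]

grand_mean = np.mean(np.concatenate(groups))
sst = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # treatment sum of squares
total_ss = sum(((g - grand_mean) ** 2).sum() for g in groups)      # total sum of squares

print(sst / total_ss)   # 54 / 60 = 0.9
```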

  • Feature Classification

    The backpropagation neural network (NN) and the Gaussian mixture model (GMM) classifiers were selected in this work to classify the three genders.
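    A hedged sketch of these two classifiers using scikit-learn stand-ins is given below: an MLPClassifier trained with backpropagation for the NN, and one GaussianMixture per gender for the GMM. The hyperparameters are illustrative, not those used in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.mixture import GaussianMixture

X = np.random.randn(600, 4)                        # e.g. pitch, F1, F2, SCV (toy values)
y = np.random.choice([0, 1, 2], size=600)          # female / male / other

# Backpropagation NN
nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X, y)

# One GMM per class; predict the class whose GMM gives the highest log-likelihood
gmms = {c: GaussianMixture(n_components=2, random_state=0).fit(X[y == c]) for c in np.unique(y)}

def gmm_predict(samples):
    scores = np.column_stack([gmms[c].score_samples(samples) for c in sorted(gmms)])
    return np.argmax(scores, axis=1)

print(nn.predict(X[:5]), gmm_predict(X[:5]))
```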

5 Results and Discussion

The aim of the experiments conducted in this work was to test which of the features (smoothness, pitch, the first two formants and spectral centroid variability (SCV)), or which group of features, performs best in gender recognition with each of the two classification algorithms, the backpropagation NN and the GMM. The experiments also aim to evaluate the performance of the newly proposed feature selection algorithm with respect to each feature and classifier.

  • Gender recognition results with respect to each feature and classifier

The four feature types were tested individually with each of the two classifiers, after applying the feature selection algorithm proposed in this work.

Table 1 shows the female gender classification performance by feature and classifier. The results show that the backpropagation NN outperforms the GMM across all feature types, and that the first two formants achieved the highest classification accuracy for female speech, higher than the other three feature types.

Table 1 Female recognition results

Table 2 shows the male gender classification performance by feature and classifier. The results show that the backpropagation NN outperforms the GMM across all feature types, and that the pitch feature achieved the highest classification accuracy for male speech, higher than the other three feature types.

Table 2 Male recognition results

Table 3 shows the third gender classification performance by feature and classifier. The results show that the backpropagation NN outperforms the GMM across all feature types, and that the pitch feature achieved the highest classification accuracy for third gender speech, higher than the other three feature types. Overall, however, the third gender was frequently misclassified, mostly as male; this explains why the pitch feature performed best here, as it did for male speech, since the male and third gender speech signals share similar properties. Because of this misclassification, Table 3 shows low classification accuracies compared with the accuracies for the other two genders reported in Tables 1 and 2.

Table 3 Third gender recognition results
  • Gender recognition results with respect to best feature group and classifier

After evaluating all six combinations of the four feature types with the best features selected, the highest accuracy was obtained using the pitch and the first two formants with the backpropagation NN classifier, as shown in Table 4. The highest accuracy gained was 74.87% over all three genders. If the third gender is excluded from the experiments, the highest classification accuracy achieved for classifying the female and male genders is 97.71% with the backpropagation NN and 91.03% with the GMM classifier, using the pitch and the first two formant features.

Table 4 All genders’ recognition results

6 Conclusion

Designing a system that can recognize age in the same setting was challenging, because age is related to language, and each language has a different frequency range for males, females and children.

The similar frequency behavior of male and third gender voices caused considerable ambiguity in the system designed in this work, despite the strong gender-related features extracted and the newly designed feature selection method.

It is clearly noticed that the third gender recognition accuracies achieved are very low compared with those of the other two genders, which indicates two things. First, there are special properties in transgender speech that distinguish it from the other genders, but those properties are weak or were not extracted well. Second, many third gender speech samples were misclassified as male voices, and vice versa, which lowered the classification accuracy for both the male and the third gender; a possible explanation is that the original gender of the transgender speakers was male, and the original speech properties remain in their speech regardless of the newly chosen gender.