Automated Spoken Language Identification Using Convolutional Neural Networks & Spectrograms

  • Conference paper
Key Digital Trends Shaping the Future of Information and Management Science (ISMS 2022)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 671)

Abstract

The automated identification of spoken languages from voice signals is known as automatic language identification (LID). Automated LID has many applications, including global customer support systems and voice-based user interfaces for machines. Hundreds of languages are widely spoken around the world, and learning all of them is practically impossible for any one person. Machine learning methods have been used effectively to automate LID and translation. However, machine learning-based LID relies heavily on handcrafted feature engineering. Manual feature extraction depends on individual expertise and is prone to many deficiencies. Conventional feature extraction not only delays the development of automated LID systems significantly but also leads to inaccurate and non-scalable systems. In this paper, a deep learning-based approach using spectrograms is proposed. A Convolutional Neural Network (CNN) model is designed for the task of automatic language identification. The proposed model is trained on VoxForge speech data in five languages, viz. German, Dutch, English, French, and Portuguese. Accuracy, precision, recall, and F1-score are used as evaluation measures. The proposed approach is compared against traditional approaches as well as other existing deep learning approaches for LID. The proposed model outperforms its competitors with an average F1-score above 0.9 and an accuracy of 91.5%.
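
To make the spectrogram input concrete, the following is a minimal sketch of how a log-power spectrogram might be computed from a raw waveform before it is passed to a CNN. The abstract does not state the paper's actual settings, so the frame length, hop size, Hann window, and the function names `hann` and `log_spectrogram` are all assumptions chosen for illustration. A naive DFT is used only to keep the example free of external dependencies.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def log_spectrogram(signal, frame_len=64, hop=32, eps=1e-10):
    """Return a list of frames, each a list of log-power values in dB.

    Each frame holds frame_len // 2 + 1 bins (0 Hz up to Nyquist).
    frame_len and hop are illustrative defaults, not the paper's settings.
    A naive O(n^2) DFT is used here; a real system would use an FFT library.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    # Precompute the complex exponentials of the DFT once.
    twiddle = [
        [cmath.exp(-2j * math.pi * k * n / frame_len) for n in range(frame_len)]
        for k in range(n_bins)
    ]
    frames = []
    # Slide a window over the signal, dropping any incomplete final frame.
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            coeff = sum(c * w for c, w in zip(chunk, twiddle[k]))
            # eps avoids log10(0) for silent bins.
            row.append(10.0 * math.log10(abs(coeff) ** 2 + eps))
        frames.append(row)
    return frames


# Hypothetical input: a 1 kHz tone sampled at 8 kHz (512 samples).
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * t / sample_rate) for t in range(512)]
spec = log_spectrogram(tone)

# The strongest bin of the first frame, converted back to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

For the synthetic tone, the energy concentrates in the bin that corresponds to 1 kHz. The resulting time-by-frequency grid can be treated as a single-channel image, which is what makes convolutional models a natural fit for this representation.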

Author information

Corresponding author

Correspondence to Dilip Singh Sisodia.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Shrawgi, H., Sisodia, D.S., Gupta, P. (2023). Automated Spoken Language Identification Using Convolutional Neural Networks & Spectrograms. In: Garg, L., et al. Key Digital Trends Shaping the Future of Information and Management Science. ISMS 2022. Lecture Notes in Networks and Systems, vol 671. Springer, Cham. https://doi.org/10.1007/978-3-031-31153-6_14
