Multi-lingual Author Profiling: Predicting Gender and Age from Tweets!

  • Conference paper
Image Processing and Capsule Networks (ICIPCN 2020)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1200)

Abstract

This article describes a multi-lingual classification system for author profiling. Using Twitter corpora in English, Dutch, Italian and Spanish, we build separate models around an SVM classifier to predict the gender and age of an author. We evaluated each model with 3-fold cross-validation on the training dataset for each language. Under this scheme, the highest average accuracy for gender classification was 81.3%, obtained for Spanish, and the highest accuracy for age classification was 70.3%, obtained for English. For the other languages, the results ranged from 64% to 76%.
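The evaluation scheme in the abstract (an SVM classifier scored with 3-fold cross-validation on the training data) can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' system: the TF-IDF word n-gram features, the choice of `LinearSVC`, the helper name `evaluate`, and the six toy texts with their labels are all hypothetical and do not come from the paper or its corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy placeholder data (NOT from the paper's corpus): one string per author
# and a hypothetical binary label. Real experiments would load one corpus
# per language and repeat the evaluation for each.
texts = [
    "stuck in traffic again this morning",
    "finished the draft at last, time to rest",
    "what a match, decided in the final minute",
    "trying a new recipe for dinner tonight",
    "coffee first, then the gym, then work",
    "the train is late for the third day running",
]
labels = ["female", "male", "female", "male", "female", "male"]


def evaluate(texts, labels, folds=3):
    """Mean accuracy of a TF-IDF + linear SVM pipeline under k-fold CV."""
    model = make_pipeline(
        # Assumed feature set: word unigrams and bigrams weighted by TF-IDF.
        TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
        # Assumed SVM variant: a linear-kernel support vector classifier.
        LinearSVC(),
    )
    # For a classifier and an integer cv, scikit-learn uses stratified folds.
    scores = cross_val_score(model, texts, labels, cv=folds, scoring="accuracy")
    return scores.mean()


mean_accuracy = evaluate(texts, labels)
print(f"mean 3-fold accuracy: {mean_accuracy:.3f}")
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is refitted on the training folds only, so the held-out fold never leaks into the features. With six toy documents the printed score is meaningless; it only shows the mechanics.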

Notes

  1. https://pypi.python.org/pypi/stop-words

  2. Source: scikit-learn, http://scikit-learn.org/

Author information

Corresponding author

Correspondence to Yeasmin Ara Akter.

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Rahman, M.A., Akter, Y.A. (2021). Multi-lingual Author Profiling: Predicting Gender and Age from Tweets!. In: Chen, J.IZ., Tavares, J.M.R.S., Shakya, S., Iliyasu, A.M. (eds) Image Processing and Capsule Networks. ICIPCN 2020. Advances in Intelligent Systems and Computing, vol 1200. Springer, Cham. https://doi.org/10.1007/978-3-030-51859-2_46
