
Humans Inside: Cooperative Big Multimedia Data Mining

  • Chapter
  • First Online:
Innovations in Big Data Mining and Embedded Knowledge

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 159)

Abstract

Deep learning techniques such as convolutional neural networks, autoencoders, and deep belief networks require large amounts of training data to achieve optimal performance. Multimedia resources available on social media represent a wealth of data to satisfy this need. However, acquiring, labelling, and processing such data requires a prohibitive amount of effort. In this book chapter, we offer a threefold approach to tackle these issues: (1) we introduce a complex network analyser system for large-scale big data collection from online social media platforms, (2) we show the suitability of intelligent crowdsourcing and active learning approaches for the effective labelling of large-scale data, and (3) we apply machine learning algorithms for extracting and learning meaningful representations from the collected data. From YouTube, the world's largest video-sharing website, we have collected three databases containing a total of 25 classes, for which we have retrieved thousands of videos from a range of acoustic environments and human speech and vocalisation types. We show that, using the unique combination of our big data extraction and annotation systems with machine learning techniques, it is possible to create new real-world databases from social multimedia in a short amount of time.
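The active learning component mentioned in point (2) of the abstract typically prioritises the unlabelled clips a current model is least certain about, so that crowd annotators spend effort where it matters most. The following is a minimal, hypothetical sketch of that idea using least-confidence sampling with scikit-learn; the feature vectors, classifier choice, and batch size are illustrative and not taken from the chapter.

```python
# Hypothetical sketch of uncertainty-based active learning for audio labelling.
# Feature dimensions, classifier, and query batch size are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins for acoustic feature vectors of labelled / unlabelled clips.
X_labelled = rng.normal(size=(20, 8))
y_labelled = np.array([0, 1] * 10)          # toy binary labels
X_pool = rng.normal(size=(200, 8))          # unlabelled pool

clf = LogisticRegression().fit(X_labelled, y_labelled)

# Least-confidence score: 1 minus the probability of the predicted class.
# The clips with the highest scores are sent to crowd annotators first.
probs = clf.predict_proba(X_pool)
uncertainty = 1.0 - probs.max(axis=1)
query_idx = np.argsort(uncertainty)[-10:]   # 10 most uncertain clips

print(sorted(query_idx.tolist()))
```

After annotators label the queried clips, they are added to the labelled set and the model is retrained, repeating until the labelling budget is exhausted.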




Acknowledgements

This work was supported by the European Union's Seventh Framework Programme under grant agreement No. 338164 (ERC StG iHEARu).


Corresponding author

Correspondence to Shahin Amiriparian.


Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter


Cite this chapter

Amiriparian, S., Schmitt, M., Hantke, S., Pandit, V., Schuller, B. (2019). Humans Inside: Cooperative Big Multimedia Data Mining. In: Esposito, A., Esposito, A., Jain, L. (eds) Innovations in Big Data Mining and Embedded Knowledge. Intelligent Systems Reference Library, vol 159. Springer, Cham. https://doi.org/10.1007/978-3-030-15939-9_12
