Abstract
Deep learning techniques such as convolutional neural networks, autoencoders, and deep belief networks require large amounts of training data to achieve optimal performance. Multimedia resources available on social media represent a wealth of data to satisfy this need. However, acquiring, labelling, and processing such data requires a prohibitive amount of effort. In this book chapter, we offer a threefold approach to tackle these issues: (1) we introduce a complex network analyser system for large-scale big data collection from online social media platforms, (2) we show the suitability of intelligent crowdsourcing and active learning approaches for effective labelling of large-scale data, and (3) we apply machine learning algorithms for extracting and learning meaningful representations from the collected data. From YouTube, the world's largest video sharing website, we have collected three databases containing a total of 25 classes, for which we have gathered thousands of videos covering a range of acoustic environments and types of human speech and vocalisation. We show that, using the unique combination of our big data extraction and annotation systems with machine learning techniques, it is possible to create new real-world databases from social multimedia in a short amount of time.
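The second pillar of the approach, active learning, reduces annotation effort by querying labels only for the samples the current model is least certain about. The following is a minimal, self-contained sketch of an uncertainty-sampling loop; the toy data, the logistic-regression learner, and all variable names are illustrative assumptions, not the chapter's actual system:

```python
# Sketch of uncertainty-sampling active learning (illustrative only):
# repeatedly train on the labelled set, then query the pool sample
# whose class posterior is closest to uniform.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for audio features: two Gaussian classes in 2-D.
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

labelled = list(rng.choice(len(X), 10, replace=False))  # small seed set
pool = [i for i in range(len(X)) if i not in labelled]

model = LogisticRegression()
for _ in range(20):  # each round, query one sample for annotation
    model.fit(X[labelled], y[labelled])
    proba = model.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)   # low max-probability = uncertain
    query = pool.pop(int(uncertainty.argmax()))
    labelled.append(query)                # the "annotator" supplies y[query]

accuracy = model.score(X, y)
print(f"accuracy after active learning: {accuracy:.2f}")
```

In a crowdsourcing setting, the call that appends `query` to the labelled set would instead dispatch the sample to human annotators, whose trustability can additionally be weighted when aggregating their labels.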
Acknowledgements
This work was supported by the European Union's Seventh Framework Programme under grant agreement No. 338164 (ERC StG iHEARu).
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Amiriparian, S., Schmitt, M., Hantke, S., Pandit, V., Schuller, B. (2019). Humans Inside: Cooperative Big Multimedia Data Mining. In: Esposito, A., Esposito, A., Jain, L. (eds) Innovations in Big Data Mining and Embedded Knowledge. Intelligent Systems Reference Library, vol 159. Springer, Cham. https://doi.org/10.1007/978-3-030-15939-9_12
DOI: https://doi.org/10.1007/978-3-030-15939-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-15938-2
Online ISBN: 978-3-030-15939-9
eBook Packages: Intelligent Technologies and Robotics