Abstract
Deep learning techniques such as convolutional neural networks, autoencoders, and deep belief networks require large amounts of training data to achieve optimal performance. Multimedia resources available on social media represent a wealth of data to satisfy this need. However, acquiring, labelling, and processing such data requires a prohibitive amount of effort. In this book chapter, we offer a threefold approach to tackle these issues: (1) we introduce a complex network analyser system for large-scale big data collection from online social media platforms, (2) we show the suitability of intelligent crowdsourcing and active learning approaches for effective labelling of large-scale data, and (3) we apply machine learning algorithms for extracting and learning meaningful representations from the collected data. From YouTube, the world's largest video sharing website, we have collected three databases containing a total of 25 classes, for which we have gathered thousands of videos covering a range of acoustic environments and types of human speech and vocalisation. We show that, using the unique combination of our big data extraction and annotation systems with machine learning techniques, it is possible to create new real-world databases from social multimedia in a short amount of time.
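The second pillar of the approach, active learning, reduces annotation effort by querying labels only for the samples the current model is least certain about. The following is a minimal, self-contained sketch of an uncertainty-sampling loop; the toy data, the logistic-regression learner, and all variable names are illustrative assumptions, not the chapter's actual system:

```python
# Sketch of uncertainty-sampling active learning (illustrative only):
# repeatedly train on the labelled set, then query the pool sample
# whose class posterior is closest to uniform.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for audio features: two Gaussian classes in 2-D.
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

labelled = list(rng.choice(len(X), 10, replace=False))  # small seed set
pool = [i for i in range(len(X)) if i not in labelled]

model = LogisticRegression()
for _ in range(20):  # each round, query one sample for annotation
    model.fit(X[labelled], y[labelled])
    proba = model.predict_proba(X[pool])
    uncertainty = 1 - proba.max(axis=1)   # low max-probability = uncertain
    query = pool.pop(int(uncertainty.argmax()))
    labelled.append(query)                # the "annotator" supplies y[query]

accuracy = model.score(X, y)
print(f"accuracy after active learning: {accuracy:.2f}")
```

In a crowdsourcing setting, the call that appends `query` to the labelled set would instead dispatch the sample to human annotators, whose trustability can additionally be weighted when aggregating their labels.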
Acknowledgements
This work was supported by the European Union's Seventh Framework Programme under grant agreement No. 338164 (ERC StG iHEARu).
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Amiriparian, S., Schmitt, M., Hantke, S., Pandit, V., Schuller, B. (2019). Humans Inside: Cooperative Big Multimedia Data Mining. In: Esposito, A., Esposito, A., Jain, L. (eds) Innovations in Big Data Mining and Embedded Knowledge. Intelligent Systems Reference Library, vol 159. Springer, Cham. https://doi.org/10.1007/978-3-030-15939-9_12
DOI: https://doi.org/10.1007/978-3-030-15939-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-15938-2
Online ISBN: 978-3-030-15939-9
eBook Packages: Intelligent Technologies and Robotics