Keywords

1 Introduction

1.1 Aims and Learning Objectives

In this hands-on tutorial (details and material at: https://socialmediaie.github.io/tutorials/), we introduce participants to working with social media data, which are an example of Digital Social Trace Data (DSTD). The DSTD abstraction allows us to model social media data together with the rich information associated with social media text, such as authors, topics, and time stamps. We introduce participants to several Python-based, open-source tools for performing Information Extraction (IE) on social media data. Furthermore, participants will be familiarized with a catalogue of more than 30 publicly available social media corpora for various IE tasks such as named entity recognition (NER), part of speech (POS) tagging, chunking, super sense tagging, entity linking, sentiment classification, and hate speech identification. We will also show how these approaches can be extended to work in a multi-lingual setting. Finally, participants will be introduced to the following applications of extracted information: (i) combining network analysis and text-based signals to rank accounts, and (ii) correlating sentiment with user-level attributes in existing corpora. The tutorial aims to serve the following use cases for social media researchers: (i) high accuracy IE on social media text via multi-task and semi-supervised learning, including recent transformer-based tools that work across languages, (ii) rapid annotation of new data for text classification via active human-in-the-loop learning, (iii) temporal visualization of the communication structure in social media corpora via the social communication temporal graph visualization technique, and (iv) detecting and prioritizing needs during crisis events (e.g., COVID-19).
We propose a full day tutorial session using Python-based open-source tools. This tutorial builds upon our previous tutorials on this topic at ACM Hypertext 2019, IC2S2 2020, and WWW 2021.
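
As a rough illustration of the DSTD abstraction described above, a social media post can be modeled as text plus the metadata attached to it. The sketch below is not taken from the tutorial material: the class name `TraceRecord`, its fields, and the helper `group_by_author` are hypothetical names chosen for this example, and the sample records are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class TraceRecord:
    """Hypothetical DSTD-style record: a text plus its social metadata."""
    text: str
    author: str
    created_at: datetime
    topics: List[str] = field(default_factory=list)


def group_by_author(records: List[TraceRecord]) -> Dict[str, List[TraceRecord]]:
    """Collect records per author, each list ordered by time stamp."""
    grouped: Dict[str, List[TraceRecord]] = {}
    for record in sorted(records, key=lambda r: r.created_at):
        grouped.setdefault(record.author, []).append(record)
    return grouped


# Made-up sample data, for illustration only.
sample_records = [
    TraceRecord("Second post", "user_a", datetime(2021, 4, 2), ["weather"]),
    TraceRecord("First post", "user_a", datetime(2021, 4, 1)),
    TraceRecord("Hello", "user_b", datetime(2021, 4, 3), ["greeting"]),
]
records_by_author = group_by_author(sample_records)
```

Keeping the author and time stamp next to the text is what makes the later user-level and temporal analyses possible without re-joining separate tables.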

1.2 Scope and Benefit to the ECIR Community

Information extraction (IE) is a common sub-area of natural language processing that focuses on identifying structured data from unstructured data. While many open source tools are available for performing IE on newswire and academic publication corpora, there is a lack of such tools for social media corpora, which tend to exhibit very different linguistic patterns compared to the other corpora. It has also been found that publicly available IE tools, which are trained on news and academic corpora, tend not to perform well on social media corpora. Topics of interest include: (i) Machine learning for social media IE, (ii) Generating annotated text classification data using active human-in-the-loop learning, (iii) Public corpora for social media IE, (iv) Open source tools for social media IE, (v) Visualizing social media corpora, and (vi) Bias in social media IE systems.

Scholars in the Information Retrieval community who work with social media text can benefit from recent machine learning advances in information extraction and retrieval in this domain, especially from knowledge of how such text differs from regular newswire text. This tutorial will help them learn state-of-the-art methods for processing social media text, which can help them improve their information retrieval systems for social media text. They will also learn how social media text has a social context, which can be included as part of the analysis.

1.3 Presenter Bios

Shubhanshu Mishra, Twitter, Inc. Shubhanshu Mishra is a Machine Learning Researcher at Twitter. He earned his Ph.D. in Information Sciences from the University of Illinois at Urbana-Champaign in 2020. His thesis was titled “Information extraction from digital social trace data: applications in social media and scholarly data analysis”. His current work is at the intersection of machine learning, information extraction, social network analysis, and visualizations. His research has led to the development of open source information extraction tools for large scale social media and scholarly data. He completed his Integrated Bachelor’s and Master’s degree in Mathematics and Computing at the Indian Institute of Technology, Kharagpur in 2012.

Rezvaneh (Shadi) Rezapour, Department of Information Science at Drexel’s College of Computing and Informatics, USA. Shadi is an Assistant Professor in the Department of Information Science at Drexel’s College of Computing and Informatics. Her research interests lie at the intersection of Computational Social Science and Natural Language Processing (NLP). More specifically, she is interested in bringing computational models and social science theories together to analyze texts and better understand and explain real-world behaviors, attitudes, and cultures. Her research goal is to develop “socially-aware” NLP models that bring social and cultural contexts into the analysis of (human) language, in order to better capture attributes such as social identities, stances, morals, and power from language, and to understand real-world communication. Shadi completed her Ph.D. in Information Sciences at the University of Illinois at Urbana-Champaign (UIUC), where she was advised by Dr. Jana Diesner.

Jana Diesner, The iSchool at University of Illinois Urbana-Champaign, USA. Jana is an Associate Professor at the School of Information Sciences (the iSchool) at the University of Illinois at Urbana-Champaign, where she leads the Social Computing Lab. Her research in social computing and human-centered data science combines methods from natural language processing, social network analysis, and machine learning with theories from the social sciences to advance knowledge and discovery about interaction-based and information-based systems. Jana earned her PhD (2012) in Societal Computing from the School of Computer Science at Carnegie Mellon University.

2 Tutorial Details

  • Duration of the tutorial: 1 day (full day)

  • Interaction Style: Hands-on-tutorial with live coding session.

  • Target audience: We expect participants to be familiar with Python programming and with social media platforms like Twitter and Facebook.

Setup and Introduction (1 h) (i) Introducing the differences between social media data and newswire and academic data, (ii) Digital Social Trace Data abstraction for social media data, (iii) Introduction to information extraction tasks for social media data, e.g., sequence tagging (named entity recognition, part of speech tagging, chunking, and super-sense tagging), and text classification (sentiment prediction, sarcasm detection, and abusive content detection).
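
To make the sequence tagging tasks listed above more concrete, the following minimal sketch converts token-level labels into entity spans. It assumes the widely used BIO labeling scheme, which the tools covered in the tutorial may or may not use; the function name `bio_to_spans` and the example sentence are hypothetical.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity text, label) pairs.

    Assumes tags of the form "B-X", "I-X", or "O". An "I-X" tag that does
    not continue an open span of the same label is treated as outside.
    """
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close any span that is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # Outside token (or ill-formed tag): close the open span.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Made-up example sentence and tags.
example_tokens = ["Floods", "hit", "New", "York", "today"]
example_tags = ["O", "O", "B-LOC", "I-LOC", "O"]
example_spans = bio_to_spans(example_tokens, example_tags)
```

The same decoding step applies to chunking and super-sense tagging when their labels are encoded in the same scheme.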

Applications of information extraction (1 h) (i) Indexing social media corpora in a database, (ii) Network construction from text corpora, (iii) Visualizing temporal trends in social media corpora using social communication temporal graphs, (iv) Aggregating text-based signals at the user level, (v) Improving text classification using user-level attributes, (vi) Analyzing social debate using sentiment and political identity signals, (vii) Detecting and prioritizing needs during crisis events (e.g., COVID-19), (viii) Mining and analyzing public opinion related to COVID-19, and (ix) Detecting COVID-19 misinformation in videos on YouTube.
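
Two of the applications above, network construction from text and user-level aggregation of text-based signals, can be sketched with the standard library alone. This is an illustrative sketch rather than the tutorial's implementation: the `@mention` pattern is a simplification, the function names are hypothetical, and the sentiment scores in the sample data are made up.

```python
import re
from collections import Counter, defaultdict

# Simplified pattern for @mentions; real platforms have stricter rules.
MENTION_PATTERN = re.compile(r"@(\w+)")


def build_mention_edges(posts):
    """Count directed (author, mentioned user) edges across all posts."""
    edges = Counter()
    for author, text, _score in posts:
        for mentioned in MENTION_PATTERN.findall(text):
            edges[(author, mentioned)] += 1
    return edges


def mean_score_by_user(posts):
    """Average a per-post score (e.g., sentiment) for each author."""
    totals = defaultdict(float)
    counts = Counter()
    for author, _text, score in posts:
        totals[author] += score
        counts[author] += 1
    return {user: totals[user] / counts[user] for user in totals}


# Made-up posts as (author, text, hypothetical sentiment score) tuples.
sample_posts = [
    ("alice", "Thanks @bob and @carol!", 0.8),
    ("alice", "@bob again", 0.4),
    ("bob", "no mentions here", -0.5),
]
mention_edges = build_mention_edges(sample_posts)
user_scores = mean_score_by_user(sample_posts)
```

The weighted edge counts can be loaded into a graph library for ranking accounts, and the per-user averages can be joined with other user-level attributes.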

Collecting and distributing social media data (30 min) (i) Overview of available annotated tweet datasets, (ii) Respecting API terms and user privacy considerations when collecting and sharing social media data, (iii) Demo on collecting data from a few social media APIs, such as Twitter and Reddit.

Break 30 min

Improving IE on social media data via Machine Learning (2 h 30 min) (i) Semi-supervised learning for Twitter NER, (ii) Multi-task learning for social media IE, (iii) Active learning for annotating social media data for text classification via SAIL (another version, pySAIL, to be released soon), (iv) Fine-tuning transformer models for monolingual and multi-lingual social media NLP tasks, (v) Biases in social media NER, (vi) Utilizing social context for improving NLP models.
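
The active learning item above can be illustrated with uncertainty sampling, one common query strategy in which the examples the current model is least sure about are sent to a human annotator first. This sketch does not describe how SAIL selects examples; the function names and the predicted probabilities below are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def select_for_annotation(predictions, batch_size):
    """Return the texts whose predictions have the highest entropy.

    `predictions` is a list of (text, class probabilities) pairs produced
    by some classifier; higher entropy means a less certain prediction.
    """
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [text for text, _probs in ranked[:batch_size]]


# Made-up model outputs for three unlabeled posts.
sample_predictions = [
    ("clearly positive post", [0.9, 0.1]),
    ("ambiguous post", [0.5, 0.5]),
    ("somewhat unclear post", [0.7, 0.3]),
]
to_annotate = select_for_annotation(sample_predictions, batch_size=2)
```

After the selected posts are labeled, the classifier is retrained and the loop repeats, which is what makes the annotation process human-in-the-loop.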

Conclusion and future directions (10 min) (i) Open questions in social media IE, (ii) Tutorial feedback and additional questions.