Abstract
Over the last decade, the health informatics community has made many efforts to develop systems that can automatically recognize and predict disclosures on social media. However, most of these efforts have focused on simple topic prediction or sentiment classification. Taboo disclosures, which concern topics that people are not comfortable discussing with their friends, represent an abstract theme that depends on context and background. Recent research has demonstrated the efficacy of injecting concept information into the learning model to improve prediction. We present a vectorization scheme that combines corpus- and lexicon-based approaches for predicting taboo topics from anonymous social media datasets. The proposed vectorization scheme exploits two context-rich lexicons, LIWC and Urban Dictionary. Our methodology achieves cross-validation accuracies of up to 78.1% for the supervised learning task on the Facebook Confessions dataset, and 70.5% for the transfer learning task on the YikYak dataset. For both tasks, supervised algorithms trained with features generated by the proposed vectorizer perform better than those trained on a vanilla tf-idf representation. This work presents a novel methodology for predicting taboos from anonymous emotional disclosures on confession boards.
1 Introduction
Social media websites have become popular for discussing uncomfortable topics and seeking support [1]. However, identifiable communication systems suffer from inhibited behavior because of privacy and reputation concerns [2]. Although anonymous forums provide a safe space for discussing mental health [3] and uncomfortable issues [4], anonymity has been associated with disinhibition because of freedom from accountability and self-presentation concerns [5]. Radcliffe et al. [6] suggest the importance of shared writing as a medium of emotional disclosure. Specifically, users have shown inhibition in discussing health concerns with their named identities on the internet [7, 8]. Anonymous spaces have also been characterized as hotbeds of negativity such as flaming [9] and cyberbullying [10]. Student newspapers at several colleges have complained about the presence of micro-aggressions [11] on Yik Yak [12, 13].
However, we find uncomfortable topics being discussed on anonymous forums. De Choudhury’s work [14] reveals disinhibition in the discussion of mental health topics on Reddit: anonymous users took part in more emotionally engaging communication than users with pseudonymous or named identities. The authors urge effective private interventions for people vulnerable to different types of mental illness. Our past work revealed that students asked questions about taboo and stigmatized topics in the partially anonymous environment of Facebook Confession Boards (FCBs) [15] with negligible negative responses. The majority of the posts either sought information from a local community (“Does anyone know if you can get checked for STDs at X Health Center? and is it expensive?”) or offered an observation or remark about the community (“I wish gay girls at LGBT parties were more approachable”).
The proposed work aims to create a novel supervised machine learning methodology that can learn and predict taboo topics from a highly contextual anonymous dataset by harnessing context-rich lexicons. It combines a lexicon-based approach, drawing on a psycho-social lexicon [16] and a crowd-sourced dictionary, with a corpus-based approach built from anonymous self-disclosure forums. The aim is to present a data-driven methodology for ascertaining written emotional disclosure in students by predicting taboos in confessions.
Multiple classification algorithms are evaluated on the proposed vectorization scheme, and their cross-validation accuracies are compared against those obtained with other vectorization schemes. The system is evaluated in two ways: (a) a comparison of machine learning algorithms trained on feature matrices from our proposed vectorization approach against the same algorithms trained on other representations of the FCB dataset, and (b) a transfer learning experiment on a dataset from YikYak, another anonymous social media platform. Our proposed methodology achieves cross-validation accuracies of up to 78.1% for the supervised learning task on the FCB dataset, and 70.5% for the transfer learning task on the YikYak dataset.
2 Background
The study of taboos in FCBs presents a unique combination of anonymity and locality in social media disclosures. In this section, background literature on the impact of anonymity, locality, and taboos on social media is presented. Furthermore, background on the two lexicons used in our system is discussed.
2.1 Anonymity and Self-disclosure
Discussing mental health is a stigmatized topic [17,18,19,20], and a user might find a downvote, and particularly a removal, to be a very negative response. Repeated negative feedback can actively discourage new users from staying in an online community (Everything2) [21]. In Everything2, some users do not participate actively and prefer being “observers” [22], yet they still form an essential part of the user base. The work of Wohn [23] and Lampe [24] demonstrates that negative feedback discourages new users from returning to the respective online communities (Everything2 and Slashdot). Both of these forums allow users to have pseudonymous identities, and user reputations on these forums are public, i.e., visible to other users. However, Birnholtz [15], in his 2015 work, found that a combination of anonymous and named identities led to prosocial interaction. Furthermore, an emerging body of work has attempted to understand the nuances of context in different forms of text-based disclosure. D’Errico et al. introduced the concept of acid communication [25], in which they explored negative social emotions such as irritation, disappointment, guilt, envy, contempt, and awe. This is distinct from emotion analysis across the five primary emotions (anger, happiness, disgust, sadness, and fear), as those are not the most common emotions present in social communication. In their 2016 work, Ofek et al. [26] exploited concept information to develop an unsupervised knowledge enrichment system for sentiment analysis. Such works have demonstrated the success of techniques that configure affective computing systems by harnessing concepts. Domain-specific lexicons perform better than domain-independent lexicons [27, 28] for sentiment analysis. Feldman et al. [29] determine the most appropriate set of questions to ask for health interventions. These works demonstrate how self-disclosure on online forums is connected to mental and emotional health.
However, most of these approaches are either limited to qualitative studies or unsupervised text mining tasks or sentiment prediction.
Anonymity has been seen to have a positive impact on self-disclosure. The SIDE model [30] in social psychology holds that members of a group form a group identity and conform to its norms. Thus deindividuation in an anonymous environment can lead to a more collective identity. Postmes et al. [31] found that anonymity in a group can promote normative behavior, and that normative processes can shape behavior in anonymous groups even though members of the community do not know each other. Sassenberg and Postmes [32] found that strategic and cognitive processes interact to produce social influence within the group based on the perception of society and self within it, and those due to the positioning of self vis-a-vis a group. Researchers have studied the impact of anonymity for many decades. Wildman [33] investigated the influence of anonymity on survey responses. De Choudhury et al.’s works [7, 14, 34, 35] hinted that dissociative anonymity creates an atmosphere of disinhibition in sharing mental health concerns and experiences of smoking and drinking abstinence on Reddit. Andalibi et al. [36] investigated social media disclosures of sexual abuse in their 2016 paper. In their 2004 work, Eysenbach et al. demonstrated that people connect with others in similar circumstances [37].
2.2 Locality
Locality has an impact on both named and anonymous social media. In particular, the condition of anonymity in a geographically local setting can be violated if specific individuals are identified [38]. Personal information can be accidentally revealed on locally anonymous apps such as YikYak [39], or specific individuals can be identified, which can result in cyberbullying attacks [10, 40].
From studies of location-based dating applications, it is known that location can affect the type of content users are willing to share online [38, 41]. Past studies about online interaction with nearby people have shown that people seek information about local topics [15], coordinate social encounters [42], or reach out for and provide help in crises [43].
In the recent past, resources for sharing information with, and asking questions of, members of local communities have become popular. Some anonymous communities such as Cyclopath [44] and EveryBlock [45] allocate persistent pseudo-anonymous identities. Another application, YikYak, allows members of offline communities, such as colleges or other campuses, to share anonymously with their colleagues or friends [15].
2.3 Taboos
Baxter et al. [46] define taboo topics as those that are “off limits” to one party or another in a social relationship, anticipating a negative outcome from such a discussion. Goodwin et al. [47] formulated catalogs of potential taboo topics in different cultures. Their work indicated that taboos can vary contextually, and they found that common taboo themes for a Western audience include family matters/details, hygiene, prejudice, and sexual topics. An elaborate labeling scheme for taboo topics based on social science literature [46, 47] was developed as part of our previous work [15]. Nine categories of taboos were found in the dataset: (1) death, (2) bodily functions, (3) sex, (4) illegal substances (e.g., drugs and other controlled substances), (5) protected social categories (such as gender, race, and sexual orientation), (6) finances, (7) physiological health, (8) mental health, and (9) academic performance.
2.4 Lexicons
In this paper, we harness two dictionaries: LIWC and Urban Dictionary.
LIWC is a well-recognized psycho-lingual, lexicon-based tool that analyzes text files on a word-by-word basis, counting words (unigrams) in psychologically meaningful categories using an internal dictionary of frequent words and word stems. During the 2008 US elections, LIWC was used [48] to analyze and distinguish the usage frequency of different words/categories by political candidates. The current English LIWC dictionary contains more than 4,500 words. It classifies words into many linguistic and psychological categories that capture social, cognitive, and affective processes. Each word has been classified or rated by experts on 64 word categories: 22 standard linguistic categories (e.g., pronouns, verbs, tenses), 32 psychological categories (e.g., affect, cognition, social, biological processes), 7 personal categories (e.g., work, home, leisure), and 3 paralinguistic dimensions (assents, fillers, nonfluencies). Each word in a text is matched against the dictionary, and the categories associated with the matching entry are extracted.
Urban Dictionary [49] (UD) is the largest source for slang and Internet terms, with over six million crowd-sourced definitions. In comparison, the Oxford English Dictionary has just over 250,000 entries [50]. Internet Linguistics [51, 52] is a relatively new field of research but has already shown signs of changing mainstream discourse. Urban Dictionary allows any user to submit a definition or description for a given word. It has outgrown its initial intent as a repository of slang and modern cultural references and become a full-grown dictionary. Its lexicon has also broadened to include words or phrases of any usage, rather than just slang. Quality control is imposed through up- and down-voting by users, which floats up popular and accepted definitions and rejects those that are not.
Both dictionaries provide useful context but are distinct from each other. LIWC was developed by psycholinguists who studied how people tend to use different words depending on their emotional state. In that context, it can be used as a vectorizer that creates numerical features from a body of text, with each category serving as one dimension. As UD provides a huge lexicon of words derived from popular culture, unlike other dictionaries such as Dictionary.com [53] and Merriam-Webster.com [54], it can be used to find related colloquial words for the most used words in each taboo category. This helps in synthetically creating a richer corpus from a relatively small amount of training data.
2.5 Text Mining Algorithms
We compare our proposed work with other popular and successful text mining approaches. Naive Bayes [55, 56] is a Bayesian classification algorithm that has demonstrated success for text classification using Bag of Words or tf-idf representations. LinearSVM [57] is another algorithm that is popular for text categorization, as it is relatively agnostic to the sparsity of the feature matrix. Random Forests [58] and Randomized Decision Trees [59] (also known as ExtraTrees) use an ensemble of decision trees to make a decision and are among the most successful traditional machine learning algorithms. LSA [60] is a technique in natural language processing for analyzing and comparing concepts across a set of documents. It is also used for dimensionality reduction, generating a dense matrix by Singular Value Decomposition of a sparse Bag of Words or tf-idf representation. Embedding schemes such as GloVe [61] and Word2Vec [62], which consider the co-occurrence of different words, have demonstrated state-of-the-art performance for most machine learning tasks in the recent past. GloVe is an unsupervised learning algorithm that obtains word embeddings from word co-occurrence statistics; pre-trained vectors are available for corpora containing Wikipedia, Twitter, and a collection of webpages. Word2Vec is another embedding scheme, which utilizes a shallow two-layer neural network to learn word embeddings from the contexts in which words occur in an unlabeled corpus. Word2Vec has two flavors: Continuous Bag of Words [62] and Skip-gram [63]. LSTMs are supervised recurrent neural networks that incorporate long-term word dependencies.
Wikarsa et al. [64] developed a system using the Naive Bayes algorithm to predict six primary emotions: happiness, sadness, anger, disgust, fear, and surprise. Lupan et al. [65] developed an emotional state monitoring system called Emo2, which uses Latent Semantic Analysis to quantify emotions induced by news articles. Herzig et al. [66] used a word embedding approach on five datasets for emotion detection across different domains and saw significant improvements over traditional methods. LSTMs are the current state of the art for many emotional text mining problems. Schoene et al. [67] used a type of LSTM to classify suicide notes. Su et al. [68] used an LSTM network to predict across seven emotional classes (anger, boredom, disgust, anxiety, happiness, sadness, and surprise) and found large improvements over other predictive methods. Chancellor et al. [69] provide a detailed critical review of the predictive techniques for mental health status on social media.
Further, transfer learning is a form of machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem [70]. Although most machine learning systems are designed to address single tasks, transfer learning can accelerate learning across different but related problems. For instance, knowledge gained while learning to recognize automobiles could apply when trying to recognize trucks. Furthermore, as most real-world social media mining applications involve a data stream and not a static data source, the distribution of the data is not known a priori. Hence, evaluating a proposed model using transfer learning on a similar but different dataset enhances confidence in the generalizability of the model.
3 Dataset
We describe the data collection, metadata information and annotation process for the two datasets in this section.
3.1 Data Collection
FCBs are Facebook groups targeted at offline communities such as universities [71], high schools, and workplaces. FCBs allow posting via an external web form such as SurveyMonkey, to which anybody can anonymously submit content that is later re-posted to the corresponding FCB by the moderator. However, commenters on the FCBs are identified by their Facebook profiles. For our study, we use FCBs from top universities and liberal arts colleges (based on US News & World Report [72, 73]). The student population of the schools for which FCBs were chosen ranged from 1000 to 45,000 students, with the volume of posts varying between 100 and 20,000 posts. No correlation was found between post volume and college size. A Timeline Scraper API was developed that harnessed the Python-based Facebook Graph API [74] to download timeline information for the confession boards.
YikYak is an anonymous mobile social media app that combines GPS with instant messaging, allowing users to anonymously post a message called a “yak”. A yak can have a maximum size of 200 characters and is visible to other nearby users within a variable radius of 1.5–5 miles (depending on user density), which makes it well suited for college campuses [39, 77]. Anyone can post, vote, or comment on content within the limits of this zone, but users outside the radius can only view it. With its geo-locality, anonymity, and ephemerality of posts, YikYak provides a reciprocal data source worthy of future investigation. Open source code from GitHub [76], written in Python, was used for collecting yaks. For consistency and to avoid lexical differences due to location, the same set of universities was used as for Facebook Confessions.
Although both FCBs and YikYak are confession forums, they differ in many ways. FCBs are globally visible: one can view them from any part of the world. YikYak is only visible locally (the dataset was collected by synthetically updating the location to be proximal to university campuses). This distinction leads to the hyper-local nature of yaks compared to FCB confessions. FCBs are moderated. In fact, not only are posts dropped by moderators in some cases, but campus moderators can also inform the school authorities about posts containing threats [78]. YikYak is not moderated, or is in a sense auto-moderated, as posts get downvoted by other users. However, controversial posts that might have been taken down by moderators on FCBs can become popular on YikYak. Also, FCB posts are permanent unless the moderator proactively takes down old posts or the page is taken down. Yaks are ephemeral and vanish after a while. Most importantly, the limited length of yaks leads to more abbreviations and hashtags compared to Facebook Confessions, where there is no character limit. These differences make it interesting to study both confession forums.
3.2 Metadata
The text, date, and number of likes and comments were extracted for each confession post. As the posts were anonymous, no other demographic data could be collected. There was no difference between labeled and unlabeled posts in post length and comment volume. There was a small difference in the number of likes, but it was not statistically significant at the 0.05 level. The comments were not annotated, as the number of comments per post was not very high.
Similarly, in the YikYak dataset (p< 0.5), there was no statistically significant difference in the metadata information between the labeled and unlabeled yak data. 1000 yaks were randomly chosen for labeling ensuring all the universities were represented. Table 1 gives a description of metadata information of the datasets.
Individual or university identifiers were removed, and any examples with identifying details are avoided, as it is critical for researchers to consider user privacy and the possibility of inadvertent identification even when the dataset is public.
3.3 Annotation Process
The annotation process for labeling taboos was non-trivial and time-consuming, as it required an in-depth understanding of the taboo literature. It was hence important to focus on quality and do in-house training rather than use Amazon Mechanical Turk [79, 80]. The annotators were undergraduate students in the social sciences. An annotation scheme used in our past work [15], which in turn was based on past literature on taboos, was implemented. There are nine taboo categories: protected categories, death/dying, academics, illegal substances, physiological health, mental health, personal financial situation, bodily function, and sex. Each post is assigned to no more than one taboo category, denoted by class labels 1 through 9. If a post does not contain a taboo, it is labeled 0. To build an understanding of the dataset, a group of 3 annotators labeled 700 posts, and an agreement of more than 80% across all the taboo categories was achieved. The goal of this phase was to attune the annotators to the labeling scheme and ensure consistency. The 700 posts from this initial phase were discarded, and a new set of 4000 posts was labeled, of which 1000 were labeled by all the annotators (agreement > 93%) and the remaining 3000 were split between the annotators.
When a post could fall under two or more categories, the most pertinent category was selected. It is to be noted that the topic of the post content can be different from the taboo topic mentioned in the post. Table 2 presents a description of each taboo category, together with its relative percentage with respect to taboo posts and an example. Table 3 delineates example posts in which the general topic of the post was distinctly different from the taboo. A table cataloging examples of taboos, labeled by the annotators for the YikYak dataset, is provided.
The annotators found that about 30% of all the posts were taboo-related, i.e., belonged to the 9 taboo categories, and the remaining 70% belonged to non-taboo categories. This is not unexpected, as we do not expect the majority of disclosures to be taboo. The following are some examples that we did not consider taboos, although a simple semantic topic analysis of the posts might tag them with that taboo. This is because the disclosure does not discuss something uncomfortable.
1. Sexual: “I’ve made it a goal to hook up with (almost) every girl from a certain sorority.” Hooking up is not necessarily synonymous with sex.
2. Academics: “I can’t stand this school, and I’m sick of trying to.” (i) The mention of “this school” is not related to performance, and hence the post is not tagged as academics; (ii) the student seems to want to leave the school and is not suicidal, and hence the post is not tagged as mental health.
4 Method
In this section, the various steps involved in the proposed taboo categorization system - text cleaning, oversampling of minority classes, vectorization and classification are described. Figure 1 depicts the flow diagram of the system.
4.1 Data Cleaning
Data cleaning [81] is a preliminary and integral step in social text mining, as text data from social media is highly unstructured and noisy in nature. Furthermore, the annotators observed that FCB data contained more noise in the form of bad grammar or typos compared to generic Facebook pages. This is expected, as previous research has indicated that users of anonymous environments are less concerned with self-presentation than users of identified spaces [82].
Furthermore, posts that only included URLs were removed. A preliminary text analysis showed that posts that commenced with URLs contained spam or generic information. It is to be noted that slang was not removed during the text cleaning phase, as the context carried by slang words is harnessed in the proposed approach.
The TextBlob API [83] was used for grammar and typo correction.
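The URL filtering step described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the regular expression `URL_PATTERN` and the helper names are assumptions introduced here, and the grammar and typo correction step is deliberately left out.

```python
import re

# Assumed pattern: a URL is anything that starts with http(s):// or www.
# and runs up to the next whitespace character.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def is_url_only(post: str) -> bool:
    """Return True when nothing but whitespace remains after removing URLs."""
    return URL_PATTERN.sub("", post).strip() == ""


def drop_url_only_posts(posts):
    """Keep posts that contain text besides URLs; slang is left untouched."""
    return [p for p in posts if not is_url_only(p)]


posts = [
    "http://example.com/a",
    "idk what to do lol",
    "see www.example.com for info",
]
print(drop_url_only_posts(posts))
```

Note that, as written, empty posts are dropped as well, since nothing remains once URLs are stripped from them.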
4.2 Oversampling of Minority Classes
The taboo posts formed a small percentage (30%) of the entire dataset, and many of the taboo categories formed less than 1% of the labeled corpus. To make sure the taboo posts were a representative sample of the whole set of FCB posts, posts containing taboos were oversampled. Oversampling is a common procedure in spam detection, as spam emails are a small subset of all emails. The imbalance in the labeled dataset is compensated for by applying the Synthetic Minority Oversampling Technique (SMOTE) [84]. In this approach, artificial training samples are generated for a minority class by interpolating between a training sample of that class and its k nearest neighbors from the same class. The minority class (or classes) is thus oversampled using these artificial training samples. Random oversampling techniques were also investigated, but SMOTE delivered better performance. Different degrees of oversampling, i.e., the number of times a minority class sample is oversampled, were investigated. The best trade-off between performance and over-fitting was found at an oversampling of 100%, meaning that, on average, each taboo post is oversampled once.
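To make the interpolation idea behind SMOTE concrete, a minimal pure-Python sketch is given below. It is not the implementation used in the experiments: the function names, the Euclidean distance, the default `k`, and the reading of "100% oversampling" as one synthetic sample per minority sample (`n_per_sample=1`) are all assumptions made for illustration.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote(minority, k=5, n_per_sample=1, seed=0):
    """Return synthetic points for one minority class.

    Each synthetic point lies on the line segment between a minority
    sample and one of its k nearest neighbours from the same class.
    """
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: euclidean(x, m))[:k]
        if not neighbours:
            continue  # a single sample has nothing to interpolate with
        for _ in range(n_per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(smote(minority, k=2))
```

Because the new points are interpolated rather than copied, they add variety within the region already occupied by the minority class, which is the usual argument for preferring SMOTE over plain duplication.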
4.3 Vectorization and Classification
For text categorization, the usual tf-idf representation of a document is a feature vector over terms, in which each term t is paired with a term weight w. The document is thus made up of pairs <t, w>, where the term identifies a coordinate of the feature space and the weight gives the value of the document along that coordinate. Every document d is thereby mapped to the target space as a feature vector. For the term frequency, the simplest choice is the raw frequency of a term in a document, i.e., the number of times that term t occurs in document d. The inverse document frequency is a measure of how much information a term provides, that is, whether the term is common or rare across all the documents. Mathematically, it is the logarithmically scaled inverse fraction of the documents that contain the term: we obtain it by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of the quotient. The tf-idf matrix representation is obtained by taking the product of the term frequency and the inverse document frequency.
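The definition above can be worked through in a few lines of Python. The sketch below follows the textbook form described in this section (raw term counts multiplied by the logarithm of the inverse document fraction); the function name and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    # Inverse document frequency: log(total documents / documents with the term).
    idf = {t: math.log(n_docs / df[t]) for t in df}
    # Weight = raw term frequency in the document times idf of the term.
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in docs
    ]


docs = [["exam", "stress", "exam"], ["stress", "party"]]
for weights in tfidf(docs):
    print(weights)
```

In this toy corpus "stress" occurs in every document, so its idf is log(1) = 0 and it carries no weight, whereas "exam" occurs twice in one document only and receives the weight 2 log 2. Library implementations often apply smoothing and normalization on top of this basic form, so their numbers would generally differ.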
The LIWC [85] and Urban Dictionary lexicons were used to enrich the vector space and introduce more context into the model. For Urban Dictionary, a Python-based scraper was designed to extract related words for the top 20 words of each taboo category, as determined from the results of the tf-idf vectorized classification model. Figure 2 presents a snapshot of the related words returned by the Urban Dictionary website for the search term “mental illness”. It is to be noted that the entire Urban Dictionary is not used as the lexicon. The primary motivation for using Urban Dictionary is that the confessions corpus contains slang and modern cultural references that can be harnessed and incorporated into the model. The LIWC text analysis tool provides 64 semantic categories. For both the Urban Dictionary and LIWC lexicons, a vector of token counts based on the number of occurrences is constructed. The Urban Dictionary-based matrix is composed of count vectors for each of the nine taboo categories, and the LIWC-based matrix is composed of count vectors for each of the 64 LIWC categories. A catalog of words extracted from Urban Dictionary and words for selected LIWC categories is presented. The BeautifulSoup [86] and requests [87] libraries were employed for extracting the related words from the Urban Dictionary website.
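The lexicon-based count vectors described above can be illustrated with a short sketch. Everything specific in it is an assumption: the two-category `toy_lexicon` and its words are invented for this example and are not taken from Urban Dictionary or LIWC, and whole-token matching is used even though LIWC also matches word stems.

```python
def lexicon_counts(tokens, lexicon, categories):
    """Count, per category, how many tokens appear in that category's word set."""
    return [sum(1 for t in tokens if t in lexicon[c]) for c in categories]


# Hypothetical lexicon: category name -> set of related words.
toy_lexicon = {
    "mental health": {"depressed", "anxiety", "therapy"},
    "illegal substances": {"weed", "stoned", "molly"},
}
categories = ["mental health", "illegal substances"]

post = "i got stoned and now my anxiety is back"
print(lexicon_counts(post.split(), toy_lexicon, categories))
```

With the nine taboo categories as lexicon keys this would yield a 9-dimensional count vector per post, and with the 64 LIWC categories a 64-dimensional one, matching the shapes described in the text.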
The vectorizer is constructed by first creating a sparse tf-idf representation of the corpus. An LSA transformation is then performed to convert the matrix to a dense representation, using dimensionality reduction via singular value decomposition. The resultant dense matrix is stacked with the Urban Dictionary (UD) and LIWC-based feature matrices. Different combinations of stacking the vectorizer matrices were explored. Table 5 provides a comparison of the cross-validation accuracies for each of the combinations. It must be emphasized that in this work, stacking refers to feature stacking, which combines distinct sets of features from multiple sources. Feature stacking is distinct from model stacking, which involves stacking multiple models for performing supervised classification.
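Feature stacking in the sense used here amounts to a column-wise concatenation of feature blocks that describe the same documents. The sketch below shows only that final step, under stated assumptions: `stack_features` is a hypothetical helper, the LSA rows are made-up numbers standing in for the output of the dimensionality reduction, and two LSA components are used purely to keep the example small.

```python
def stack_features(*blocks):
    """Horizontally concatenate feature blocks that describe the same documents."""
    n_rows = len(blocks[0])
    if any(len(b) != n_rows for b in blocks):
        raise ValueError("every block needs one row per document")
    return [[v for b in blocks for v in b[i]] for i in range(n_rows)]


# Two documents: 2 placeholder LSA components, 9 UD taboo-category counts,
# and 64 LIWC category counts per document.
lsa = [[0.12, -0.40], [0.33, 0.05]]
ud = [[0] * 9, [1] * 9]
liwc = [[0] * 64, [2] * 64]

stacked = stack_features(lsa, ud, liwc)
print(len(stacked), len(stacked[0]))
```

With k LSA components the stacked matrix has k + 9 + 64 columns per document, and dropping a block from the call gives one of the alternative stacking combinations.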
The Scikit-Learn [88] library was used for feature engineering, dimensionality reduction, and supervised machine learning tasks. The Gensim [89] library was employed for generating word embeddings. The Keras [90] wrapper with a TensorFlow [91] backend was used for the benchmarking experiments with LSTM.
4.4 Transfer Learning
Transfer learning can help us harness the context learned from a source dataset for a task on a destination dataset. This is critical for anonymous social media in particular, as any one anonymous confession platform can get shut down or lose popularity. We observed a churn of users from FCBs to YikYak; then, after YikYak closed down [92], there was a churn to Whisper [93] and Reddit confession forums, and now FCBs are regaining popularity. Due to the ephemerality of these platforms, it would ordinarily be necessary to regenerate training data for each new forum, and obtaining high-quality annotated data can be logistically expensive. One of the motivations of this work was to demonstrate that a dictionary-based approach built from the corpus of one technology medium can work on another medium.
Typically in transfer learning, the source dataset provides more overall thematic context, which the destination dataset may not be able to provide. The destination dataset, on the other hand, provides more specific context. Given that FCBs do not have character or word limits, we believe that we can gain more contextual information from the FCB dataset than from the shorter yak posts, which are restricted to 200 characters. Running the transfer learning experiments from FCB (source) to YikYak (target), instead of training a combined model, helps us validate and evaluate the efficacy and generalizability of the dictionary-based approach. There is a second reason for choosing FCBs as the source dataset: FCBs are still active forums, while YikYak has been retired. FCBs have been around for almost a decade, and while some individual university pages have stopped generating content or have been closed, new FCB pages have started.
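The evaluation protocol described above, in which the model is fitted on the source platform only and then scored on the target platform, can be sketched as follows. The `TokenOverlapClassifier` is a deliberately trivial, hypothetical stand-in for the classifiers used in this work, and the example posts and label names are invented; only the shape of the protocol is the point.

```python
from collections import Counter, defaultdict


class TokenOverlapClassifier:
    """Toy stand-in: scores a post by its token overlap with each class."""

    def fit(self, texts, labels):
        self.profiles = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.profiles[label].update(text.lower().split())
        return self

    def predict(self, texts):
        predictions = []
        for text in texts:
            tokens = text.lower().split()
            scores = {
                label: sum(profile[t] for t in tokens)
                for label, profile in self.profiles.items()
            }
            # Ties are broken by label order to keep the result deterministic.
            predictions.append(max(sorted(scores), key=scores.get))
        return predictions


def transfer_accuracy(model, source, target):
    """Fit on the source (texts, labels) only, then score on the target."""
    model.fit(*source)
    predictions = model.predict(target[0])
    return sum(p == y for p, y in zip(predictions, target[1])) / len(target[1])


source = (["i feel so depressed lately", "anyone going to the game tonight"],
          ["mental_health", "none"])
target = (["so depressed rn", "game tonight?"], ["mental_health", "none"])
print(transfer_accuracy(TokenOverlapClassifier(), source, target))
```

No target labels are seen during fitting, so the resulting score reflects how well what was learned on the source forum carries over to the target forum.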
5 Results and Discussion
In this section, we present the experimental results on the FCB and YikYak datasets using the proposed approach and compare them with other approaches, including state-of-the-art techniques such as LSTM and embeddings. Further, we discuss the rationale behind the superior performance of our proposed approach over other techniques for this problem.
5.1 Experimental Results
Table 4 presents the comparison of cross-validation accuracy for the proposed stacked vectorizer across different machine learning algorithms about other text vectorization schemes that have proved to be successful for various text mining tasks. Extensive grid search across hyper-parameters and different combinations of stopword lists and n-gram range were performed for all the machine learning algorithms until the best cross-validation performance was achieved. For the LSTM classifier, various combinations of loss functions, batch sizes, and dropout were explored.
The prediction accuracy using RandomForests and ExtraTrees algorithms with the proposed vectorization scheme on the FCB dataset surpasses the accuracy using LinearSVM on a vanilla tf − idf representation (statistically significant, p < 0.01). Figures 3 and 4 present the confusion matrices for classification using the vanilla tf − idf representation and our proposed vectorization scheme, respectively. Figure 5 illustrates the confusion matrix for the predicted labels after cross-validation.
Different feature stacking combinations for the proposed vectorizer are explored, and the results of the comparison are presented in Table 5. The prediction accuracy using RandomForests and ExtraTrees algorithms with the proposed vectorization scheme on the YikYak dataset surpasses the accuracy using LinearSVM on a vanilla tf − idf representation (statistically significant, p < 0.05). The accuracy for the transfer learning task is lower across all the algorithms compared to the supervised task. We believe there are two primary reasons for this. First, the YikYak dataset has a different distribution of taboo categories compared to FCBs. Second, the annotation schema used for labeling was primarily developed for categorizing taboos in FCBs.
5.2 Discussion
In this work, we attempt to build a supervised learning approach to predict taboo topics by harnessing psycho-lingual and crowd-sourced dictionaries. The proposed vectorization approach was compared against other vectorization schemes, namely Bag of Words, tf − idf, LSA, GloVe, and Word2Vec. Although the accuracy using vanilla tf − idf was lower than that of the proposed stacked vectorizer, it was much better than that of the other vectorization approaches. This was not unexpected, as Word2Vec models perform well on much larger datasets, and the GloVe word embeddings, trained on a corpus of Wikipedia and Twitter data, have a different distribution of content and semantic information compared to the FCB dataset.
For the vanilla tf − idf based model, the best performance was achieved using LinearSVM [57]. This can be attributed to SVMs [97] being universal learners, as support vectors can be considered independent of the dimensionality of the feature space. Hence, SVMs can learn from sparse feature matrices originating from Bag of Words or tf − idf representations. For the proposed stacked vectorizer, the best performance was achieved using Extra Trees and Random Forests classifiers. Both algorithms utilize an ensemble of decision trees, which allows them to reduce the classification bias.
Although LSTMs do not perform on par with RandomForest or ExtraTrees for the FCB dataset, they perform better than the other algorithms. It can be anticipated that training on a larger labeled corpus would lead to better cross-validation accuracy. The lower accuracy on the transfer learning task on the YikYak dataset is understandable given that this dataset is even smaller.
One interesting observation from this study was the superior performance of RandomForest and ExtraTrees compared to LinearSVMs, which usually perform best amongst traditional machine learning algorithms for text categorization tasks. This can be attributed to the reduced dimension of the feature matrix when using the proposed vectorizer compared to vanilla tf − idf.
A comparison of the confusion matrices for the vanilla tf − idf representation (Fig. 3) and the proposed stacked vectorizer (Fig. 4) demonstrates the success of introducing context via the use of lexicons. The tf − idf representation is better at categorizing texts that do not contain any taboo, which may be due to a bias in the classifier towards the majority class, which denotes no taboo. All the taboo categories are minority classes. However, as a result of both the dimensionality reduction of the tf − idf matrix and its combination with the feature representation from the lexicons, the bias is reduced using our proposed vectorizer.
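To make the majority-class effect discussed above concrete, the following minimal sketch builds a confusion matrix and per-class recall from scratch. The labels (`none`, `taboo_a`, `taboo_b`) and the predictions are invented for illustration and do not correspond to the data behind Figs. 3 and 4.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def per_class_recall(matrix, labels):
    """Fraction of each true class that was predicted correctly."""
    return {label: (row[i] / sum(row) if sum(row) else 0.0)
            for i, (label, row) in enumerate(zip(labels, matrix))}

# Hypothetical labels: "none" is the majority class, the taboo classes are minorities.
labels = ["none", "taboo_a", "taboo_b"]
y_true = ["none"] * 6 + ["taboo_a"] * 2 + ["taboo_b"] * 2
# A classifier biased towards the majority class predicts "none" almost everywhere.
y_pred = ["none"] * 6 + ["none", "taboo_a"] + ["none", "none"]

matrix = confusion_matrix(y_true, y_pred, labels)
recall = per_class_recall(matrix, labels)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
```

In this toy case the overall accuracy is 0.7 while recall on the two minority classes is 0.5 and 0.0, which is why the off-diagonal cells of the confusion matrix, rather than accuracy alone, are needed to see a bias towards the no-taboo class.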
The novel vectorization scheme proposed in our study illustrates the scope of concept-driven supervised learning models to predict abstract topics such as taboos from a social media corpus. Understanding context is even more important for supervised learning from a small dataset. The application of deep neural networks to text categorization tasks has suggested a reduced need for feature engineering and reduction. However, the caveat with deep neural network-based models such as LSTMs or convolutional neural networks is that they usually necessitate a large labeled dataset. Thus, for smaller datasets, an explicit understanding of the dataset domain and subsequent feature engineering can produce better prediction accuracy. Table 5 shows that including both corpus- and lexicon-based information helps enrich the prediction models and yields higher accuracy than corpus-only or lexicon-only feature representations.
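The idea of stacking corpus- and lexicon-based features can be sketched as a simple column-wise concatenation. This is an illustration under stated assumptions, not the vectorizer evaluated in Table 5: `TOY_LEXICON` is a hypothetical two-category word list standing in for LIWC and Urban Dictionary features, and the dimensionality reduction of the tf − idf matrix is omitted for brevity.

```python
import math
from collections import Counter

# Hypothetical stand-in lexicon; real LIWC categories and word lists differ.
TOY_LEXICON = {
    "negemo": {"hate", "sad", "regret"},
    "social": {"friend", "roommate", "family"},
}

def tokenize(text):
    return text.lower().split()

def fit_tfidf(docs):
    """Corpus-based part: learn a sorted vocabulary and smoothed idf weights."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    vocab = sorted(df)
    idf = [math.log((1 + n) / (1 + df[t])) + 1 for t in vocab]
    return vocab, idf

def tfidf_row(doc, vocab, idf):
    tf = Counter(tokenize(doc))
    return [tf[t] * w for t, w in zip(vocab, idf)]

def lexicon_row(doc, lexicon):
    """Lexicon-based part: share of tokens falling in each category."""
    tokens = tokenize(doc)
    total = len(tokens) or 1
    return [sum(t in words for t in tokens) / total
            for _, words in sorted(lexicon.items())]

def stacked_row(doc, vocab, idf, lexicon):
    """Concatenate corpus-based and lexicon-based features for one post."""
    return tfidf_row(doc, vocab, idf) + lexicon_row(doc, lexicon)

docs = ["i hate my roommate", "my friend is sad"]
vocab, idf = fit_tfidf(docs)
features = [stacked_row(d, vocab, idf, TOY_LEXICON) for d in docs]
```

Each row has one column per vocabulary term followed by one column per lexicon category, so any classifier that accepts a dense feature matrix can be trained on `features` directly. Because the lexicon columns are proportions while the tf − idf columns are unnormalised here, a real pipeline would likely scale the two blocks before stacking.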
6 Conclusions and Future Work
This work proposes a methodology for predicting taboo topics from social media disclosures by synthesizing a corpus-based approach with crowd-sourced and psycho-lingual lexicons. The psychological text analysis tool LIWC and the crowd-sourced Urban Dictionary are combined with tf − idf vectorization for supervised learning of taboos from anonymous social media datasets. The proposed approach, which stacks feature matrices extracted from corpus- and lexicon-based approaches, delivers higher prediction accuracy than learning from corpus-based or lexicon-based approaches alone. The proposed methodology achieves cross-validation accuracies of up to 78.1% on the supervised learning task on the FCB dataset and 70.5% on the transfer learning task on the YikYak dataset. With this ensemble methodology, abstract concepts or themes (in this case, taboo) can be identified. The relative success of transfer learning on the YikYak dataset suggests that the approach can generalize to supervised learning of abstract themes from self-disclosure texts.
An effective active learning system can lower the expense of annotation by selecting the samples that would be essential for improving classification accuracy. Furthermore, we plan to release this work in the future as a web-based application and API, where a client can submit a social media post or an unlabeled corpus, respectively, as a request and obtain a prediction with a confidence score for each taboo category. The success of ensemble decision tree based algorithms in reducing bias in the classification results motivates exploring the combination of multiple learning models using boosting and bagging [98]. Although word2vec did not yield satisfactory results on the FCB and YikYak datasets, future exploration of paragraph vectors [99] may overcome the loss of semantic information while learning from a dataset of texts of varying lengths. We would urge researchers to investigate other ways of combining corpus- and lexicon-based approaches, including combining embedding-based approaches with lexicons.
References
Montagni I, Parizot I, Horgan A, Gonzalez-Caballero J-L, Almenara-Barrios J, Lagares-Franco C, Peralta-Sáez J-L, Chauvin P, Amaddeo F (2016) Spanish students’ use of the internet for mental health information and support seeking. Health Inform J 22(2):333–354
Morris MR, Teevan J, Panovich K (2010) What do people ask their social networks, and why?: a survey study of status message q&a behavior. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 1739–1748
Quesada-Arencibia A, Pérez-Brito E, García-rodríguez CR, Pérez-Brito A (2018) An ehealth information technology platform to help the treatment of mental disorders. Health Inform J 24(4):337–355
Jones RB, Ashurst EJ (2013) Online anonymous discussion between service users and health professionals to ascertain stakeholder concerns in using e-health services in mental health. Health Inform J 19(4):281–299
Suler J (2004) The online disinhibition effect. Cyberpsychol Behav 7(3):321–326
Radcliffe AM, Lumley MA, Kendall J, Stevenson JK, Beltran J (2007) Written emotional disclosure: testing whether social disclosure matters. J Soc Clin Psychol 26(3):362–384
Choudhury MD, Morris MR, White RW (2014) Seeking and sharing health information online: comparing search engines and social media. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 1365–1376
Newman MW, Lauterbach D, Munson SA, Resnick P, Morris ME (2011) It’s not that I don’t have problems, I’m just not putting them on facebook: challenges and opportunities in using online social networks for health. In: Proceedings of the ACM 2011 conference on computer supported cooperative work. ACM, pp 341–350
O’sullivan PB, Flanagin AJ (2003) Reconceptualizing ‘flaming’ and other problematic messages. New Media & Society 5(1):69–94
Whittaker E, Kowalski RM (2015) Cyberbullying via social media. J Sch Violence 14(1):11–29
Sue DW, Capodilupo CM, Torino GC, Bucceri JM, Holder A, Nadal KL, Esquilin M (2007) Racial microaggressions in everyday life: implications for clinical practice. American Psychologist 62(4):271
Yik yak perpetuates culture of intolerance—the emory wheel. http://emorywheel.com/yik-yak-perpetuates-culture-of-intolerance/. Accessed 15 Apr 2017
The daily Northwestern : Hayes: Yik yak unveils social problems. http://dailynorthwestern.com/2014/05/14/opinion/hayes-yik-yak-unveils-social-problems/. Accessed 15 Apr 2017
De Choudhury M, De S (2014) Mental health discourse on reddit: self-disclosure, social support, and anonymity. In: ICWSM. Citeseer
Birnholtz J, Merola NAR, Paul A (2015) Is it weird to still be a virgin: anonymous, locally targeted questions on facebook confession boards. In: Proceedings of the 33rd annual ACM conference on human factors in computing systems. ACM, pp 2613–2622
Erikson EH (1982) Major stages in psychosocial development. The life cycle completed: a review, pp 55–82
Corrigan P (2004) How stigma interferes with mental health care. American Psychologist 59(7):614
O’Neill S, Bond RR, Grigorash A, Ramsey C, Armour C, Mulvenna MD (2019) Data analytics of call log data to identify caller behaviour patterns from a mental health and well-being helpline. Health Inform J 25(4):1722–1738
Clarke K, Rooksby J, Rouncefield M (2007) ‘You’ve got to take them seriously’: meeting information needs in mental healthcare. Health Inform J 13(1):37–45
Ruzic L, Sanford JA (2018) Needs assessment—health applications for people aging with multiple sclerosis. Journal of Healthcare Informatics Research 2(1-2):71–98
Sarkar C, Wohn DY, Lampe C (2012) Predicting length of membership in online community everything2 using feedback. In: Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work Companion. ACM, pp 207–210
Velasquez A, Wash R, Lampe C, Bjornrud T (2014) Latent users in an online user-generated content community. Computer Supported Cooperative Work (CSCW) 23(1):21–50
Wohn D, Velasquez A, Bjornrud T, Lampe C (2012) Habit as an explanation of participation in an online peer-production community. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 2905–2914
Lampe C, Johnston E (2005) Follow the (slash) dot: effects of feedback on new members in an online community. In: Proceedings of the 2005 international ACM SIGGROUP conference on supporting group work. ACM, pp 11–20
D’Errico F, Poggi I (2014) Acidity. the hidden face of conflictual and stressful situations. Cognitive Computation 6(4):661–676
Ofek N, Poria S, Rokach L, Cambria E, Hussain A, Shabtai A (2016) Unsupervised commonsense knowledge enrichment for domain- specific sentiment analysis. Cognitive Computation 8(3):467–477
Khan FH, Qamar U, Bashir S (2016) Multi-objective model selection (MOMS)-based semi-supervised framework for sentiment analysis. Cognitive Computation 8(4):614–628
Khan FH, Bashir S, Qamar U (2014) Tom: twitter opinion mining framework using hybrid classification scheme. Decis Support Syst 57:245–257
Feldman K, Kotoulas S, Chawla NV (2018) Tiqs: targeted iterative question selection for health interventions. Journal of Healthcare Informatics Research 2(3):205–227
Reicher SD, Spears R, Postmes T (1995) A social identity model of deindividuation phenomena. Eur Rev Soc Psychol 6(1):161–198
Postmes T, Spears R, Sakhel K, Groot DD (2001) Social influence in computer-mediated communication: the effects of anonymity on group behavior. Personal Soc Psychol Bull 27(10):1243–1254
Sassenberg K, Postmes T (2002) Cognitive and strategic processes in small groups: effects of anonymity of the self and anonymity of the group on social influence. Br J Soc Psychol 41(3):463–480
Wildman RC (1977) Effects of anonymity and social setting on survey responses. Public Opin Q 41(1):74–79
De Choudhury M (2013) Role of social media in tackling challenges in mental health. In: Proceedings of the 2nd international workshop on socially-aware multimedia. ACM, pp 49–52
Tamersoy A, De Choudhury M, Chau DH (2015) Characterizing smoking and drinking abstinence from social media. In: Proceedings of the 26th ACM conference on hypertext & social media. ACM, pp 139–148
Andalibi N, Haimson OL, Choudhury MD, Forte A (2016) Understanding social media disclosures of sexual abuse through the lenses of support seeking and anonymity. In: Proceedings of the 2016 CHI conference on human factors in computing systems. ACM, pp 3906–3918
Eysenbach G, Powell J, Englesakis M, Rizo C, Stern A (2004) Health related virtual communities and electronic support groups: systematic review of the effects of online peer to peer interactions. Bmj 328(7449):1166
Blackwell C, Birnholtz J, Abbott C (2015) Seeing and being seen: co-situation and impression formation using grindr, a location-aware gay dating app. New Media & Society 17(7):1117–1136
Yik yak - find your herd. https://www.yikyak.com
Binns A (2013) Facebook’s ugly sisters: anonymity and abuse on formspring and ask. fm. Media Education Research Journal 4:27–42
Birnholtz J, Fitzpatrick C, Handel M, Brubaker JR (2014) Identity, identification and identifiability: the language of self-presentation on a location-based mobile dating app. In: Proceedings of the 16th international conference on human-computer interaction with mobile devices & services. ACM, pp 3–12
Sutko DM, de Souza e Silva A (2011) Location-aware mobile media and urban sociability. New Media & Society 13(5):807–823
Vieweg S, Hughes AL, Starbird K, Palen L (2010) Microblogging during two natural hazards events: what twitter may contribute to situational awareness. In: Proceedings of the SIGCHI conference on human factors in computing systems. ACM, pp 1079–1088
Cyclopath. http://cyclopath.org. Accessed 27 Sept 2016
Everyblock. http://www.everyblock.com. Accessed 27 Sept 2016
Baxter LA, Wilmot WW (1985) Taboo topics in close relationships. Journal of Social and Personal Relationships 2(3):253–269
Goodwin R, Lee I (1994) Taboo topics among chinese and english friends a cross-cultural comparison. J Cross-Cult Psychol 25(3):325–338
Lanning K, Maruyama G (2010) The social psychology of the 2008 us presidential election. Analyses of Social Issues and Public Policy 10(1):171–181
Urban Dictionary (2013) Urban Dictionary llc. San Francisco, available at www.urbandictionary.com/define.php
McLeese N (2015) How selfie got into the dictionary: an examination of internet linguistics and language change online
Crystal D (2011) Internet linguistics: a student guide. Routledge
Jucker AH, Dürscheid C (2012) The linguistics of keyboard-to-screen communication: a new terminological framework. Linguistik Online 56 (6/12):1–26. European University Viadrina
Dictionary.com—meanings and definitions of words at dictionary.com. http://www.dictionary.com. Accessed 23 Sept 2016
Dictionary and thesaurus—merriam-webster. http://www.merriam-webster.com. Accessed 23 Sept 2016
Zhang H, Li D (2007) Naive bayes text classifier. In: IEEE international conference on granular computing, 2007. GRC 2007. IEEE, pp 708–708
McCallum A, Nigam K et al (1998) A comparison of event models for naive bayes text classification. In: AAAI-98 workshop on learning for text categorization, vol 752. Madison, WI, pp 41–48
Joachims T (1998) Text categorization with support vector machines: Learning with many relevant features. In: European conference on machine learning. Springer, pp 137–142
Liaw A, Wiener M (2002) Classification and regression by randomforest. R News 2(3):18–22
Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3–42
Deerwester SC, Dumais ST, Furnas GW, Harshman RA, Landauer TK, Lochbaum KE, Streeter LA Computer information retrieval using latent semantic structure. June 13 1989. US Patent 4,839,853
Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: EMNLP, vol 14, pp 1532–1543
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781
Wikarsa L, Thahir SN (2015) A text mining application of emotion classifications of twitter’s users using naive bayes method. In: 2015 1st international conference on wireless and telematics (ICWT). IEEE, pp 1–6
Lupan D, Dascălu M, Trăusan-Matu S, Dessus P (2012) Analyzing emotional states induced by news articles with latent semantic analysis. In: International conference on artificial intelligence: methodology, systems, and applications. Springer, pp 59–68
Herzig J, Shmueli-Scheuer M, Konopnicki D (2017) Emotion detection from text via ensemble classification using word embeddings. In: Proceedings of the ACM SIGIR international conference on theory of information retrieval, pp 269–272
Schoene AM, Lacey G, Turner AP, Dethlefs N (2019) Dilated lstm with attention for classification of suicide notes. In: Proceedings of the tenth international workshop on health text mining and information analysis (LOUHI 2019), pp 136–145
Su M-H, Wu C-H, Huang K-Y, Hong Q-B (2018) Lstm-based text emotion recognition using semantic emotional word vectors. In: 2018 first Asian conference on affective computing and intelligent interaction (ACII Asia). IEEE, pp 1–6
Chancellor S, Choudhury MD (2020) Methods in predictive techniques for mental health status on social media: a critical review. NPJ Digital Medicine 3 (1):1–11
Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345–1359
Mit confessions. https://www.facebook.com/beaverconfessions
National university rankings—top national universities—us news best colleges. http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-universities. Accessed 27 Sept 2016
National liberal arts college rankings—top liberal arts colleges—us news best colleges. http://colleges.usnews.rankingsandreviews.com/best-colleges/rankings/national-liberal-arts-colleges. Accessed 27 Sept 2016
Weaver J, Tarjan P (2013) Facebook linked data via the graph api. Semantic Web 4(3):245–250
Timeline scraper - dashboard - facebook for developers. https://developers.facebook.com/apps/463500207102372/dashboard/. Accessed 23 June 2017
Groom B (2015) Pyak. https://github.com/bradengroom/pyak
Nemelka CL, Ballard CL, Liu K, Xue M, Ross KW (2015) You can yak but you can’t hide. In: Proceedings of the 2015 ACM on conference on online social networks. ACM, pp 99–99
Kadvany E (2020) Anonymous confessions pages are surging in popularity on high school and college campuses why?
Amazon mechanical turk - welcome. https://www.mturk.com/mturk/welcome. Accessed 12 Oct 2016
Buhrmester M, Kwang T, Gosling SD (2011) Amazon’s mechanical turk a new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science 6(1):3–5
Rahm E, Do HH (2000) Data cleaning: problems and current approaches. IEEE Data Eng Bull 23(4):3–13
Krämer NC, Winter S (2008) Impression management 2.0: the relationship of self-esteem, extraversion, self-efficacy, and self-presentation within social networking sites. Journal of Media Psychology 20(3):106–116
Loria S (2018) textblob documentation. Release 0.15.2
Chawla NV, Bowyer KW, Hall LO, Philip Kegelmeyer W (2002) Smote: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357
Pennebaker JW, Booth RJ, Francis ME (2007) Liwc2007: linguistic inquiry and word count. Austin, Texas: liwc.net
Richardson L (2007) Beautiful soup documentation. April
Requests: Http for humans — requests 2.18.1 documentation. http://docs.python-requests.org/en/master/
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12(Oct):2825–2830
Rehurek R, Sojka P (2011) Gensim–python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University, Brno, Czech Republic
Chollet F et al (2015) Keras. GitHub, https://github.com/fchollet/keras
TensorFlow Team (2015) Tensorflow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org
Williams G, Mahmoud A (2018) Modeling user concerns in the app store: a case study on the rise and fall of yik yak. In: 2018 IEEE 26th international requirements engineering conference (RE). IEEE, pp 64–75
Grunkemeyer RA (2016) 10. whisper–an effective use of anonymous persuasion?
Harris ZS (1954) Distributional structure. Word 10(2-3):146–162
Luhn HP (1957) A statistical approach to mechanized encoding and searching of literary information. IBM Journal of Research and Development 1 (4):309–317
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computation 9(8):1735–1780
Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory. ACM, pp 144–152
Ross Quinlan J et al (1996) Bagging, boosting, and c4. 5. In: AAAI/IAAI, vol 1, pp 725–730
Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of the 31st international conference on machine learning (ICML-14), pp 1188–1196
Funding
This work is supported in part by the following grants: NSF award CCF-1409601; DOE awards DE-SC0014330, DE-SC0019358.
Ethics declarations
Conflict of Interest
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Paul, A., Liao, Wk., Choudhary, A. et al. Harnessing Psycho-lingual and Crowd-Sourced Dictionaries for Predicting Taboos in Written Emotional Disclosure in Anonymous Confession Boards. J Healthc Inform Res 5, 319–341 (2021). https://doi.org/10.1007/s41666-021-00092-w
DOI: https://doi.org/10.1007/s41666-021-00092-w