1 Introduction

Social media websites have become popular venues for discussing uncomfortable topics and seeking support [1]. However, identifiable communication systems inhibit such behavior because of privacy and reputation concerns [2]; in particular, users have shown reluctance to discuss health concerns under their named identities on the internet [7, 8]. Although anonymous forums provide a safe space for discussing mental health [3] and uncomfortable issues [4], anonymity has also been associated with disinhibition owing to freedom from accountability and self-presentation concerns [5], and anonymous spaces have been characterized as hotbeds of negativity such as flaming [9] and cyberbullying [10]. Radcliffe et al. [6] suggest the importance of shared writing as a medium of emotional disclosure. Student newspapers across different colleges have complained about the presence of micro-aggressions [11] on Yik Yak [12, 13].

Nevertheless, uncomfortable topics are actively discussed on anonymous forums. De Choudhury's work [14] reveals disinhibition in the discussion of mental health topics on Reddit, shows that anonymous users take part in more emotionally engaging communication than users with pseudonymous or named identities, and urges effective private interventions for people vulnerable to different types of mental illness. Our past work revealed that students ask questions about taboo and stigma topics in the partially anonymous environment of Facebook Confession Boards (FCBs) [15], with negligible negative responses. The majority of the posts either sought information from a local community ("Does anyone know if you can get checked for STDs at X Health Center? and is it expensive?") or offered an observation or remark about the community ("I wish gay girls at LGBT parties were more approachable").

The proposed work creates a novel supervised machine learning methodology that learns and predicts taboo topics from a highly contextual anonymous dataset by harnessing context-rich lexicons. Specifically, it synthesizes a psycho-social [16] and crowd-sourced lexicon-based approach with a corpus-based approach to classify posts from anonymous self-disclosure forums. The aim is a data-driven methodology for ascertaining written emotional disclosure in students by predicting taboos in confessions.

Multiple classification algorithms are evaluated on the proposed vectorization scheme, and the cross-validation accuracies are compared against those of other vectorization schemes. The system is evaluated in two ways: (a) a comparative analysis of machine learning algorithms on feature matrices from our proposed vectorization approach against other approaches on the FCB dataset, and (b) a transfer learning experiment on a dataset from YikYak, another anonymous social media platform. Our proposed methodology achieves cross-validation accuracies of up to 78.1% for the supervised learning task on the FCB dataset and 70.5% for the transfer learning task on the YikYak dataset.

2 Background

The study of taboos in FCBs presents a unique combination of anonymity and locality in social media disclosures. This section presents background literature on the impact of anonymity, locality, and taboos on social media, followed by a discussion of the two lexicons used in our system.

2.1 Anonymity and Self-disclosure

Discussing mental health is a stigma topic [17,18,19,20], and a user might perceive a downvote, and particularly a removal, as a very negative response. Repetitive negative feedback can actively discourage new users from staying in an online community (Everything2) [21]. In Everything2, some users do not participate actively but prefer being "observers" [22], yet they still form an essential part of the user base. The works of Wohn [23] and Lampe [24] demonstrate that negative feedback discourages new users from returning to their respective online communities (Everything2 and Slashdot). Both of these forums allow users to have pseudonymous identities, and user reputations on these forums are public, i.e., visible to other users. However, Birnholtz [15] in his 2015 work found that a combination of anonymous and named identities led to prosocial interaction. Furthermore, an emerging body of work has attempted to understand the nuances of context in different forms of text-based disclosure. D'Errico et al. introduced the concept of acid communication [25], exploring negative social emotions such as irritation, disappointment, guilt, envy, contempt, and awe; this is distinct from emotion analysis across the five primary emotions of anger, happiness, disgust, sadness, and fear, which are not the most common emotions present in social communication. In their 2016 work, Ofek et al. [26] exploited concept information to develop an unsupervised knowledge enrichment system for sentiment analysis. Such works demonstrate the success of affective computing systems configured by harnessing concepts, and domain-specific lexicons outperform domain-independent lexicons [27, 28] for sentiment analysis. Feldman et al. [29] determine the most appropriate set of questions to ask for health interventions. These works demonstrate how self-disclosure on online forums is connected to mental and emotional health. However, most of these approaches are limited to qualitative studies, unsupervised text mining tasks, or sentiment prediction.

Anonymity has been seen to have a positive impact on self-disclosure. The SIDE model [30] in social psychology describes how members of a group form a group identity and conform to its norms; deindividuation in an anonymous environment can thus lead to a more collective identity. Postmes et al. [31] found that anonymity in a group can promote normative behavior, with normative processes shaping behavior even though members of the community do not know each other. Sassenberg and Postmes [32] found that strategic and cognitive processes interact to produce social influence within a group, based on the perception of society and of the self positioned vis-a-vis the group. Researchers have studied the impact of anonymity for many decades: Wildman [33] investigated the influence of anonymity on survey responses. De Choudhury et al.'s works [7, 14, 34, 35] hinted that dissociative anonymity creates an atmosphere of disinhibition on Reddit for sharing about mental health concerns and abstinence from smoking and drinking. Andalibi et al. [36] investigated social media disclosures of sexual abuse in their 2016 paper, and in their 2004 work, Eysenbach et al. demonstrated that people connect with others in similar circumstances [37].

2.2 Locality

Locality has an impact on both named and anonymous social media. In particular, the condition of anonymity in a geographically local setting can be violated if specific individuals are identified [38]. Personal information can be accidentally revealed on locally anonymous apps such as YikYak [39], or specific individuals can be identified, which can result in cyberbullying attacks [10, 40].

From studies of location-based dating applications, it is known that location can affect the type of content users are willing to share online [38, 41]. Past studies about online interaction with nearby people have shown that people seek information about local topics [15], coordinate social encounters [42], or reach out for and provide help in crises [43].

In the recent past, resources for sharing information with, and asking questions to, members of local communities have become popular. Some anonymous communities, such as Cyclopath [44] and EveryBlock [45], allocate persistent pseudonymous identities. Another application, YikYak, allows members of offline communities, such as colleges and other campuses, to anonymously share with their colleagues or friends [15].

2.3 Taboos

Baxter et al. [46] define taboo topics as those that are "off limits" to one party or another in a social relationship, anticipating a negative outcome from such a discussion. Goodwin et al. [47] formulated catalogs of potential taboo topics in different cultures. Their work indicated that taboos can vary contextually, and they found that common taboo themes for a Western audience include family matters/details, hygiene, prejudice, and sexual topics. An elaborate labeling scheme for taboo topics based on the social science literature [46, 47] was developed as part of our previous work [15]. Nine categories of taboos originate in the dataset: (1) death, (2) bodily functions, (3) sex, (4) illegal substances (e.g., drugs and other controlled substances), (5) protected social categories (such as gender, race, and sexual orientation), (6) finances, (7) physiological health, (8) mental health, and (9) academic performance.

2.4 Lexicons

In this paper, we harness two dictionaries: LIWC and Urban Dictionary.

LIWC is a well-recognized psycho-lingual, lexicon-based tool that counts words (unigrams) in psychologically meaningful categories, analyzing text files on a word-by-word basis using an internal dictionary of frequent words and word stems. During the 2008 US elections, LIWC was used [48] to analyze and distinguish the usage frequency of different words/categories by political candidates. The current English LIWC dictionary contains more than 4,500 words, classified into linguistic and psychological categories that capture social, cognitive, and affective processes. Each word has been classified or rated by experts on 64 word categories: 22 standard linguistic categories (e.g., pronouns, verb tenses), 32 psychological categories (e.g., affect, cognition, social and biological processes), 7 personal-concern categories (e.g., work, home, leisure), and 3 paralinguistic dimensions (assents, fillers, nonfluencies). Each word in a text is tallied against the dictionary, and the associated term characteristics are extracted.
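As a minimal illustration of this word-by-word tallying, the sketch below uses a toy two-category dictionary as a stand-in for the proprietary LIWC dictionary; the "*" suffix mirrors LIWC's stem-matching convention.

```python
from collections import Counter

# Toy stand-in for the proprietary LIWC dictionary: each category maps to
# member words or stems (a '*' suffix denotes a stem match).
TOY_LIWC = {
    "affect": ["happy", "sad", "worr*"],
    "social": ["friend", "talk*", "roommate"],
}

def liwc_counts(text, dictionary=TOY_LIWC):
    """Tally each token against every category; return one count per category."""
    tokens = text.lower().split()
    counts = Counter()
    for category, entries in dictionary.items():
        for token in tokens:
            for entry in entries:
                if entry.endswith("*"):
                    if token.startswith(entry[:-1]):
                        counts[category] += 1
                elif token == entry:
                    counts[category] += 1
    return [counts[c] for c in dictionary]  # one dimension per category

print(liwc_counts("I talked to my roommate because I was worried"))  # [1, 2]
```

Each category becomes one dimension of the resulting feature vector, which is how LIWC serves as a vectorizer later in this work.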

Urban Dictionary (UD) [49] is the largest source of slang and Internet terms, with over six million crowd-sourced definitions; in comparison, the Oxford English Dictionary has just over 250,000 entries [50]. Internet linguistics [51, 52] is a relatively new field of research but has already shown signs of changing mainstream discourse. Urban Dictionary allows any user to submit a definition or description for a given word. It has outgrown its initial intent as a repository of slang and modern cultural references into a full-grown dictionary, and its lexicon has broadened to include words or phrases of any usage rather than just slang. Quality control is imposed through up- and down-voting by users, which floats popular and accepted definitions to the top and rejects those that are not.

Both dictionaries provide useful context but are distinct from each other. LIWC was developed by psycholinguists who studied how people tend to use different words based on their emotional state; in that context, it can be used as a vectorizer by creating numerical features from a body of text, with each category serving as a dimension. Since UD provides a huge lexicon of words derived from popular culture, unlike dictionaries such as Dictionary.com [53] and Merriam-Webster.com [54], it can be used to find colloquial words related to the most used words in each taboo category. This helps in synthetically creating a richer corpus from relatively small training data.

2.5 Text Mining Algorithms

We compare our proposed work with other popular and successful text mining approaches. Naive Bayes [55, 56] is a Bayesian classification algorithm that has demonstrated success for text classification using Bag of Words or tf-idf representations. LinearSVM [57] is another algorithm popular for text categorization, as it is relatively agnostic to the sparsity of the feature matrix. Random Forests [58] and Randomized Decision Trees [59] (also known as ExtraTrees) use an ensemble of decision trees to make a decision and are among the most successful traditional machine learning algorithms. LSA [60] is a natural language processing technique for analyzing and comparing concepts across a set of documents; it is also used for dimensionality reduction, generating a dense matrix by applying Singular Value Decomposition to a sparse Bag of Words or tf-idf representation. Embedding schemes such as GloVe [61] and Word2Vec [62], which consider the co-occurrence of different words, have demonstrated state-of-the-art performance on many machine learning tasks in the recent past. GloVe is an unsupervised learning algorithm that derives word embeddings from co-occurrence statistics, with pre-trained vectors available from corpora of Wikipedia, Twitter, and web crawl data. Word2Vec is another embedding scheme that trains a shallow two-layer neural network to predict words from their contexts in an unlabeled corpus; it comes in two flavors, Continuous Bag of Words [62] and Skipgrams [63]. LSTMs are supervised recurrent neural networks that capture long-term word dependencies.

Wikarsa et al. [64] developed a system using the Naive Bayes algorithm to predict six primary emotions: happiness, sadness, anger, disgust, fear, and surprise. Lupan et al. [65] developed Emo2, an emotional state monitoring system using Latent Semantic Analysis to quantify emotions induced by news articles. Herzig et al. [66] used a word embedding approach on five datasets for emotion detection across different domains and saw significant improvements over traditional methods. LSTMs are the current state of the art for many emotional text mining problems: Schoene et al. [67] used a type of LSTM to classify suicide notes, and Su et al. [68] used an LSTM network to predict across seven emotional classes (anger, boredom, disgust, anxiety, happiness, sadness, and surprise), finding large improvements over other predictive methods. Chancellor et al. [69] provide a detailed critical review of predictive techniques for mental health status on social media.

Further, transfer learning is a form of machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem [70]. Although most machine learning systems are designed to address single tasks, transfer learning can accelerate learning across different but related problems. For instance, knowledge gained while learning to recognize automobiles could apply when trying to recognize trucks. Furthermore, as most real-world social media mining applications involve a data stream and not a static data source, the distribution of the data is not known a priori. Hence, evaluating a proposed model via transfer learning on a similar but different dataset enhances confidence in the model's generalizability.

3 Dataset

We describe the data collection, metadata information and annotation process for the two datasets in this section.

3.1 Data Collection

FCBs are Facebook groups targeted at offline communities such as universities [71], high schools, and workplaces. FCBs accept posts via an external web form (e.g., SurveyMonkey) to which anybody can anonymously submit content, which a moderator later re-posts to the corresponding FCB. Commenters on the FCBs, however, are identified by their Facebook profiles. For our study, we use FCBs from top universities and liberal arts colleges (based on US News & World Report [72, 73]). The student populations of the schools whose FCBs were chosen ranged from 1,000 to 45,000 students, with post volumes varying between 100 and 20,000 posts; no correlation was found between post volume and college size. A Timeline Scraper API harnessing the Python-based Facebook Graph API [74] was developed to download timeline information for the confession boards.
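As a sketch of the kind of paginated feed request such a scraper issues, consider the snippet below; the page ID and token are placeholders, and the Graph API version and the fields exposed for public pages have changed over time, so this is illustrative rather than a drop-in implementation.

```python
import requests

PAGE_ID = "examplecollege.confessions"  # hypothetical FCB page ID
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"      # placeholder token

def fetch_timeline(page_id, token):
    """Page through a Facebook timeline, yielding one post dict at a time."""
    url = f"https://graph.facebook.com/v2.8/{page_id}/feed"
    params = {
        "access_token": token,
        "fields": "message,created_time,likes.summary(true),comments.summary(true)",
    }
    while url:
        payload = requests.get(url, params=params).json()
        for post in payload.get("data", []):
            yield post
        url = payload.get("paging", {}).get("next")  # follow the pagination cursor
        params = {}  # the 'next' URL already carries the query parameters

for post in fetch_timeline(PAGE_ID, ACCESS_TOKEN):
    print(post.get("created_time"), post.get("message", "")[:80])
```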

YikYak is an anonymous mobile social media app that combines GPS with instant messaging, allowing users to anonymously post a message called a "yak". A yak has a maximum length of 200 characters and is visible to other nearby users within a variable radius of 1.5-5 miles (depending on user density), which makes the app well suited for college campuses [39, 77]. Anyone within this zone can post, vote, or comment on content, while users outside the radius have view-only privileges. With its geo-locality, anonymity, and ephemerality of posts, YikYak provides a complementary data source worthy of investigation. An open-source GitHub tool [76] written in Python was used for collecting yaks. For consistency, and to avoid lexical differences due to location, the same set of universities was used as for the Facebook Confession Boards.

Although both FCBs and YikYak are confession forums, they differ in many ways. FCBs are globally visible: one can view them from any part of the world, whereas YikYak is only visible locally (our dataset was collected by synthetically updating the device location to be proximal to university campuses). This distinction leads to the hyper-local nature of yaks compared to FCB confessions. FCBs are moderated; not only are posts dropped by moderators in some cases, but campus moderators can also inform school authorities about posts containing threats [78]. YikYak is not moderated, or is in a sense auto-moderated, as posts that accumulate enough downvotes are automatically removed. However, controversial posts that would have been taken down by FCB moderators can become popular on YikYak. Also, FCB posts are permanent unless the moderator proactively takes down old posts or the page is taken down, while yaks are ephemeral and vanish after a while. Most importantly, the limited length of yaks leads to more abbreviations and hashtags compared to Facebook Confessions, which have no character limit. These differences make it interesting to study both confession forums.

3.2 Metadata

The text, date, and numbers of likes and comments were extracted for each confession post. As the posts were anonymous, no other demographic data could be collected. There was no difference between labeled and unlabeled posts in post length or comment volume. There was a small difference in the number of likes, but it was not statistically significant at the p < 0.05 level. The comments were not annotated, as the number of comments per post was not very high.

Similarly, in the YikYak dataset, there was no statistically significant difference (at the p < 0.05 level) in the metadata between the labeled and unlabeled yaks. 1000 yaks were randomly chosen for labeling, ensuring all the universities were represented. Table 1 gives a description of the metadata of the datasets.
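A comparison of this kind can be sketched as follows, assuming a two-sample Welch t-test on hypothetical like counts; the specific test used is not prescribed above, so the choice here is illustrative.

```python
from scipy import stats

# Hypothetical like counts for labeled vs. unlabeled posts.
likes_labeled = [4, 7, 2, 9, 5, 3, 8]
likes_unlabeled = [5, 6, 3, 7, 4, 6, 2]

# Welch's t-test does not assume equal variances between the two groups.
t_stat, p_value = stats.ttest_ind(likes_labeled, likes_unlabeled, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value >= 0.05:
    print("No statistically significant difference at the 0.05 level")
```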

Table 1 Description of the FCB and YikYak datasets

Individual and university identifiers were removed, and any examples with identifying details are avoided, as it is critical for researchers to consider user privacy and the possibility of inadvertent identification even when the dataset is public.

3.3 Annotation Process

The annotation process for labeling taboos was non-trivial and time-consuming, as it required an in-depth understanding of the taboo literature. It was hence important to focus on quality and conduct in-house training rather than use Amazon Mechanical Turk [79, 80]. The annotators were undergraduate students in the social sciences. An annotation scheme from our past work [15], in turn based on the taboo literature, was implemented. There are nine taboo categories - protected categories, death/dying, academics, illegal substances, physiological health, mental health, personal financial situation, bodily functions, and sex - with each post assigned to no more than one taboo category, denoted by class labels 1 through 9; a post containing no taboo is labeled 0. To understand the dataset, a group of 3 annotators first labeled 700 posts, achieving agreement of more than 80% across all the taboo categories. The goal of this phase was to attune the annotators to the labeling scheme and ensure consistency. The 700 posts from this initial phase were discarded, and a new set of 4000 posts was labeled, of which 1000 were labeled by all the annotators (agreement > 93%) and the remaining 3000 were split between the annotators.

In the event of contention between two or more categories, the most pertinent category was selected. Note that the general topic of a post can differ from the taboo topic mentioned in it. Table 2 describes each taboo category with its relative percentage among taboo posts and an example, while Table 3 delineates example posts in which the general topic was distinctly different from the taboo. A table cataloging examples of taboos labeled by the annotators for the YikYak dataset is also provided.
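Pairwise agreement of the kind reported above can be computed as in the sketch below; the labels are hypothetical, and Cohen's kappa is included as a chance-corrected complement to the raw percentage agreement used in this work.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels (0 = no taboo, 1-9 = taboo categories) assigned by the
# three annotators to the same set of posts.
annotator_labels = {
    "A": [0, 3, 3, 0, 8, 0, 5],
    "B": [0, 3, 3, 0, 8, 0, 0],
    "C": [0, 3, 6, 0, 8, 0, 5],
}

for (name1, l1), (name2, l2) in combinations(annotator_labels.items(), 2):
    agreement = sum(a == b for a, b in zip(l1, l2)) / len(l1)  # raw agreement
    kappa = cohen_kappa_score(l1, l2)                          # chance-corrected
    print(f"{name1} vs {name2}: agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```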

Table 2 30% of all the posts were taboo-related
Table 3 Examples from each category where the taboo topic was distinct from the general topic of the post

The annotators found that about 30% of all the posts were taboo-related, i.e., belonged to the 9 taboo categories, and the remaining 70% belonged to non-taboo categories. This is as expected, since we do not anticipate the majority of disclosures to be taboo. Following are some examples that we did not consider taboos, although a simple semantic topic analysis of the posts might tag them as such, because the disclosure is not discussing something uncomfortable.

  1. Sexual: "I’ve made it a goal to hook up with (almost) every girl from a certain sorority" - hooking up is not necessarily synonymous with sex.

  2. Academics: "I can’t stand this school, and I’m sick of trying to." - (i) the mention of "this school" is not related to academic performance, so the post is not tagged as academics, and (ii) the student appears to want to leave the school rather than being suicidal, so the post is not tagged as mental health.

4 Method

In this section, the various steps of the proposed taboo categorization system - text cleaning, oversampling of minority classes, vectorization, and classification - are described. Figure 1 depicts the flow diagram of the system.

Fig. 1 Flow diagram of the entire text mining system

4.1 Data Cleaning

Data cleaning [81] is a preliminary and integral step in social text mining, as text data from social media is highly unstructured and noisy. Furthermore, the annotators observed that FCB data contained more noise, in the form of bad grammar and typos, than generic Facebook pages. This is expected, as previous research has indicated that users of anonymous environments are less concerned with self-presentation than those in identified spaces [82].

Furthermore, posts that only included URLs were removed: a preliminary text analysis showed that posts commencing with URLs contained spam or generic information. Note that slang was deliberately not removed during the text cleaning phase, as the proposed approach harnesses the context carried by slang words.

The TextBlob API [83] was used for grammar and typo correction.
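A minimal cleaning sketch combining the URL filtering and TextBlob correction steps might look as follows; the filtering rule is simplified relative to the actual pipeline.

```python
import re
from textblob import TextBlob

URL_PATTERN = re.compile(r"https?://\S+")

def clean_post(text):
    """Drop URL-only posts; spell-correct the rest. Slang is intentionally
    left intact, since the lexicon stage harnesses its context."""
    stripped = URL_PATTERN.sub("", text).strip()
    if not stripped:  # the post contained nothing but URLs
        return None
    return str(TextBlob(stripped).correct())  # TextBlob's built-in correction

print(clean_post("I havv a confesion http://example.com"))  # corrected text, URL removed
```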

4.2 Oversampling of Minority Classes

The taboo posts formed a small percentage (30%) of the entire dataset, and many of the taboo categories formed less than 1% of the labeled corpus. To ensure the taboo posts were adequately represented relative to the whole set of FCB posts, posts containing taboos were oversampled. Oversampling is a common procedure in spam detection, as spam emails are a small subset of the universal set of all emails. The imbalance in the labeled dataset is compensated for by applying the Synthetic Minority Oversampling Technique (SMOTE) [84], in which synthetic samples are generated by interpolating between a minority-class training sample and its k nearest minority-class neighbors. The minority class (or classes) is thus oversampled with artificial training samples. Random oversampling was also investigated, but SMOTE delivered better performance. Different degrees of oversampling - the number of times a minority-class sample is oversampled - were investigated, and the best trade-off between performance and over-fitting was found at an oversampling of 100%: on average, each taboo post is repeated once.
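The sketch below shows SMOTE applied via the imbalanced-learn library on synthetic stand-in data; the parameters are illustrative rather than the values tuned in this work.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for the vectorized posts: a 9:1 majority/minority split.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

# Each synthetic sample is an interpolation between a minority-class point
# and one of its k nearest minority-class neighbors.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_res))
```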

4.3 Vectorization and Classification

For the problem of text categorization, the usual tf-idf representation encodes a document as a feature vector of term sequences, i.e., pairs <t, w> of a term t and weight w, where the terms express the post content and the weights are the values of the corresponding coordinates. Thus, every document d is mapped to the target space as a feature vector. For the term frequency, the simplest choice is the raw frequency of a term in a document: the number of times term t occurs in document d. The inverse document frequency measures how much information a term provides, that is, whether it is common or rare across all the documents. Mathematically, it is the logarithmically scaled inverse fraction of the documents that contain the word: the total number of documents is divided by the number of documents containing the term, and the logarithm of the quotient is taken. The tf-idf matrix is obtained by taking the product of the term frequency and the inverse document frequency.
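Expressed compactly, with D the corpus and N the total number of documents, this standard formulation reads:

```latex
\mathrm{tf\text{-}idf}(t, d) \;=\; \mathrm{tf}(t, d) \cdot \log \frac{N}{\left| \{\, d' \in D : t \in d' \,\} \right|}
```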

The LIWC [85] and Urban Dictionary lexicons were used to enrich the vector space and introduce more context into the model. For Urban Dictionary, a Python-based scraper was designed to extract related words for the top 20 words of each taboo category, based on the results of the tf-idf vectorized classification model. Figure 2 presents a snapshot of the related words returned by the Urban Dictionary website for the search term "mental illness". Note that the entire Urban Dictionary is not used as the lexicon; the primary motivation for using it is that the confessions corpus contains slang and modern cultural references that can be harnessed and incorporated into the model. The LIWC text analysis tool provides 64 semantic categories. For both the Urban Dictionary and LIWC lexicons, a vector of token counts based on the number of occurrences is constructed: the Urban Dictionary-based matrix is composed of count vectors for each of the nine taboo categories, and the LIWC-based matrix of count vectors for each of the 64 LIWC categories. A catalog of words extracted from Urban Dictionary and words for selected LIWC categories is provided. The BeautifulSoup [86] and requests [87] libraries were employed for extracting the related words from the Urban Dictionary website.
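The related-word extraction can be sketched as below; the CSS selector is hypothetical, as the site's markup changes frequently, so it must be verified against the live page before use.

```python
import requests
from bs4 import BeautifulSoup

def related_words(term):
    """Fetch the Urban Dictionary page for a term and scrape its related words."""
    page = requests.get(
        "https://www.urbandictionary.com/define.php",
        params={"term": term},
        headers={"User-Agent": "Mozilla/5.0"},
    )
    soup = BeautifulSoup(page.text, "html.parser")
    # "div.related a" is a hypothetical selector for the related-words panel.
    return [a.get_text(strip=True) for a in soup.select("div.related a")]

print(related_words("mental illness"))
```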

Fig. 2 Example from Urban Dictionary (courtesy: urbandictionary [49])

The vectorizer is constructed by first creating a sparse tf-idf representation of the corpus. An LSA transformation is then performed to convert the matrix to a dense representation via dimensionality reduction by singular value decomposition. The resultant dense matrix is stacked with the Urban Dictionary (UD) and LIWC-based feature matrices. Different combinations of stacking the vectorizer matrices were explored; Table 5 compares the cross-validation accuracies for each combination. It must be emphasized that in this work, stacking refers to feature stacking, which combines distinct sets of features from multiple sources, and is distinct from model stacking, which combines multiple models for supervised classification.
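A minimal sketch of this feature stacking pipeline in Scikit-Learn follows; the posts and lexicon word lists are illustrative stand-ins for the real corpus, the scraped Urban Dictionary terms per taboo category, and the LIWC category word lists, and far more SVD components are used in practice.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD

posts = ["i am so stressed about my gpa",
         "anyone know a cheap std clinic near campus",
         "free pizza in the quad today"]

# (1) Sparse tf-idf representation of the corpus.
X_tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(posts)

# (2) LSA: dense, low-rank representation via truncated SVD.
X_lsa = TruncatedSVD(n_components=2, random_state=42).fit_transform(X_tfidf)

# (3) Lexicon counts: one column per category, aggregating occurrences of
# that category's words (stand-ins for the UD and LIWC word lists).
def category_counts(docs, categories):
    cols = []
    for words in categories.values():
        counts = CountVectorizer(vocabulary=words).fit_transform(docs)
        cols.append(np.asarray(counts.sum(axis=1)))
    return np.hstack(cols)

ud_cats = {"academics": ["gpa", "midterm"], "phys_health": ["std", "clinic"]}
liwc_cats = {"social": ["anyone", "campus"], "affect": ["stressed"]}
X_ud = category_counts(posts, ud_cats)
X_liwc = category_counts(posts, liwc_cats)

# (4) Feature stacking: concatenate the dense LSA block with the lexicon blocks.
X_stacked = np.hstack([X_lsa, X_ud, X_liwc])
print(X_stacked.shape)  # (3, 2 + 2 + 2)
```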

The Scikit-Learn [88] library was used for feature engineering, dimensionality reduction, and supervised machine learning. The Gensim [89] library was employed for generating word embeddings, and the Keras [90] wrapper with a Tensorflow [91] backend was used for the benchmarking experiments with LSTMs.
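A minimal sketch of the kind of LSTM benchmark model used is shown below; the hyper-parameters (embedding size, units, dropout, batch size) are placeholders for the values explored via grid search in Section 5.1.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

posts = ["i failed my midterm again", "free pizza in the quad"]
labels = np.array([9, 0])  # 0 = no taboo, 9 = academic performance

# Integer-encode the posts and pad to a fixed length.
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(posts)
X = pad_sequences(tokenizer.texts_to_sequences(posts), maxlen=50)

model = Sequential([
    Embedding(input_dim=5000, output_dim=64),
    LSTM(64),
    Dropout(0.5),
    Dense(10, activation="softmax"),  # classes 0-9
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])
model.fit(X, labels, epochs=3, batch_size=32)
```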

4.4 Transfer Learning

Transfer learning lets us harness the context learned from a source dataset for a task on a destination dataset. This is critical in the context of anonymous social media in particular, as any anonymous confession platform can shut down or lose popularity. We observed churn of users from FCBs to YikYak; after YikYak closed down [92], there was churn to Whisper [93] and Reddit confession forums, and now FCBs are regaining popularity. Due to the ephemerality of these platforms, one would ordinarily need to regenerate training data for each new forum, and obtaining high-quality annotated data is logistically expensive. One motivation of this work was therefore to demonstrate that a dictionary-based approach built on the corpus of one technology medium can work on another.

Typically in transfer learning, the source dataset provides more overall thematic context than the destination dataset can, while the destination dataset provides more specific context. Given that FCBs have no character or word limits, we believe more contextual information can be gained from the FCB dataset than from the shorter yak posts, which are restricted to 200 characters. Transfer learning experiments from FCB (source) to the YikYak (target) dataset, instead of training a combined model, help us validate and evaluate the efficacy and generalizability of the dictionary-based approach. There is a second reason for choosing FCBs as the source dataset: FCBs are still active forums, while YikYak has been retired. FCBs have been around for almost a decade, and while some individual university pages have stopped generating content or have been closed, new FCB pages continue to appear.
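Operationally, the transfer protocol reduces to training on the source feature matrix and evaluating on the target, as sketched below on synthetic stand-ins for the stacked features of Section 4.3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-ins: X_fcb/y_fcb play the source (FCB) corpus and
# X_yak/y_yak the target (YikYak) corpus, both vectorized identically.
X_fcb, y_fcb = make_classification(n_samples=800, n_features=30, random_state=0)
X_yak, y_yak = make_classification(n_samples=200, n_features=30, random_state=1)

clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_fcb, y_fcb)  # train only on the source dataset
print("transfer accuracy:", accuracy_score(y_yak, clf.predict(X_yak)))
```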

5 Results and Discussion

In this section, we present the experimental results on the FCB and YikYak datasets using the proposed approach and compare them with other approaches, including state-of-the-art techniques such as LSTMs and embeddings. Further, we discuss the rationale behind the superior performance of our proposed algorithm over other techniques for this problem.

5.1 Experimental Results

Table 4 compares the cross-validation accuracy of the proposed stacked vectorizer across different machine learning algorithms against other text vectorization schemes that have proved successful for various text mining tasks. An extensive grid search across hyper-parameters and different combinations of stopword lists and n-gram ranges was performed for all the machine learning algorithms until the best cross-validation performance was achieved. For the LSTM classifier, various combinations of loss functions, batch sizes, and dropout were explored.
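A small slice of such a grid search can be sketched as follows; the grid and the synthetic stand-in data are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the stacked feature matrix and taboo labels.
X, y = make_classification(n_samples=500, n_features=30, random_state=0)

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", "log2"],
}
search = GridSearchCV(ExtraTreesClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)  # scoring defaults to accuracy for classifiers
print(search.best_params_, f"cv accuracy = {search.best_score_:.3f}")
```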

Table 4 Evaluation of cross-validation accuracy across different models for FCB and YikYak are presented (asterisk indicates models that used other vectorizers instead of the proposed vectorizer in this work)

The prediction accuracy using the RandomForests and ExtraTrees algorithms with the proposed vectorization scheme on the FCB dataset surpasses the accuracy using LinearSVM on a vanilla tf-idf representation (statistically significant, p < 0.01). Figures 3 and 4 present the confusion matrices for classification using the vanilla tf-idf representation and our proposed vectorization scheme, respectively. Figure 5 illustrates the confusion matrix for the predicted labels after cross-validation on the YikYak dataset.

Fig. 3 Confusion matrix using tf-idf. Labels 1 to 9 are the class labels for the taboo categories, following the scheme in Table 2; label 0 denotes a post with no taboo

Fig. 4 Confusion matrix using the proposed model. Labels 1 to 9 are the class labels for the taboo categories, following the scheme in Table 2; label 0 denotes a post with no taboo

Fig. 5 Confusion matrix for the YikYak dataset. Labels 1 to 9 are the class labels for the taboo categories, following the scheme in Table 2; label 0 denotes a post with no taboo

Different feature stacking combinations for the proposed vectorizer were explored, and the results of the comparison are presented in Table 5. The prediction accuracy using the RandomForests and ExtraTrees algorithms with the proposed vectorization scheme on the YikYak dataset surpasses the accuracy using LinearSVM on a vanilla tf-idf representation (statistically significant, p < 0.05). The accuracy for the transfer learning task is lower across all the algorithms compared to the supervised task, for which we see two primary reasons: the YikYak dataset has a different distribution of taboo categories than FCBs, and the annotation schema used for labeling was primarily developed for categorizing taboos in FCBs.

Table 5 Comparison of cross-validation accuracy across different combinations of the stacked vectorizer approach (up to 3 significant digits)

5.2 Discussion

In this work, we build a supervised learning approach to predict taboo topics by harnessing psycho-lingual and crowd-sourced dictionaries. The proposed vectorization approach was compared against other vectorization schemes, namely Bag of Words, tf-idf, LSA, GloVe, and Word2Vec. Although the accuracy using vanilla tf-idf was lower than that of the proposed stacked vectorizer, it performed much better than the other vectorization approaches. This was not unexpected: Word2Vec models perform well on much larger datasets, and the GloVe word embeddings, trained on a corpus of Wikipedia and Twitter data, reflect a different distribution of content and semantic information than the FCB dataset.

For the vanilla tf-idf based model, the best performance was achieved using LinearSVM [57]. This can be attributed to SVMs [97] being universal learners whose support vectors are independent of the dimensionality of the feature space; hence, SVMs can learn from the sparse feature matrices originating from Bag of Words or tf-idf representations. For the proposed stacked vectorizer, the best performance was achieved using the ExtraTrees and RandomForests classifiers, both of which utilize an ensemble of decision trees that allows them to reduce the classification bias.

Although LSTMs do not perform on par with RandomForests or ExtraTrees on the FCB dataset, they perform better than the other algorithms. We anticipate that training on a larger labeled corpus would lead to better cross-validation accuracy. The lower accuracy on the transfer learning task on the YikYak dataset is understandable given its even smaller size.

One interesting observation from this study was the superior performance of RandomForests and ExtraTrees compared to LinearSVMs, which usually perform best among traditional machine learning algorithms for text categorization tasks. This can be attributed to the reduced dimensionality of the feature matrix under the proposed vectorizer compared to vanilla tf-idf.

A comparison of the confusion matrices for the vanilla tf-idf representation (Fig. 3) and the proposed stacked vectorizer (Fig. 4) demonstrates the benefit of introducing context via lexicons. The tf-idf representation is better at categorizing texts that contain no taboo, which may be due to classifier bias towards the majority class (no taboo), all the taboo categories being minority classes. However, as a result of both the dimensionality reduction of the tf-idf matrix and its combination with the feature representations from the lexicons, this bias is reduced under our proposed vectorizer.

The novel vectorization scheme propounded in our study illustrates the scope of concept-driven supervised learning models for predicting abstract topics such as taboos from a social media corpus. Understanding context is even more important for supervised learning from a small dataset. The application of deep neural networks to text categorization has suggested a reduced need for feature engineering and dimensionality reduction; however, the caveat with deep neural network-based models such as LSTMs or convolutional neural networks is that they usually necessitate a large labeled dataset. Thus, for smaller datasets, an explicit understanding of the dataset domain and subsequent feature engineering can produce better prediction accuracy. Table 5 shows that including both corpus- and lexicon-based information enriches the prediction models and yields higher accuracy than corpus- or lexicon-based feature representations alone.

6 Conclusions and Future Work

A methodology for predicting taboo topics in social media disclosures by synthesizing a corpus-based approach with crowd-sourced and psycho-lingual lexicons is propounded in this work. The psychological text analysis tool LIWC and the crowd-sourced Urban Dictionary are combined with tf-idf vectorization for supervised learning of taboos from anonymous social media datasets. The proposed approach of stacking feature matrices extracted from corpus- and lexicon-based approaches delivers higher prediction accuracy than learning from corpus-based or lexicon-based approaches alone, achieving cross-validation accuracies of up to 78.1% on the supervised learning task on the FCB dataset and 70.5% on the transfer learning task on the YikYak dataset. With this ensemble methodology, abstract concepts or themes (in this case, taboos) can be identified. The relative success of transfer learning on the YikYak dataset hints at the generalizability of the approach for learning abstract themes from self-disclosure texts.

An effective active learning system can lower the expense of annotation by selecting the samples most essential for improving classification accuracy. Furthermore, we plan to release this work in the future as a web-based application and API, where a client can submit a social media post or an unlabeled corpus as a request and obtain a prediction with a confidence score for each taboo category. The success of ensemble decision tree based algorithms in reducing classification bias urges the exploration of combining multiple learning models using boosting and bagging [98]. Although Word2Vec did not yield satisfactory results on the FCB and YikYak datasets, future exploration of paragraph vectors [99] may overcome the loss of semantic information when learning from a dataset of varying post lengths. We also urge researchers to investigate other combinations of corpus- and lexicon-based approaches, including combining embedding-based approaches with lexicons.