Virtual Tribes: Analyzing Attitudes Toward the LGBT Movement by Applying Machine Learning on Twitter Data

Bittner, Moritz; Dettmar, David; Morejon Jaramillo, Diego; Valta, Maximilian Johannes

doi:10.1007/978-3-030-48993-9_12

Part of the book series: Springer Proceedings in Complexity ((SPCOM))

Included in the following conference series:

Collaborative innovation networks conference of Digital Transformation of Collaboration

Abstract

In this paper, we investigate the application of machine learning techniques in the context of social media. Specifically, we aim to infer users’ attitudes toward the LGBT movement from their behavior and language on Twitter. Using an adjusted version of the Cross Industry Standard Process for Data Mining (CRISP-DM), we create a prediction model for investigating and identifying those attitudes. Furthermore, we formulate step-by-step instructions for its deployment. We provide the reader with a theoretical background for our research domain and describe the methods that we use. Results show that there are two groups with contrary attitudes toward the LGBT community and that the language and behavior of users in the two groups differ. We also identify word analyses as a valuable means of prediction. In addition, we apply our model to another dataset to investigate its overlap with the previously identified groups and demonstrate its effectiveness for predicting the attitudes of a single actor on Twitter. Finally, we critically assess our findings and propose further fields of investigation in this area.

Introduction

Since ancient times, tribes have been a popular concept in societies [5]. Tribes are groups of people that share the same language and values, such as culture and history. In particular, tribe members exalt their tribe above other tribes and groups, which leads to tribal consciousness and tribal loyalty (Cambridge Dictionary; Merriam-Webster Dictionary). Ancient tribes often lived among themselves, detached from others. When two different tribes met, conflicts were likely to arise, and differences in social living, technological developments, or values came to light [10].

Today, in times of global convergence, these strong differences between tribes’ realities and belief systems seem to disappear at first sight. However, due to social fragmentation, diversification, and the development of new communication channels in the field of information and communications technology (ICT), communities that establish themselves are not easily detectable. In the following, these communities are referred to as virtual tribes. Like ancient tribes, such virtual tribes define their own truths and live within their tribes’ reality [9]. Using different tools, an analyst can identify and collect tribe members for whichever tribal macro-category is the subject of an investigation [12]. Later on, the likelihood that a certain social platform user is a member of one of these tribes can be measured using machine learning techniques.

With more than 320 million active users [20] and 500 million tweets per day (Twitter, Inc.), Twitter is a rich source of data that can be used for research. In the past, there have been many scientific investigations based on its abundance of accessible data, such as extensive analyses of the happiness paradox (friends in social networks generally seem to be happier than the considered user) or of how users’ behavior on the platform relates to income [1, 15].

While access to information seems to rise in the progressing information era, people are able to hide behind their online accounts when making statements of a political or socially relevant nature. Investigating online accounts offers opportunities for data scientists to understand trends and sentiments of society and to draw conclusions about relevant character traits of online platform participants. In contrast to classical clipboard surveys, analyzing online accounts may mitigate honesty biases, as people are more willing to disclose information in online environments [18]. Therefore, this approach offers a valuable complementary perspective on sensitive topics (political or societal) compared to results from a questionnaire. Findings can be used to guide decisions made by policy-makers in the real world, as a person’s personality characteristics and his/her behavior in both the real and the online world are significantly connected [6]. Findings depend on the chosen category of investigation. In our work, we chose to investigate controversies that arise around the topic of sexuality.

Sexuality encounters openness on the one hand and rejection on the other. Discussions about sexual orientation are shaped by the history and background of the conflicting parties. Modern or traditional education and religious aspects influence the opinions of the participants. Therefore, sexual orientation is a multilayered topic. Since the nineteenth century, organizations and communities have promoted a loosening of regulations against sexual orientations that diverge from the conventional composition of a couple as man and wife [2]. Thus, they have made the discussion vivid and relevant to society. Identifying communities that engage with sexual orientation offers a better understanding of the composition of society as a whole.

Our work addresses the following research question: How do machine learning techniques allow us to infer users’ attitudes toward the LGBT movement from their behavior and language on Twitter? In order to answer this question, we first give the reader an overview of the theoretical background of our research and formulate four research hypotheses. Second, we explain the methods we used in detail and present the results of our work. Finally, we critically discuss our findings and give an outlook on further research fields.

Theoretical Background and Related Research

This section will focus on discussing fundamental definitions the reader will encounter through the rest of this paper. Besides that, other related work will be briefly discussed in order to show the relevance of the topic.

COINs

Collaborative Innovation Networks (COINs) are innovation networks that are often self-organized and form independently of formal organizational structures in companies or within company networks [12].

Tribefinder

A tribe is “a network of heterogeneous persons linked by a shared passion or emotion” [4]. The system Tribefinder identifies these virtual tribes. Using data from the social media platform Twitter, it analyzes an individual’s tweets by extracting information about key people, brands, used words, and topics of his or her tweets and categorizes the user into tribes belonging to five specific tribal macro-categories: personality, alternative realities, ideologies, lifestyle, and recreation. For analyzing and identifying virtual tribes, the continuous stream of tweets is an important source of information, which offers a powerful setting for studying and identifying tribes of individuals [19].

Using Tribefinder and the tribal vocabulary (which tribes are identified by which words or vocabulary) it learns, it is now possible to establish the tribal affiliations of every Twitter user. In practice, Tribefinder analyzes the individual’s word usage in her or his tweets and then assigns the corresponding personality, alternative realities, ideologies, lifestyle, and recreation tribal affiliation based on the similarities with the specific tribal vocabularies.

Hypotheses

For the purpose of our research, we formulate four hypotheses. In order to clearly predict users’ attitudes toward LGBT, we need at least two groups with different attitudes that differ in their language and behavior, which we call the LGBT and Anti-LGBT tribes:

H1: Two groups exist that differ strongly in their attitude toward the LGBT movement.

H2: These two groups use different language and reveal different honest signal characterizations.

In our work, we believe in the effectiveness of word analyses and demonstrate a bag-of-words approach:

H3: Analyzing users’ words used in Twitter provides a high potential for prediction.

Finally, we apply our model to another tribe that consists of people who are against gun control regulations. Intuitively, we consider a convergence of opinions between the Anti-LGBT tribe and the contra-gun-control tribe as more likely than between the LGBT-tribe and the contra-gun-control tribe:

H4: More people in the contra-gun-control tribe are classified as Anti-LGBT tribe members than as LGBT tribe members.

Methodology

To analyze large amounts of data, a proper framework or guideline is required in order to obtain a suitable amount of accurate data for our project. Since data mining is a creative process which requires different skills and knowledge, it is very hard to tie the success of the project to the knowledge of a single team member [16]. Therefore, we rely on the Cross Industry Standard Process for Data Mining (CRISP-DM) guideline, which structures our thinking and guides us toward accurate data for the development of this project (Table 12.1). Many of the required steps and processes to gather the data have been discussed and addressed in section “Introduction” of this paper. The CRISP-DM model is divided into six phases which can interact in a cyclic pattern. The phases are Business Understanding, Data Understanding (which we refer to as Data Interpretation), Data Preparation, Modeling, Evaluation, and Deployment; they are discussed in this section [22].

Table 12.1 Overview of process steps by software/methods used

For our project, we altered the first phase of the CRISP-DM model Business Understanding to Domain Understanding, since we are gathering and understanding information about a certain domain rather than a business venue. In our approach of Domain Understanding we worked around our main project task, which was to find out how different tribes with specific characteristics develop and correlate in digital networks. In order to do that, we brainstormed and gathered our ideas on which communities clash against each other the most and which ones were represented through a social media outlet such as Twitter. Out of this brainstorming session, we decided to analyze the correlations between LGBT and Anti-LGBT communities.

To learn more about the differences between the communities, we familiarized ourselves with the basic terminology using Google Scholar and Wikipedia and applied Coolhunting methods to identify the most influential trendsetters in these communities. To find more information about Anti-LGBT attitudes and what they comprise, we looked for extremist groups and websites which promote them. We also started looking for representations of these communities on Twitter by identifying important and common hashtags and popular personalities within these communities. The gathered information is discussed further in section “Results” of this paper.

We implemented a Data Sourcing phase before the Data Interpretation phase in our model which shows an alteration from the presented CRISP-DM model. At this point we used Condor, a software program developed by Galaxyadvisors, to measure the structure, content, sentiment, and influence of social communication networks over time. Condor also provides visualization features which we use to better understand the data we gather. Here we used three different approaches to collect the required data, which we derived as useful from the Domain Understanding phase. The first approach was focused on gathering the data via the Tribe Creator, also provided by Galaxyadvisors, where a certain keyword could be used as input such as a hashtag “#” in order to filter the results by the given input.

Here, the tool would provide us with users and their Twitter ids, which we could use to search for friends and followers of that specific user. The second approach was to manually search Twitter for specific users that would also use certain keywords, hashtags, or phrases. The third approach was to use Condor and its tribe-fetch function to find users who also used a certain keyword. This gave us a list of users, which we then added to Tribe Creator. The focus of this phase was to create tribes (section “Introduction”), which we would later use to create the final datasets for our data mining model. The results of this phase will be thoroughly discussed in section “Results” of this paper.

In the Data Interpretation phase, we loaded the raw gathered data into Condor in order to better understand the connections between the individual actors. This phase is closely tied to the Data Preparation phase, due to the functionalities and calculations that Condor provides. Thanks to the different visualization functionalities, the user can understand how tribes differ in structure. Apart from that, Condor can calculate social network measures such as degree centrality, betweenness centrality, and closeness centrality, which all show the importance and position of certain actors within the network. The results of this phase will be presented in section “Results” of this paper.
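To make the centrality measures concrete, the following is a minimal sketch in plain Python rather than Condor, whose internal implementation is not described here. It computes degree and closeness centrality on an undirected toy network; the graph, the node names, and the normalization choices are assumptions made for illustration, and betweenness centrality is left out for brevity.

```python
from collections import deque


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def closeness_centrality(graph):
    """Reachable nodes divided by the sum of shortest-path distances (BFS)."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            current = queue.popleft()
            for neighbor in graph[current]:
                if neighbor not in dist:
                    dist[neighbor] = dist[current] + 1
                    queue.append(neighbor)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical toy network: hub "a" linked to everyone, plus one link b-c.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

deg = degree_centrality(toy)
clo = closeness_centrality(toy)
for node in sorted(toy):
    print(node, round(deg[node], 3), round(clo[node], 3))
```

In this toy network the hub "a" scores highest on both measures, while the peripheral node "d" scores lowest, which mirrors the leader and follower pattern such measures are meant to expose.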

In our Data Preparation steps, we used different tools to properly reduce the data for its optimal and most effective use. We decided that words, their frequency, and how often they appear within a certain tribe would help us to predict a certain tendency toward a tribe. Therefore, we needed to prepare the data in such a way that words should be the most resonant part of the data. In order to do so we first used some of Condor functionalities which calculate the six honest signals of collaboration, which are the most evident through the tweets we have collected through Twitter. The six indicators are namely central leadership, rotating leadership, balanced contribution, rapid response, honest language, and shared context. With these signals, future creativity, performance, and outcomes for teams can be predicted [13]. Besides the six honest signals of collaboration, another important way of making words the core of our data was to calculate the Pennebaker Pronouns.

Here, the number of pronouns within the tweets of every user was counted. Condor has a built-in function that does so automatically and calculates the probability that a certain pronoun will appear in a tweet of the observed person [13]. Pennebaker discovered that how people use pronouns has a high predictive value [14]. After having Condor prepare the data, we processed the export with a script written in R. R is both a programming language and an environment for statistical computing and graphics; because of its wide variety of statistical techniques (linear and nonlinear modeling, classical statistical tests, classification, and more), it seemed the most efficient solution for our data. With the R script, we prepared the words in such a way that they can be used in a machine learning algorithm. The bag-of-words approach helps us in this specific task: it describes the occurrence or frequency of a word within a certain document [8]. Any information besides the words is discarded. Based on the number of occurrences, the intuition is that members of similar tribes will use similar words. This phase was tightly connected with the Data Interpretation and Modeling phases, since many iterations and changes to the data had to be made in order to fit it to our model.
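The bag-of-words idea can be illustrated with a short sketch. It is written in Python for readability and is not the R script used in this project; the tokenizer, the function names, and the two example texts are assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens; all else is discarded."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, matrix) with one row of term counts per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


# Hypothetical aggregated tweet text, one string per user.
docs = [
    "Proud of my community, love my friends",
    "They said that the senate bill failed",
]

vocab, matrix = bag_of_words(docs)
print(vocab)
print(matrix)
```

Each row of the resulting matrix is one user and each column one word, so word order and punctuation are lost and only the counts remain as features.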

In the Modeling phase, we decided to use a modeling tool called RapidMiner Studio. RapidMiner Studio is a visual workflow designer, which helps develop prototypes for predictive models. Its graphical user interface (GUI) and documentation lead the user through the whole process of modeling and provide further information about every function, algorithm, or component that is used [17]. We integrated our prepared data into the tool and applied all predictive machine learning algorithms available in the toolset of RapidMiner Studio. Cross-validation provides solid accuracy metrics for a given model and its parameters. Based on accuracy comparisons of different model configurations, we selected the best choice of attributes, model, and parameters. After the Modeling phase, the Evaluation of the model is required. Here, the results of all algorithms are taken into consideration. Our decision is mainly based on the highest accuracy achieved by the models that were trained with the different machine learning algorithms. Accuracy is calculated as the number of correct predictions divided by the total number of predictions. A prediction is correct if the predicted value matches the label attribute that we selected in the Modeling phase and applied to the RapidMiner model. The results of our Modeling and Evaluation are discussed in section “Results”.
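The accuracy definition and the cross-validation loop can be sketched as follows. This is a plain Python illustration, not the RapidMiner process; the number of folds is not stated in the text, so `k` is an assumption, and `majority_baseline` is a hypothetical placeholder learner standing in for the algorithms that were actually compared.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k nearly equal, disjoint folds."""
    return [list(range(i, n_rows, k)) for i in range(k)]


def accuracy(predicted, actual):
    """Number of correct predictions divided by the number of predictions."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def cross_validate(rows, labels, train_fn, k):
    """Train on k-1 folds, predict the held-out fold, pool all predictions."""
    predictions = [None] * len(rows)
    for fold in k_fold_indices(len(rows), k):
        held_out = set(fold)
        train_x = [r for i, r in enumerate(rows) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train_fn(train_x, train_y)
        for i in fold:
            predictions[i] = predict(rows[i])
    return accuracy(predictions, labels)


def majority_baseline(train_x, train_y):
    """Placeholder learner: always predicts the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return lambda row: most_common


# Hypothetical data: 20 users with dummy feature rows and tribe labels.
rows = list(range(20))
labels = ["Anti-LGBT"] * 12 + ["LGBT"] * 8
score = cross_validate(rows, labels, majority_baseline, k=5)
print(f"cross-validated accuracy: {score:.2f}")
```

Because every row is predicted exactly once by a model that never saw it during training, the pooled accuracy is a fairer estimate than accuracy measured on the training data itself.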

It is important for us to develop a model which can be used for two scenarios. Firstly, for predicting a certain tribe within another tribe, and secondly, for predicting a user’s tendency toward one tribe or another depending on his/her tweets. In the Deployment phase, we prepared the model in such a way that it is accessible for every example and dataset. This is achieved by a documentation of how to use the model and where to introduce the example dataset. To conclude our methods used during this project, it is important to understand the iterative and cyclic nature as seen in Fig. 12.1. Every phase can be altered in order to adjust the final dataset to provide the best possible outcome of the intended predictive model. Within the Deployment phase, the subphase Demonstration takes its place. A finished model used with real-time data tests its potential prediction.

Results

In this section, we will present the results structured by the phases of our adjusted CRISP-DM. We worked iteratively during the project making use of the loops the methodology provides. In order to provide a clear overview, we will only present the results of the last iteration of the respective phases here.

Domain Understanding

Sexualities are split up into several groups. Heterosexuality, which can be considered the most traditional and common sexuality, describes the sexual preference for the opposite gender. In addition, there are alternative sexual preferences such as homosexuality, bisexuality, transsexuality, and others. Most alternative sexualities are subsumed under the LGBT movement. Therefore, we chose this group as a major tribe for our considerations. LGBT stands for lesbian, gay, bisexual, and transgender. Moreover, variants such as LGBTQ, LGBTQ+, and LGBTQI+ exist, which is also reflected in hashtag usage. All these terms usually refer to the same community and the basic idea that open-mindedness toward sexuality is important and that one should tolerate all sexual minorities. As a result of our Coolhunting, we identified LGBT as the most common hashtag and community that is referred to. Therefore, we defined our LGBT tribe as people who openly support lesbian, gay, bisexual, or trans people. In order to get contrasting training data for our final model, we considered people who are significantly different from LGBT supporters, that is, people who are opposed to the LGBT movement. Typical related keywords in the literature are homophobia and transphobia. In the course of our explorative research on Twitter, we identified a few potential subtribes regarding these attitudes. The spectrum ranges from users who are opposed to gay marriage, to users who express in their tweets that alternative sexualities are diseases that need to be cured, to users who verbally attack LGBT communities on Twitter in a disrespectful way. To include these different phenomena, we broadly defined our Anti-LGBT tribe as people who are opposed to LGBT sexual orientations.

Data Sourcing

Currently, the V1 LGBT tribe collected in Tribefinder contains 168 members who actively use Twitter. V1 Anti-LGBT consists of 119 members. The tribe-fetch with Condor resulted in two datasets of network (Twitter) data containing a total of more than 20,000 actors (users) and 480,000 links (tweets) including all the tribe members and their respective social networks on Twitter.

Data Interpretation

This stage was highly interrelated with the subsequent Data Preparation stage (see section “Methodology”). Therefore, we include results regarding features that were actually generated by the later Data Preparation stage. Apart from Condor-generated features and visualizations, we look at the tribe member datasets resulting from Data Preparation, including the bag-of-words features. A look at the network graphs in Condor gives a first insight into the different tribes.

The graphs depicted in Fig. 12.5 display all actors and links of the respective tribes in their surrounding network. Yellow nodes are tribe members, and the node size scales with the betweenness centrality measure. While both networks seem quite strongly connected, the LGBT network appears somewhat more so. Tribe members in the LGBT network are more often strongly connected and located closer to the middle of the graph. It is striking that the LGBT network also shows more non-tribe members that are quite central. In contrast, the Anti-LGBT network shows a few tribe members that are very central (big yellow nodes) but only a few central nodes from outside the tribe. This could mean that the Anti-LGBT community is more isolated from and less connected with important people unrelated to the tribe. Moreover, the Anti-LGBT network mainly consists of a few very central leaders and many non-central followers.

The tendency toward decentrality in the Anti-LGBT tribe versus collaborative centrality in the LGBT network is also reflected in the t-test results and boxplots (Fig. 12.6). The median Anti-LGBT tribe member has a lower betweenness and degree centrality than the median LGBT tribe member. Betweenness centrality oscillation is comparatively higher in the Anti-LGBT tribe. The word clouds generated with Condor (Fig. 12.7) give us a good feeling for the language use of our two tribes. The size of a term depicts its relative frequency in tweets. The color indicates the detected sentiment, ranging from negative (red) to positive (green). The LGBT tribe has an overall more positive sentiment than the Anti-LGBT tribe, which is also confirmed by the t-test results. Regarding the content, we find that Anti-LGBT tribe members use political terms (e.g. “wall”, “bill”, “nation”, “democrats”, “senate”, “Obama”, and “Trump”) and religious terms (e.g. “christian” and “god”) significantly more frequently. The language of LGBT tribe members is instead dominated by social terms (e.g. “community”, “friends”, “family”, “people”, “today”) and LGBT-related terms (e.g. “trans”, “love”, “pride”, “gay”, “lgbt”, “transgender”, and “person”). Regarding Pennebaker Pronouns, a look at the word usage distributions suggests that LGBT community members tend to use personal pronouns in a self-related way, while the Anti-LGBT community tends to use more non-personal pronouns or personal pronouns referring to other people. In particular, the t-tests confirm that “my”, “me”, and “it” are used significantly more often in the LGBT tribe. Anti-LGBT tribe members, on the other hand, use the word “the” significantly more often. Moreover, they use the words “his”, “they”, and “that”, which enter the model as bag-of-words features, significantly more frequently (Table 12.2). These findings are also reflected in the weights of the final model’s features, suggesting that pronoun features contribute well to the predictability of tribe membership (Tables 12.3 and 12.4).
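A two-sample t-test of the kind reported in Table 12.2 can be sketched as follows. The text does not state which t-test variant was used, so this sketch assumes Welch's unequal-variance version; it computes only the test statistic and the approximate degrees of freedom (a p-value would additionally require the t distribution), and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical per-user frequencies of the word "my" in each tribe.
lgbt_my = [2, 4, 4, 6]
anti_my = [1, 1, 2, 2]

t, df = welch_t(lgbt_my, anti_my)
print(f"t = {t:.3f}, df = {df:.2f}")
```

A large positive statistic here would indicate that the first group uses the word more often on average than the second, relative to the spread within both groups.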

Table 12.2 T-test results sorted by p-value, cut at p ≤ 0.05

Table 12.3 Features which the generalized linear model attributes to Anti-LGBT

Table 12.4 Features which the generalized linear model attributes to LGBT members

Data Preparation

As a result of our Data Preparation, the final training and evaluation dataset is a combined dataset of all our tribe members with 134 features plus our target variable, the tribe name. Of these attributes, 49 come from the fetched actor data itself and from metrics that Condor calculates from the actors’ network (six honest signals of collaboration and Pennebaker Pronouns). The remaining 85 attributes are generated from the aggregated link data by means of our bag-of-words processing. It should be noted that the final training dataset consists of 115 Anti-LGBT entries and 111 LGBT entries due to filters in the process, such as the filter in Condor that removes actors with too little activity for meaningful metrics.

Within the final modeling process in RapidMiner (see next paragraphs), we deselected some of the features: features with too many missing values as well as identity-like attributes, such as names. This resulted in a final training dataset with 226 rows and 105 columns (104 features plus the target variable).

Modeling

Our model classifies a Twitter user as an LGBT (or Anti-LGBT) tribe member, given the entity including all its 104 features. Based on our Evaluation, we choose a Generalized Linear Model, a machine learning model for classification problems such as ours. We trained the model using the RapidMiner process (depicted in Fig. 12.2). The process consists of four data processing steps: data retrieval from the imported training data, selection of the target variable (Tribe), the final selection of features to be used for training, and the training and testing within a cross-validation (see Fig. 12.3).
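How a linear model of this kind turns feature values into a class and a confidence can be sketched as follows. The text does not specify the link function or report the fitted coefficients at this point, so the sketch assumes a binomial model with a logit link (logistic regression); the weights and the example user are hypothetical and are not the values of Tables 12.3 and 12.4.

```python
from math import exp


def predict_confidence(features, weights, intercept=0.0):
    """Logit link: map the weighted feature sum to P(Anti-LGBT)."""
    score = intercept + sum(
        weights.get(name, 0.0) * value for name, value in features.items()
    )
    return 1.0 / (1.0 + exp(-score))


def classify(features, weights, intercept=0.0):
    """Return the predicted tribe and the confidence of that prediction."""
    p_anti = predict_confidence(features, weights, intercept)
    label = "Anti-LGBT" if p_anti >= 0.5 else "LGBT"
    return label, max(p_anti, 1.0 - p_anti)


# Hypothetical weights: positive values push toward Anti-LGBT,
# negative values push toward LGBT.
weights = {"they": 0.8, "his": 0.5, "my": -0.9, "me": -0.6}

# Hypothetical user described by normalized word frequencies.
user = {"they": 2.0, "his": 1.0, "my": 0.5}

label, confidence = classify(user, weights)
print(label, round(confidence, 3))
```

Under this assumption, the sign of a weight indicates which tribe a feature points to and its magnitude indicates how strongly, which is how feature weight tables of a linear model are usually read.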

Evaluation

The cross-validation of different feature sets and machine learning algorithms revealed the best results for our final model, which utilizes a Generalized Linear Model and 104 features. The final model’s evaluation results are depicted in Fig. 12.3. The performance can be summarized as 77.43% accuracy. The model is slightly more precise on Anti-LGBT predictions (precision: 79.63% versus 75.42%) and has a slightly better recall for true LGBT members (80.18% versus 74.78%). In other words, a prediction of Anti-LGBT is more likely to be correct than a prediction of LGBT, while an actual LGBT tribe member is more likely to be detected as such than an actual Anti-LGBT tribe member.
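The relationship between these figures can be checked with a short calculation. The confusion matrix counts below are not reported in the text; they are a reconstruction that is consistent with the stated class sizes (115 Anti-LGBT and 111 LGBT entries) and with the reported percentages, and should be read as an assumption.

```python
LABELS = ("Anti-LGBT", "LGBT")

# (actual, predicted) -> count. Reconstructed for illustration, not source data.
confusion = {
    ("Anti-LGBT", "Anti-LGBT"): 86,
    ("Anti-LGBT", "LGBT"): 29,
    ("LGBT", "LGBT"): 89,
    ("LGBT", "Anti-LGBT"): 22,
}


def precision(label):
    """Correct predictions of a label divided by all predictions of it."""
    predicted_as = sum(confusion[(actual, label)] for actual in LABELS)
    return confusion[(label, label)] / predicted_as


def recall(label):
    """Correct predictions of a label divided by all actual members of it."""
    actual_total = sum(confusion[(label, predicted)] for predicted in LABELS)
    return confusion[(label, label)] / actual_total


accuracy = sum(confusion[(l, l)] for l in LABELS) / sum(confusion.values())

for label in LABELS:
    print(f"{label}: precision {precision(label):.2%}, recall {recall(label):.2%}")
print(f"accuracy: {accuracy:.2%}")
```

With these counts, 175 of 226 users are classified correctly, which reproduces the reported accuracy as well as the precision and recall values for both tribes.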

In order to decide on a specific algorithm, we tested six different machine learning methods with RapidMiner Auto Model. It revealed that Naive Bayes and the Generalized Linear Model performed best (Fig. 12.9). A follow-up analysis in the custom RapidMiner process showed that the Generalized Linear Model performs best in cross-validation on our final attribute selection. Our evaluation also demonstrates the improvement caused by the inclusion of bag-of-words features. The cross-validation robustly shows an improvement of around 8 percentage points (77.43% instead of 69.03%, Fig. 12.4).

The final bag-of-words configuration sets the maximal allowed sparsity parameter to 0.985. Words from messages are stemmed, and stop words are not removed. We did not evaluate all possible configurations but examined different configurations within a reasonable range. Better results were reached with a higher maximal allowed sparsity level. However, we capped the allowed word sparsity at some point to keep the number of attributes relatively low. Stemming words while not removing stop words proved to be superior to all other combinations of these two Boolean settings in terms of resulting model accuracy.
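The effect of a maximal allowed sparsity can be sketched as follows. The text does not define the parameter precisely, so the sketch assumes that a term's sparsity is the share of documents in which the term does not occur and that terms above the threshold are dropped; the function name, the inclusive boundary, and the example matrix are illustrative assumptions. Stemming is not shown.

```python
def filter_sparse_terms(vocab, matrix, max_sparsity):
    """Keep terms whose share of zero-count documents is at most max_sparsity."""
    n_docs = len(matrix)
    kept = []
    for col, term in enumerate(vocab):
        empty = sum(1 for row in matrix if row[col] == 0)
        if empty / n_docs <= max_sparsity:
            kept.append(col)
    new_vocab = [vocab[c] for c in kept]
    new_matrix = [[row[c] for c in kept] for row in matrix]
    return new_vocab, new_matrix


# Hypothetical term counts: four documents (rows), three terms (columns).
vocab = ["love", "senate", "zebra"]
matrix = [
    [1, 0, 0],
    [2, 1, 0],
    [1, 0, 1],
    [0, 3, 0],
]

new_vocab, new_matrix = filter_sparse_terms(vocab, matrix, max_sparsity=0.5)
print(new_vocab)
print(new_matrix)
```

Under this reading, a threshold of 0.985 keeps only terms that occur in at least 1.5% of the documents, so raising the threshold admits rarer words and increases the number of attributes.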

Deployment

Over the course of the project, one of our goals became a deployable solution that allows the model to be applied. To apply the model to new entities, we fetched the Twitter accounts of three single actors, namely Eminem, Donald Trump, and Peter Gloor. Moreover, we fetched another tribe, the contra-gun-control tribe. The application of our model yields the following results. Members of the contra-gun-control tribe are people who are presumed to be in favor of guns. According to our model, they are mainly identified as Anti-LGBT (LGBT: 0.286 vs. Anti-LGBT: 0.714). The single actors give us a good range of results. Donald Trump (@realdonaldtrump) is identified as Anti-LGBT with a confidence of 97.6%. Marshall Mathers (@Eminem) is identified as Anti-LGBT with a confidence of 73%. Finally, Peter Gloor (@pgloor) is identified as LGBT with a confidence of 60.4%.

Discussion

Our results allow us to draw conclusions that support our hypotheses. Results from our Domain Understanding indicate that there are at least two different groups that differ strongly in their attitudes toward the LGBT movement (H1). Our LGBT and Anti-LGBT tribes represent the two groups that are attuned in either a positive or negative way toward the LGBT movement. Positions inside those groups can vary (especially in the Anti-LGBT group) in their level of aversion or affection. During Data Sourcing, we built two sound tribes by manually inspecting every Twitter account to verify that it shows the attitude desired for the respective tribe. Therefore, researchers can use the data from our tribes as a solid foundation for further analyses. In the Data Interpretation stage, we show that language and behavior differ between the members of the two tribes (H2). This manifests, for example, in sentiment and centrality measures and also in the word use of the tribe members. After proper Data Preparation, the results of our Evaluation phase indicate a high prediction potential for analyzing the words used by users (H3). Including bag-of-words features yields an improvement of around 8 percentage points in cross-validation. Interestingly, more general terms such as pronouns and conjunctions are shown to be more meaningful for our prediction than more content-specific words. The Demonstration inside our Deployment phase indicates that there are LGBT tribe classified people in the contra-gun-control tribe. However, the proportion of Anti-LGBT classified people in the contra-gun-control tribe is considerably larger. Therefore, our hypothesis H4 is supported.

While we achieve strong results that support our hypotheses, several limitations have to be taken into account. A rising quantity of data impedes the calculations of the machine learning algorithms and the preparation of sound prediction models, which led us to limit the data quantity.

Nevertheless, concentrating on a limited quantity of data enables quality improvements, such as manually minimizing poor data like fake accounts and fake tweets, even if we do not emphasize or quantify this procedure further. To improve the quality of our predictions, we mainly focused on accuracy. We do not minimize the complexity of the features used, as we want to ensure maximal accuracy irrespective of performance efficiency aspects. We do not investigate possible trade-off effects on accuracy and performance from limiting or adding different prediction features. With regard to practicality, we also consider developing a more user-friendly IT artifact a suitable way to improve applicability. Our present approach is more of a “do-it-yourself” one. Furthermore, our model is strongly attached to a certain domain. While it does well in the LGBT context, there is no proof that our procedure performs on the same level in other domains of use. Interested scientists could aim at reducing the abovementioned limitations by elaborating on our research in further investigations. We also propose expanding the field of application to other areas. Applying our model to other tribes, for example religious tribes, can provide insights into the effects of tribe affiliations (like religious affiliation) on attitudes toward the LGBT movement. Our model also provides opportunities in the field of tailored marketing. Identifying a person’s attitudes about a certain field can lay the foundation for creating customized advertisements in the next step. Because this approach is likely to be manipulative, moral issues should be taken into consideration, though. All in all, our work offers various insights into machine learning techniques for identifying attitudes from Twitter language and behavior, plus a well-applicable model for the domain of the LGBT movement. While there is potential for further investigation, all of our previously formulated hypotheses are supported.

Conclusion, Outlook, and Limitations

In our work, we showed how machine learning techniques allow us to infer users’ attitudes toward the LGBT movement from their behavior and language on Twitter. For this project, we used an adjusted version of the CRISP-DM process. After identifying two groups with contrary attitudes toward LGBT, we created two virtual tribes using the Tribefinder tool. We showed that the language and behavior of users in the respective tribes differ. Furthermore, we identified word analyses as a valuable means of prediction; here, specific terms are less decisive than general ones such as pronouns or conjunctions. Applying our model to the dataset of the contra-gun-control tribe reveals that the proportion of people classified as Anti-LGBT in that tribe is significantly larger than the proportion classified as LGBT. Applying our prediction model to single Twitter accounts to identify a single user’s attitudes toward the LGBT movement yields comprehensible results. Further research could investigate how larger data quantities affect the model’s quality. Furthermore, investigations could aim at applying our approach and model in domains other than the LGBT movement.
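To make the word-analysis idea more concrete, the following is a minimal Python sketch and not the model built in this work. It computes relative frequencies of a few general words (pronouns and conjunctions) in a text, and the share of each predicted label in a list of classified users. The word list, the function names, and the label strings are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical function-word list (assumption): the vocabulary actually
# used by the prediction model is not given in the text.
FUNCTION_WORDS = ("i", "we", "you", "they", "and", "but", "or", "because")


def function_word_features(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty text
    return {word: counts[word] / total for word in FUNCTION_WORDS}


def label_shares(labels):
    """Return the share of each predicted label in `labels`."""
    if not labels:
        return {}
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


if __name__ == "__main__":
    # Toy inputs for illustration only, not data from the study.
    print(function_word_features("We did it because they asked and we agreed"))
    print(label_shares(["anti", "anti", "pro", "anti"]))
```

Feature dictionaries of this kind could serve as input to any standard classifier, and the label shares correspond to the kind of proportion comparison reported for the contra-gun-control tribe.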

References

J. Bollen, B. Gonçalves, I. van de Leemput, G. Ruan, The happiness paradox: your friends are happier than you. EPJ Data Sci. 6, 497 (2017). https://doi.org/10.1140/epjds/s13688-017-0100-1
CNN Library, LGBT Rights Milestones Fast Facts (2015). https://edition.cnn.com/2015/06/19/us/lgbt-rights-milestones-fast-facts/index.html. Accessed 8 Mar 2019
Cambridge Dictionary “Tribe” (2019). https://dictionary.cambridge.org/de/worterbuch/englisch/tribe. Accessed 8 Mar 2019
B. Cova, V. Cova, Tribal marketing: the tribalisation of society and its impact on the conduct of marketing. Eur. J. Mark. 36, 595–620 (2002)
D. Pritchard, Tribal participation and solidarity in fifth-century Athens: a summary. Anc. Hist. 104–118 (2000)
D. Quercia, M. Kosinski, D. Stillwell, J. Crowcroft, Our twitter profiles, our selves: predicting personality with twitter, in IEEE Third International Conference on Privacy, Security, Risk and Trust (PASSAT), 2011 and 2011 IEEE Third International Conference on Social Computing (SocialCom), 9–11 Oct 2011 (Boston, Massachusetts, USA, 2011). Proceedings [including workshop papers] (IEEE, Piscataway, NJ, 2011), pp. 180–185
Galaxyscope.galaxyadvisors.com. https://galaxyscope.galaxyadvisors.com/tribe/donaldtrump. Accessed 11 Mar 2019
J. Brownlee, A Gentle Introduction to the Bag-of-Words Model (2017). https://machinelearningmastery.com/gentle-introduction-bag-words-model/. Accessed 14 Mar 2019
J.M. De Oliveira, P.A. Gloor, GalaxyScope: finding the “truth of tribes” on social media, in Collaborative Innovation Networks (Springer, Cham, 2018), pp. 153–164
M. Apelt, Forschungsthema: Militär: Militärische Organisationen im Spannungsfeld von Krieg, Gesellschaft und soldatischen Subjekten, 1. Aufl. VS Verl. für Sozialwiss, Wiesbaden (2010)
Merriam-Webster Dictionary “Tribalism”. https://www.merriam-webster.com/dictionary/tribalism. Accessed 8 Mar 2019
P. Gloor, A.F. Colladon, J.M. de Oliveira, P. Rovelli, Identifying Tribes on Twitter through Shared Context (2018)
P.A. Gloor, Sociometrics and Human Relationships (2017). https://doi.org/10.1108/978-1-78714-112-420171027
J.W. Pennebaker, The secret life of pronouns. New Sci. 211(2828), 42–45 (2011). https://doi.org/10.1016/S0262-4079(11)62167-2
D. Preoţiuc-Pietro, S. Volkova, V. Lampos, Y. Bachrach, N. Aletras, Studying user income through language, behaviour and affect in social media. PLoS ONE 10, e0138717 (2015). https://doi.org/10.1371/journal.pone.0138717
R. Wirth, J. Hipp, CRISP-DM: towards a standard process model for data mining (2000). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.198.5133. Accessed 9 Mar 2019
Rapidminer, RapidMiner Studio-RapidMiner Documentation (2019). https://docs.rapidminer.com/latest/studio/. Accessed 14 Mar 2019
S. Benartzi, J. Lehrer, The Smarter Screen: Surprising Ways to Influence and Improve Online Behavior (Portfolio/Penguin, New York, New York, 2015)
S. Bringay, N. Béchet, F. Bouillot, P. Poncelet, M. Roche, M. Teisseire, Towards an online analysis of tweets processing, in Database and Expert Systems Applications (Springer, Heidelberg, Berlin, 2011)
Statista, Number of monthly active Twitter users worldwide from 1st quarter 2010 to 4th quarter 2018 (in millions) (2016). https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/. Accessed 8 Mar 2019
Twitter, Inc., Twitter für Unternehmen. https://business.twitter.com/de.html. Accessed 14 Mar 2019
Wikipedia.org. https://en.wikipedia.org/wiki/LGBT. Accessed 6 Mar 2019
U. Shafique, H. Qaiser, A comparative study of data mining process models (KDD, CRISP-DM and SEMMA) 12(1), 217–222 (2014)


Author information

Authors and Affiliations

Universität zu Köln, Albertus-Magnus-Platz, 50923, Cologne, Germany
Moritz Bittner & David Dettmar
Universität Bamberg, Kapuzinerstraße 16, 96047, Bamberg, Germany
Diego Morejon Jaramillo & Maximilian Johannes Valta


Corresponding author

Correspondence to Moritz Bittner.

Editor information

Editors and Affiliations

Kozminski University, Warsaw, Poland
Aleksandra Przegalinska
Northeastern University, Boston, MA, USA
Francesca Grippa
MIT Center for Collective Intelligence, Cambridge, MA, USA
Peter A. Gloor

Appendix

See Figs. 12.6, 12.7, 12.8, 12.9 and Tables 12.1, 12.2, 12.3, 12.4.


About this paper

Cite this paper

Bittner, M., Dettmar, D., Morejon Jaramillo, D., Valta, M.J. (2020). Virtual Tribes: Analyzing Attitudes Toward the LGBT Movement by Applying Machine Learning on Twitter Data. In: Przegalinska, A., Grippa, F., Gloor, P. (eds) Digital Transformation of Collaboration. COINs 2019. Springer Proceedings in Complexity. Springer, Cham. https://doi.org/10.1007/978-3-030-48993-9_12


DOI: https://doi.org/10.1007/978-3-030-48993-9_12
Published: 29 July 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-48992-2
Online ISBN: 978-3-030-48993-9
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
