1 Introduction

Data that people post on the internet, such as reactions and comments on various topics, can reveal valuable insights into human emotions. Analyzing these ideas and comments can therefore play a crucial role in understanding people’s behavior and responses. With the growing number of microblogs and social media services, people have begun to express their opinions on a wide variety of topics on Twitter and similar platforms. As these platforms grow and spread rapidly, they have become increasingly useful for understanding and modeling various events.

In this chapter, a dataset of tweets collected from Twitter was used. Twitter contains a large number of short messages created by the users of this microblogging platform. The contents of the messages vary from personal thoughts to public statements.

As a microblogging and social networking website, Twitter has become very popular and has grown rapidly. An increasing number of people are willing to post their opinions on Twitter, which is now considered a valuable online source of opinions. As a result, Twitter sentiment analysis provides a quick and efficient tool for evaluating public opinion for business marketing or social research. In this project, sentiment analysis is performed on tweets about Covid-19 and Vaccine. First, word occurrences were counted and visualized, and then sentiment analysis was carried out.

Sentiment is an attitude, thought, or judgement prompted by feeling. Sentiment analysis is the process of determining and measuring the tone, attitude, opinion, and emotional state of responses. More precisely, it is the task of deciding whether a specific conversation is positive, negative, or neutral. In our study, tweets were categorized only as negative or positive.

The rest of this chapter is organized as follows. Section 2 covers the related work. Section 3 describes the methodology. Section 4 presents the results. Section 5 presents the conclusions.

2 Literature Review

There are existing works on sentiment analysis, measuring the influence of a user, and topic modeling. In the paper Sentiment Analysis and Influence Tracking using Twitter [1], the authors describe how Twitter data is used as a corpus for sentiment analysis, and they study different algorithms and methods that help to track the influence and impact of a particular user or brand active on the social network. They used the Twitter API, Twitter Streaming API, and Twitter Search API for data collection. For preprocessing, techniques such as tokenization, normalization, and part-of-speech (POS) tagging are used. To determine the influence of a user, the PeopleRank and TwitterRank algorithms are used. With these data collection APIs, data can be collected from Twitter easily, and the ranking algorithms can help to calculate the influence of a user.

In the paper Detecting Real-World Influence Through Twitter [2], the authors investigated the issue of detecting the real-life influence of people based on their Twitter accounts. The CLEF RepLab 2014 dataset is used. Social Network Analysis (SNA), Principal Component Analysis (PCA), bag of words, POS tagging, the linear classifiers Support Vector Machine (SVM) and LIBLINEAR, logistic regression, LogitBoost, and multinomial Naïve Bayes are used for determining real-world influence. Since bots have no real influence in the real world, this approach helps to estimate a person’s actual influence.

In the paper Topic Modeling of Twitter Conversations [3], the authors presented a way to analyze large amounts of textual data from Twitter conversations efficiently and effectively. Specifically, they explained how to capture the narratives that people share on Twitter about social events, reduce their complexity, and provide plausible explanations. For this, the Latent Dirichlet Allocation (LDA) method is used. With this method, the topics can be extracted from the contexts efficiently and effectively.

In the paper Extracting health-related causality from Twitter messages using natural language processing [4], the authors evaluated an approach to extracting causalities from tweets using natural language processing (NLP) techniques. The Twitter Streaming API is used for dataset collection. To extract causality, lexico-syntactic relations and the NLP pipeline operations of lemmatization, POS tagging, and dependency parsing are used. Because a sentence that expresses a causal relationship well increases the influence of its author on the reader, this approach can also be used for determining the influence of a user. However, because there are so many distinct ways to express cause and effect relationships in a phrase, it is difficult to keep track of them all.

In the paper Investigating the Relationship between Trust and Sentiment Agreement in Arab Twitter Users [5], the authors proposed a research methodology framework for investigating the relationship between trust and sentiment agreement on Twitter and explained the framework by applying it to a use case from Saudi Arabia. For this, the MarkovTrust algorithm, an adaptation of the EigenTrust algorithm, is used. Surface analysis, deep analysis, and shallow analysis algorithms are also used to determine the relationship between trust and sentiment agreement. Since the context and sentiment are taken into consideration, the trust of a user can be determined more accurately.

In the paper Influence Analysis of Emotional Behavior and User Relationships Based on Twitter Data [6], the authors analyzed the influence of emotional behavior on user relationships based on Twitter data using two dictionaries of emotional words. Random sampling is used for data collection, keyword matching for calculating the emotion score, and the Brunner-Munzel test for testing. By looking at emotional behaviors, the influence of a user can be determined.

The related work is summarized in Table 1.

Table 1 Related works

3 Methodology

3.1 Data Collection

Before the sentiment algorithm can be implemented and used in the further steps of the project, a data collection technique is needed. The data was collected from a social media website with a scraper. A scraper is a type of software used to copy content from a website. In this project, Snscrape was used for this purpose. Snscrape is a scraper for social networking services (SNS). It scrapes things like user profiles, hashtags, or searches and returns the discovered items, e.g., the relevant posts.

Fig. 1 shows an example of the data that was collected from Twitter and saved as a csv file.

Fig. 1 Example Covid-19 tweet data from Snscrape
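
The collection step could look roughly like the following minimal sketch. It assumes the `snscrape` Python package and its `TwitterSearchScraper` class; the function names, the chosen columns (`id`, `date`, `text`), the query string, and the output file name are hypothetical and are not taken from the original project.

```python
import csv
import io
from itertools import islice


def scrape_tweets(query, limit):
    """Yield up to `limit` tweets matching `query` (needs the snscrape package)."""
    # Imported lazily so the CSV helper below also works without snscrape.
    import snscrape.modules.twitter as sntwitter

    scraper = sntwitter.TwitterSearchScraper(query)
    yield from islice(scraper.get_items(), limit)


def tweet_text(tweet):
    """Return the tweet text; newer snscrape releases call it `rawContent`,
    older ones `content` (assumption: one of the two is present)."""
    return getattr(tweet, "rawContent", None) or getattr(tweet, "content", "")


def tweets_to_csv_text(tweets):
    """Serialize tweets to CSV text with a header row (columns are illustrative)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["id", "date", "text"])
    for tweet in tweets:
        writer.writerow([tweet.id, tweet.date, tweet_text(tweet)])
    return buffer.getvalue()


# Hypothetical usage (requires network access and a working snscrape install):
# csv_text = tweets_to_csv_text(
#     scrape_tweets("covid since:2020-12-01 until:2021-01-01", limit=1000))
# with open("covid_dec2020.csv", "w", newline="", encoding="utf-8") as f:
#     f.write(csv_text)
```

Keeping the scraping generator separate from the CSV writer means the export can be checked with stand-in objects, without contacting the platform.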

3.2 Preprocessing

The preprocessing steps are:

  1. Lower tweets: The text is converted to lowercase.

  2. Remove the URLs: Links starting with “http”, “https”, or “www” are replaced by an empty string.

  3. Remove mentions, retweets, and hashtags: Words starting with “@”, “#”, or “RT” are removed.

  4. Remove symbols: Emoticons, symbols and pictographs, transport and map symbols, flags, characters from other languages, and dingbats are removed.

  5. Remove non-alphanumeric characters: All characters except digits and letters are replaced with a space.

  6. Reduce repeated letters: Three or more consecutive identical letters are replaced by two letters (e.g., “Cooool” becomes “Cool”).

  7. Remove punctuation: Punctuation marks are removed from the sentence, since they do not affect its meaning.

  8. Remove stopwords: Stopwords are removed, since they do not add much meaning to a sentence.
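
The steps above can be sketched with regular expressions as follows. This is a minimal illustration, not the original implementation: the function name `preprocess`, the regular expressions, and the very small stopword list are assumptions, since the source does not state which patterns or stopword list were used.

```python
import re

# Tiny illustrative stopword list (assumption; the source does not name one).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it", "so", "this"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # step 2
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")      # step 3: @user, #topic
RETWEET_RE = re.compile(r"\brt\b")               # step 3: standalone "rt"
NON_ALNUM_RE = re.compile(r"[^a-z0-9]+")         # steps 4, 5 and 7
REPEAT_RE = re.compile(r"([a-z])\1{2,}")         # step 6


def preprocess(tweet):
    """Apply the eight cleaning steps to one tweet and return the cleaned text."""
    text = tweet.lower()                      # 1. lowercase
    text = URL_RE.sub("", text)               # 2. drop URLs
    text = MENTION_HASHTAG_RE.sub("", text)   # 3. drop mentions and hashtags
    text = RETWEET_RE.sub("", text)           # 3. drop the retweet marker
    # 4, 5, 7. Emoji, symbols and punctuation are all non-alphanumeric,
    # so a single substitution replaces them with spaces.
    text = NON_ALNUM_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)       # 6. "cooool" -> "cool"
    # 8. drop stopwords; split() also collapses repeated spaces
    return " ".join(t for t in text.split() if t not in STOPWORDS)


print(preprocess("RT @user: The vaccine is Cooool!!! https://t.co/abc #covid19"))
# prints: vaccine cool
```

URLs, mentions, and hashtags are removed before the non-alphanumeric substitution, because that substitution would otherwise strip the “@”, “#”, and “://” markers the earlier patterns rely on.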

Tables 2 and 3 show examples of the data before and after preprocessing.

Table 2 Tweet examples
Table 3 Preprocessed tweet examples

3.3 Vectorization

In this step, the occurrences of every single word were counted to fill a word occurrence matrix with the words and their numbers of occurrences. These counts can be regarded as n-gram counts. An n-gram is a contiguous sequence of n items from a given sample of text. In our case n is equal to 1, which means that single words were counted rather than groups of words. After vectorization, we obtained one word occurrence matrix for each csv file.
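
As an illustration of unigram counting, the following sketch builds a word occurrence matrix with the Python standard library only. The function names and the row-per-tweet layout are assumptions made for the example; the source does not specify how the matrix was stored or which vectorizer was used.

```python
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of the distinct unigrams (n = 1) across all documents."""
    return sorted({word for doc in docs for word in doc.split()})


def count_matrix(docs, vocab):
    """One row per document, one column per vocabulary word, filled with counts."""
    index = {word: j for j, word in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word, n in Counter(doc.split()).items():
            row[index[word]] = n
        rows.append(row)
    return rows


def total_occurrences(docs):
    """Total number of occurrences of each word over the whole dataset."""
    return Counter(word for doc in docs for word in doc.split())


docs = ["vaccine good vaccine", "covid bad"]   # already preprocessed tweets
vocab = build_vocabulary(docs)                 # ['bad', 'covid', 'good', 'vaccine']
matrix = count_matrix(docs, vocab)             # [[0, 0, 1, 2], [1, 1, 0, 0]]
```

Summing the columns of such a matrix gives the per-word totals that a bar chart of the most common words would display.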

3.4 Sentiment Analysis

There are different types of sentiment analysis, such as polarity and subjectivity analysis, positivity and negativity analysis, and emotion detection. Our project uses positivity and negativity analysis, meaning that the result for every tweet is either positive or negative. To implement this, the Naive Bayes classifier from the TextBlob library in Python was used. This classifier wraps the method of the same name from the NLTK library in Python. It can classify text using a model pre-trained on movie reviews, or the coder can manually train the model with related data. We chose the second approach: we trained the model with our labeled tweet dataset, then tested it and measured its accuracy. Finally, the unlabeled data was given to the model to obtain the positivity and negativity values.
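
To make the train, test, and classify cycle concrete, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing from scratch. It is a simplified stand-in and not the TextBlob or NLTK implementation, whose feature extraction and probability estimates differ; the class name `TinyNaiveBayes` and the toy training tweets are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over unigrams with add-one (Laplace) smoothing."""

    def fit(self, labeled_docs):
        self.class_counts = Counter()            # documents per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocab = set()
        for text, label in labeled_docs:
            words = text.split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def _log_score(self, words, label):
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)  # log prior
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for word in words:
            if word in self.vocab:  # words never seen in training are ignored
                score += math.log((self.word_counts[label][word] + 1) / denom)
        return score

    def classify(self, text):
        words = text.split()
        return max(self.class_counts, key=lambda label: self._log_score(words, label))

    def accuracy(self, labeled_docs):
        """Fraction of test documents whose predicted label matches the true one."""
        correct = sum(self.classify(text) == label for text, label in labeled_docs)
        return correct / len(labeled_docs)


# Toy labeled data (hypothetical, for illustration only).
train = [
    ("vaccine works great", "pos"),
    ("feeling hopeful vaccine", "pos"),
    ("covid cases terrible", "neg"),
    ("lockdown awful covid", "neg"),
]
model = TinyNaiveBayes().fit(train)
```

After training, `model.classify(text)` labels an unlabeled tweet, and `model.accuracy(test_set)` returns the share of held-out labeled tweets that were classified correctly, which is the kind of figure reported as the accuracy of the algorithm.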

3.5 Visualization

The results were all numeric, but they become more meaningful when they are visualized well. Therefore, the Matplotlib library of Python was used to draw bar charts, plots, and pie charts. Word clouds were also generated to give a more colorful view of the word occurrences.
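
A bar chart of the most common words and a pie chart of the sentiment shares could be produced along the following lines. This is a sketch under assumptions: the function names, axis labels, and output paths are hypothetical, and only standard Matplotlib calls (`subplots`, `bar`, `pie`, `savefig`) are used.

```python
from collections import Counter


def top_words(counts, n=10):
    """Return the n most frequent (word, count) pairs, most frequent first."""
    return Counter(counts).most_common(n)


def sentiment_shares(labels):
    """Return the fraction of tweets per sentiment label."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}


def plot_results(counts, labels, bar_path, pie_path, n=10):
    """Save a bar chart of the top words and a pie chart of sentiment shares."""
    # Imported lazily so the helpers above work without Matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to files, no display needed
    import matplotlib.pyplot as plt

    words, freqs = zip(*top_words(counts, n))
    fig, ax = plt.subplots()
    ax.bar(words, freqs)
    ax.set_xlabel("Word")
    ax.set_ylabel("Occurrences")
    ax.tick_params(axis="x", labelrotation=45)
    fig.savefig(bar_path, bbox_inches="tight")
    plt.close(fig)

    shares = sentiment_shares(labels)
    fig, ax = plt.subplots()
    ax.pie(list(shares.values()), labels=list(shares.keys()), autopct="%1.1f%%")
    fig.savefig(pie_path, bbox_inches="tight")
    plt.close(fig)


# Hypothetical usage:
# plot_results({"vaccine": 5, "covid": 3}, ["pos", "neg", "pos"],
#              "top_words.png", "sentiment.png")
```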

4 Results and Discussion

In this study, four different datasets were analyzed: two datasets from December 2020 about Vaccine (380,000 tweets) and Covid-19 (318,000 tweets), and two datasets from January 2021 about Vaccine (500,000 tweets) and Covid-19 (212,000 tweets). The accuracy of the sentiment analysis algorithm after training was 0.6.

This section presents the results of the visualization process together with a critical discussion of them. The bar charts and word clouds are the result of vectorization. The table shows the sentiment analysis result for each dataset.

For the datasets collected in December 2020, the occurrences of the most common words in the analyzed tweets related to “Vaccine” are shown in Fig. 2, and the occurrences of the most common words about Covid are displayed in Fig. 3. The same two results for the data collected in January 2021 are shown in Figs. 4 and 5, respectively. Comparing Figs. 2 and 3 with Figs. 4 and 5, respectively, shows that the number of occurrences of the common words decreased from December 2020 to January 2021.

This decrease may be attributed to various factors. December is mostly characterized as a lively month with a holiday season in which people organize many indoor and outdoor activities, travel, and so on. January, on the other hand, is considered a calm month in which people recover from the activities and travel of December. Thus, the drop in interest in Covid and the vaccine can be seen as normal. Further, in January people may have been less interested in discussing the pandemic after one year of suffering from its health, societal, and economic consequences, and more interested in returning to a normal lifestyle.

The most important words discussed during these two periods in “Vaccine” and “Covid” related tweets are reflected in the word clouds shown in Figs. 6, 7, 8 and 9. The sentiments for these two periods (December 2020 and January 2021) in “Vaccine” and “Covid” related tweets are shown in Figs. 10, 11, 12 and 13.

Fig. 2 Most frequent words in tweets about vaccine in December 2020

Fig. 3 Most frequent words in tweets about Covid in December 2020

Fig. 4 Most frequent words in tweets about vaccine in January 2021

Fig. 5 Most frequent words in tweets about Covid in January 2021

Fig. 6 Word cloud of tweets about vaccine in December 2020

Fig. 7 Word cloud of tweets about Covid in December 2020

Fig. 8 Word cloud of tweets about vaccine in January 2021

Fig. 9 Word cloud of tweets about Covid in January 2021

Fig. 10 Sentiment of tweets about Covid in December 2020

Fig. 11 Sentiment of tweets about vaccine in December 2020

Fig. 12 Sentiment of tweets about Covid in January 2021

Fig. 13 Sentiment of tweets about vaccine in January 2021

5 Conclusion

Several conclusions can be drawn from this study. First, the sentiment analysis algorithm reached an accuracy of 0.6. This accuracy could be improved with further preprocessing methods or with a better and more efficient training algorithm. In addition, the algorithm only distinguishes between positive and negative tweets. It could be extended with a better, more complex algorithm that also includes a neutral class. Going further, evaluation techniques with different degrees of sentiment could be used. All of these limitations affect our results. The figures show the most frequent words and how they changed between the two months. We also see that negativity is more common in the Covid tweets, whereas positivity is more common in the Vaccine tweets. However, this result is open to doubt, since the accuracy is 0.6 and the algorithm omits neutral tweets. These results should be taken into account in further developments and future work.