1 Introduction

Motion pictures such as movies and TV shows can be classified into categories called genres, and a single movie may belong to more than one genre. Grouping movies into broad genre categories is therefore a challenging classification task. Assigning genres helps both viewers and critics draw conclusions about a movie, and it clarifies the type of audience the movie is likely to attract. Trends in the genres that people prefer can also help directors and creators when making films or shows. In this work, a method for movie genre classification and sentiment analysis is proposed.

Our research uses data classification algorithms [1]. Classification is a supervised learning method in machine learning that maps an observation to one of a set of categories; in this task, it identifies the genre of a movie from its description. Well-known applications of classification include labeling an email as spam or ham and diagnosing a patient in a hospital on the basis of observed characteristics (certain symptoms, blood pressure, sex, etc.).

In our work, classification groups the descriptions of movies into six broad categories or genres: drama, horror, comedy, western, thriller and documentary. The aim is to create a system that can perform this classification reliably.

Multilabel classification techniques assign an instance to more than one class label. News articles are a common example, since a single article can belong to several topics at once.

The following classifiers are commonly used in multilabel classification.

One-versus-rest (OvR)

One-versus-rest (OvR), also called one-versus-all (OvA) [2], trains one binary classifier per class. Each base classifier produces a real-valued confidence score for its decision instead of a class label. The training algorithm for an OvA learner built from a binary classifier takes as input a learner L, samples X and labels y, where y_i ∈ {1, …, K} is the label of sample X_i, and outputs a list of classifiers f_k for k ∈ {1, 2, …, K}. To predict the label of an unseen sample x, we apply all K classifiers to x and choose the label k whose classifier gives the maximum confidence score:

$$\widehat{\mathrm{y}}=\underset{\mathrm{k}\in \{1,\dots ,\mathrm{K}\}}{\mathrm{argmax}}\;{\mathrm{f}}_{\mathrm{k}}\left(\mathrm{x}\right)$$
(1)
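Equation (1) can be illustrated with a minimal Python sketch. The linear scorers, their weights and the two-dimensional input are hypothetical stand-ins for trained binary classifiers; only the argmax rule itself comes from the text.

```python
from typing import Callable, Dict, Sequence

Scorer = Callable[[Sequence[float]], float]


def make_linear_scorer(weights: Sequence[float], bias: float) -> Scorer:
    """Build a toy binary scorer f_k(x) = w . x + b (hypothetical weights)."""
    def score(x: Sequence[float]) -> float:
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return score


def ovr_predict(scorers: Dict[str, Scorer], x: Sequence[float]) -> str:
    """Equation (1): return the class whose scorer is most confident on x."""
    return max(scorers, key=lambda k: scorers[k](x))


# One scorer per class; the numbers are made up for illustration.
scorers = {
    "comedy": make_linear_scorer([1.0, -0.5], 0.0),
    "horror": make_linear_scorer([-0.8, 1.2], 0.1),
    "drama": make_linear_scorer([0.2, 0.2], -0.1),
}

print(ovr_predict(scorers, [2.0, 0.5]))  # comedy scores 1.75, the maximum
```

For multilabel output, a common variant keeps every class whose score passes a threshold instead of taking a single maximum.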

OvR with Support Vector Machines

An SVM on its own supports only binary classification. To separate multiple classes, it is therefore combined with the OvR scheme, which trains one binary SVM per class; other multiclass extensions of SVM instead add further parameters and constraints to the optimization problem [3].

OvR with Logistic Regression

The logistic function is an S-shaped (sigmoid) curve that takes any real value and maps it to a value between 0 and 1. It uses Euler’s number (e), the base of the natural logarithm [4]. Logistic regression is a linear method: it computes a linear combination of the inputs and transforms it with the logistic function to predict a probability. The function can be written as

$$\frac{1}{1+{\mathrm{e}}^{-\mathrm{value}}}$$
(2)
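As a small numerical check of Eq. (2), the function can be evaluated directly. The name `logistic` is an illustrative choice, and the two-branch form is only a standard way of avoiding overflow for large negative inputs.

```python
import math


def logistic(value: float) -> float:
    """Eq. (2): map any real value into the open interval (0, 1)."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    # Algebraically identical form that avoids overflow when value << 0.
    e = math.exp(value)
    return e / (1.0 + e)


for v in (-5.0, 0.0, 5.0):
    print(v, round(logistic(v), 4))
```

The output at 0 is exactly 0.5, and the curve is symmetric in the sense that logistic(v) + logistic(−v) = 1.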

Binary Relevance with Gaussian NB

Binary relevance (BR) is widely regarded as the main baseline for multilabel classification. The BR method assumes that labels are independent. Hence, it trains one binary classifier per label, and each classifier declares its label relevant or irrelevant for an instance. According to several metrics, BR is not only effective in producing multilabel classifiers but is also computationally efficient. In our work, BR is applied together with Gaussian Naive Bayes as the base classifier. Naive Bayes is an efficient and popular classifier based on the maximum a posteriori (MAP) principle [5].
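The label decomposition behind binary relevance can be sketched as follows. The toy movies and the helper name `binary_relevance_targets` are hypothetical; a base classifier such as Gaussian NB would then be trained separately on each 0/1 target vector.

```python
from typing import Dict, List, Sequence, Set


def binary_relevance_targets(
    label_sets: Sequence[Set[str]], labels: Sequence[str]
) -> Dict[str, List[int]]:
    """Split a multilabel problem into one independent 0/1 target per label."""
    return {
        label: [1 if label in labels_of_item else 0 for labels_of_item in label_sets]
        for label in labels
    }


# Three toy movies with their genre sets (made-up data).
movies = [{"drama", "thriller"}, {"comedy"}, {"horror", "thriller"}]
labels = ["comedy", "drama", "horror", "thriller"]

targets = binary_relevance_targets(movies, labels)
print(targets["thriller"])  # relevance of "thriller" for each movie
```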

Label Powerset with Logistic Regression

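Label powerset transforms a multilabel problem into a multiclass problem by treating each distinct combination of labels in the training data as a single class [6]. In our work, logistic regression is used as the classifier for the transformed problem.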
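A minimal sketch of the label powerset transformation is shown below, assuming genre sets as input. The function name and the toy data are hypothetical; a multiclass classifier such as logistic regression would be trained on the resulting class ids.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def label_powerset_encode(
    label_sets: Sequence[Set[str]],
) -> Tuple[List[int], Dict[FrozenSet[str], int]]:
    """Map every distinct label combination to its own class id."""
    class_of: Dict[FrozenSet[str], int] = {}
    encoded: List[int] = []
    for labels_of_item in label_sets:
        key = frozenset(labels_of_item)
        if key not in class_of:
            class_of[key] = len(class_of)  # next unused class id
        encoded.append(class_of[key])
    return encoded, class_of


# Made-up genre sets; the first and third movie share one combination.
movies = [{"drama", "thriller"}, {"comedy"}, {"thriller", "drama"}, {"horror"}]
encoded, class_of = label_powerset_encode(movies)
print(encoded)
```

Because every combination becomes its own class, the number of classes can grow quickly when many distinct genre combinations occur.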

Sentiment analysis is an area of natural language processing (NLP) that identifies the sentiment expressed in a text. It is also known as opinion mining [7]. Sentiment analysis takes unstructured raw text and converts it into structured data that can be useful to brands, organizations, political campaigns, etc. In our work, we applied sentiment analysis to the tweets of Twitter users to find out which genre of movies/TV shows users like the most, and classified the tweets as positive, negative or neutral with the help of TextBlob. TextBlob is a Python library that offers a simple API for sentiment analysis and other standard NLP operations. It analyzes an English phrase and returns a score. Each entry in its lexicon has scores for polarity, subjectivity and intensity, each with its own range. The polarity indicates whether the sentiment of the text is positive, negative or neutral, which helps us understand what people think about movies of a particular genre. In this way, sentiment analysis gives a general public view, over the Twitter platform, of the most favored or trending genres; movies of those genres may be received well on screen, which can translate into good critic ratings or good business.

We applied TextBlob to the text of the tweets to determine their polarity [8]; TextBlob also reports the subjectivity of a text. The polarity returned by TextBlob ranges from −1 to 1. We labeled all tweets with polarity < 0 as −1, all tweets with polarity > 0 as +1, and the remaining tweets (polarity = 0) as neutral. We then counted the positive, negative and neutral tweets in the dataset and computed the percentage of positive and negative tweets using the formulas:

$$\mathrm{Percentage}\,{\rm of}\,{\rm positive}\,{\rm tweets}=\left(\frac{\mathrm{Number}\,{\rm of}\,{\rm positive}\,{\rm tweets}}{\mathrm{Total}\,{\rm number}\,{\rm of}\,{\rm tweets}}\right)\times 100$$
(3)

Similarly, for negative tweets,

$$\mathrm{Percentage}\,{\rm of}\,{\rm negative}\,{\rm tweets}=\left(\frac{\mathrm{Number}\,{\rm of}\,{\rm negative}\,{\rm tweets}}{\mathrm{Total}\,{\rm number}\,{\rm of}\,{\rm tweets}}\right)\times 100$$
(4)
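Equations (3) and (4) can be worked through on a handful of polarity scores. The scores below are made up for illustration and the function names are hypothetical; in the described pipeline the scores would come from TextBlob.

```python
from typing import Dict, Sequence


def polarity_to_label(polarity: float) -> int:
    """Map a polarity in [-1, 1] to +1 (positive), -1 (negative) or 0 (neutral)."""
    if polarity > 0:
        return 1
    if polarity < 0:
        return -1
    return 0


def sentiment_percentages(polarities: Sequence[float]) -> Dict[str, float]:
    """Apply Eqs. (3) and (4) to a non-empty list of polarity scores."""
    if not polarities:
        raise ValueError("at least one tweet is required")
    labels = [polarity_to_label(p) for p in polarities]
    total = len(labels)
    return {
        "positive": 100.0 * labels.count(1) / total,
        "negative": 100.0 * labels.count(-1) / total,
    }


# Four illustrative scores: two positive, one negative, one neutral.
result = sentiment_percentages([0.8, -0.4, 0.0, 0.1])
print(result)
```

Neutral tweets count toward the total, so the two percentages need not add up to 100.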

This research work focuses on analyzing the trends and sentiments of different movie genres. It covers the following points:

  • Hybrid algorithms are used to assign multiple genre labels to movies. These include a pipeline applying SVM with a one-versus-rest classifier, binary relevance with Gaussian NB, and classifier chains with logistic regression.

  • These algorithms are compared by measuring their accuracy.

  • For each genre, tweets are extracted using the Twitter API.

  • The Twitter data is mined to draw interpretations.

  • Sentiment analysis of the genres is performed on the extracted tweets. This gives insight into the viewpoints, evaluations and feelings of speakers/writers on the social media platform.

The paper is organized as follows: Sect. 2 gives the state of the art in this field. Section 3 describes the proposed work, which is divided into two subsections and includes flowcharts along with pseudo-code. Section 4 describes the dataset used in the work. Section 5 presents the results, followed by the conclusion in Sect. 6.

2 Related Works

Table 1 shows the summary of work done by different authors.

Table 1 Summary of contributions of different authors

3 Proposed Work

The proposed methodology is divided into two subsections, namely classification of movie genres and sentiment analysis of the genres.

3.1 Classification of Movie Genres

The flowchart in Fig. 1 shows the stepwise procedure performed in the classification of movie genres.

Fig. 1 Flowchart of classification of movie genres

In the first step, the data extracted from Kaggle is saved into a CSV file for further processing. We consider three attributes of the stored data: the movie name, the description and the movie genre. Of these, the description and genre are used in the classification system.

The second step, data pre-processing, is the most crucial one, because the quality of pre-processing directly influences the results and analysis. The first part of pre-processing is creating dummies: we build a dummy variable (column) for each categorical value of the genre, so that a numerical value representing each genre is stored against every description. Next, the text is cleaned. It is first tokenized, that is, segmented into clauses or words, and then unnecessary content is removed, including tags, punctuation, links, emails, phone numbers and other words that are not essential for categorizing the data and do not help the model. Stop words, the commonly used words listed in the NLTK stop word corpus, are unlikely to be useful for learning and are also removed. Stemming reduces the morphological variants of a word to their root word or common stem. For instance, likes, liked, likely and liking are reduced to like, which is their root word. Stemming thus helps to reduce redundancy.

The data is then split into training and testing sets. Next comes the TfidfVectorizer; TF-IDF stands for term frequency–inverse document frequency. Term frequency measures how often a word appears within a document, and inverse document frequency downscales words that appear in many documents across the corpus.

After pre-processing, the third step builds the system for multilabel classification. Since each movie description has multiple genres attached to it, we implemented hybrid algorithms to perform the classification. Hybrid algorithms are a way of combining models so as to bring together the strengths of each.
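The cleaning and TF-IDF steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline used in the experiments: the stop word list is a tiny hypothetical stand-in for the NLTK list, stemming is omitted, and the textbook weighting tf × log(N/df) is used, whereas scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its values differ.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Sequence

# Hypothetical, deliberately tiny stop word list (NLTK's list is far longer).
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "is", "to"}


def clean_description(text: str) -> List[str]:
    """Lowercase, strip tags, links, emails and non-letters, drop stop words."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                # HTML-like tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"\S+@\S+", " ", text)                # email addresses
    text = re.sub(r"[^a-z\s]", " ", text)               # digits, punctuation
    return [token for token in text.split() if token not in STOP_WORDS]


def tf_idf(docs: Sequence[Sequence[str]]) -> List[Dict[str, float]]:
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


descriptions = [
    "<b>The</b> ghost is in the house! Visit http://example.com",
    "A funny house",
    "The cowboy and the house",
]
docs = [clean_description(d) for d in descriptions]
weights = tf_idf(docs)
print(docs[0])
print(weights[0])
```

In this toy corpus the word "house" appears in every description, so its weight is zero, while a word unique to one description keeps a positive weight.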

3.2 Sentiment Analysis of Movie Genres

The second subsection is sentiment analysis of movie genres. The algorithm used for the sentiment analysis of the tweets using TextBlob is given as follows:

  1. Use Tweepy to extract tweets from Twitter.

  2. Using hashtags of different genres, tweets are extracted for those genres.

  3. Merge all the tweets extracted to prepare a dataset consisting of the username along with the text of the tweet.

  4. Pre-processing is done to clean all the non-letters in tweets.

  5. Analyze the tweets for sentiment analysis using TextBlob.

  6. The polarity of each tweet is then found out as +1, −1 or 0 and is added to the new dataset.

  7. The percentage of positive and negative reviews for each genre is then calculated (Fig. 2).

Fig. 2 Flowchart for sentiment analysis starting from extracting Twitter data to the polarity of the genre

The pseudo-code for the sentiment analysis finds the polarity of the sentiment for a given text. Once TextBlob has determined the polarity of each tweet, the numbers of positive and negative tweets are computed and printed as the result.
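A minimal Python sketch of steps 4 to 6 might look like the following. It is an illustration under stated assumptions rather than the original pseudo-code: `clean_tweet`, `label_tweets` and the `scorer` parameter are hypothetical names, tweet extraction with Tweepy is left out, and the only TextBlob call used is `TextBlob(text).sentiment.polarity`.

```python
import re
from typing import Callable, Iterable, List, Tuple


def clean_tweet(text: str) -> str:
    """Step 4: replace every non-letter with a space and collapse whitespace."""
    return " ".join(re.sub(r"[^A-Za-z\s]", " ", text).split())


def textblob_polarity(text: str) -> float:
    """Step 5: polarity in [-1, 1] from TextBlob (needs `pip install textblob`)."""
    from textblob import TextBlob  # imported lazily so the sketch loads without it
    return TextBlob(text).sentiment.polarity


def label_tweets(
    tweets: Iterable[str],
    scorer: Callable[[str], float] = textblob_polarity,
) -> List[Tuple[str, int]]:
    """Step 6: attach +1, -1 or 0 to each cleaned tweet."""
    rows = []
    for tweet in tweets:
        cleaned = clean_tweet(tweet)
        polarity = scorer(cleaned)
        label = 1 if polarity > 0 else -1 if polarity < 0 else 0
        rows.append((cleaned, label))
    return rows


if __name__ == "__main__":
    # Made-up tweets; real input would be the tweets collected per genre hashtag.
    for row in label_tweets(["I love #comedy movies!!", "That #horror film was awful"]):
        print(row)
```

Passing the scorer as a parameter keeps the labeling logic independent of the sentiment library, so a different scoring function can be substituted without changing the rest of the code.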

4 Dataset Used

For the implementation, we used the Movie Dataset from Kaggle [21]. It contains metadata for 45,000 movies listed in the Full MovieLens Dataset, together with 26 million ratings from users. The ratings are on a scale of 1–5 (Table 2).

Table 2 Dataset used

5 Results

The work was implemented in Python. Figure 3 shows the percentage of positive reviews tweeted by Twitter users for each genre. According to the graph, movies/shows of the comedy genre received the highest percentage of positive reviews, followed by western, thriller, action, documentary, horror and drama.

Fig. 3 Percentage of positive tweets per genre

Figure 4 shows the percentage of negative reviews tweeted by Twitter users for each genre. According to the graph, the movies/shows of the horror genre received the highest percentage of negative reviews followed by thriller, drama, western, action, comedy and documentary. As per the experiment results, the comedy genre is seen to have the highest percentage of positive tweets (66.04%), then western (56.4%), thriller (48.3%), action (47.92%), documentary (40.15%), horror (37%) and drama (23%). The horror genre was observed to have the highest percentage of negative reviews (21.2%) followed by thriller (18.5%), drama (12%), western (10.4%), action (10.05%), comedy (8.63%) and documentary (7.57%).

Fig. 4 Percentage of negative tweets per genre

Table 3 shows the accuracy, precision and recall corresponding to each of the classification algorithms applied to the dataset.

Table 3 Percentage accuracy, precision and recall of the classification algorithms
$$\mathrm{Accuracy}= \frac{\mathrm{tp}+\mathrm{tn}}{\mathrm{tp}+\mathrm{fp}+\mathrm{tn}+\mathrm{fn}}$$
(5)
$$\mathrm{Precision}= \frac{\mathrm{tp}}{\mathrm{tp}+\mathrm{fp}}$$
(6)
$$\mathrm{Recall}= \frac{\mathrm{tp}}{\mathrm{tp}+\mathrm{fn}}$$
(7)

where tp, tn, fn and fp denote the numbers of true positives, true negatives, false negatives and false positives, respectively. The terms positive and negative refer to the classifier’s prediction, and the terms true and false refer to whether that prediction agrees with the external judgment or observation.
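Equations (5) to (7) can be checked on a small worked example. The counts below are made up for illustration and are not taken from Table 3; returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
from typing import Tuple


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float, float]:
    """Return (accuracy, precision, recall) as defined in Eqs. (5)-(7)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Illustrative counts only: 40 true positives, 10 false positives,
# 45 true negatives and 5 false negatives.
accuracy, precision, recall = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```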

Figure 5 displays the accuracy of each algorithm applied. The highest accuracy is obtained by binary relevance + Gaussian NB, followed by one-versus-rest + SVC, label powerset + logistic regression and one-versus-rest + logistic regression.

Fig. 5 Comparison of algorithms

6 Conclusion

In this work, we classified movies into various genres and performed sentiment analysis of tweets using TextBlob. The Movie dataset was used for the experiments. We applied different supervised learning algorithms to the dataset and obtained the best accuracy with binary relevance + Gaussian NB (85.33%). The comedy genre was observed to have the highest percentage of positive tweets, whereas the horror genre was observed to have the highest percentage of negative tweets. The trend analysis suggests that people like comedy shows/movies more than any other genre and are generally more critical of horror movies/TV shows.