1 Introduction

Movies are characterized in several ways, including type (for instance, cinema, animation, documentary, flash), duration, background score, and emotional or suspense quotient. One of the most important traits of a movie is its genre. Prevalent genres include sports, horror, romance, thriller, crime and fantasy. A movie’s genre conveys its basic theme. A movie may belong to a single genre or, as is generally the case, combine multiple genres across its various parts. Genres are mostly decided manually by the director of the movie or by experts such as critics. It is imperative that a movie’s genre be correctly identified, as a large share of the audience decides whether to watch a movie on the basis of its genre, which reflects the probable content of the movie.

With machine learning, it is relatively easy to identify a movie’s genre using algorithms that train on past information and can then make reliable predictions. This automation has made the task faster and simpler; the major concern, of course, is how effectively the classifier performs in making predictions.

If a movie’s genre is known, it can be used in several applications. One of these is predicting a movie’s box office performance. The authors in [16] produced a genre-specific empirical analysis using basic and extended regression models to identify key factors that determine the success of a movie at the box office. They based their study on two genres, computer animation and comic-based films, and used a dataset spanning thirty years. They deduced that actors with exceptional popularity, award nominations and production budget play an important role in deciding a film’s success. They specifically highlighted the fact that genre preference affects the movie choices of consumers.

The most relevant application of movie genre is in designing movie recommendation systems that make recommendations to the audience in view of their past movie preferences. Several such algorithms have been proposed by academia. In [3], the authors propose a movie recommender system designed to solve the well-known cold start problem in collaborative filtering. The recommender system is based on category correlations, which include movie genres provided by directors and experts. They computed genre correlations in two ways: one over different numbers of movies and the other over movies across several decades. Their experiment on the GroupLens movie database showed that precise recommendations could be made using decade-based genre correlations. A movie recommendation algorithm built on genre correlations was proposed in [13]; it measures the correlation between genres (using ratings given by users) based on a genre probability and weight, and further classifies movies based on these correlations. This classification list is then recommended to the users. However, the algorithm faces the issue of data sparsity. A movie recommender system based on movie genre preference was proposed in [2] that utilizes a neuro-fuzzy decision tree (NFDT). User reviews for multiple genres and their star ratings are used for training.

Another movie recommender system was developed in [17] that uses genre similarity and preferred genres. Genre similarity was obtained using the Pearson correlation coefficient, and clusters were then obtained using the k-nearest method. A recommendation system built on user comments and reviews on YouTube was described in [9]. The authors used their Morphological Sentence Pattern (MSP) model to extract relevant aspects and expressions. These were then used to derive genre similarity based on tf-idf vectors and a genre score, which is proposed as a measure of correlation between a movie and the genres. After this, K-Nearest Neighbour and K-means clustering are used to group movies based on their genre similarity. Results are reported for a total of 100 movies, with 100 movie reviews and 2,000 YouTube comments each, and were found quite satisfactory. In the next section, we discuss several such algorithms proposed in this domain.

The major contribution of the paper is as follows:

This paper proposes a multi-label movie classification scheme built on the movie’s subtitles. Subtitles are a complete account of the movie’s content. We would like to clarify that by subtitles we do not mean the lines appended to the main title. For instance, for the movie “Holiday: A Soldier Is Never Off Duty”, “A Soldier Is Never Off Duty” is a subtitle of the main title “Holiday”, but this is not the subtitle we refer to. By subtitle, we mean the dialogues or sounds displayed at the bottom of each scene. Subtitle files are readily available and contain the movie’s dialogues, sounds, exclamations, a description of the background, etc. Subtitles are hence a representation of the complete movie script. The classification is multi-label, which means the algorithm predicts multiple genres for a movie. To our knowledge, the work is novel, as no scheme based on subtitles has been proposed yet. The proposed scheme can be useful in designing genre-based recommender systems for audiences and in predicting box office collections for a movie through text analysis of the readily available movie subtitles. The experimental results indicate promising performance of the proposed scheme.

The next section provides a review of some of the important schemes proposed in this area by several researchers. Section 3 describes the proposed classification scheme built on subtitles. Section 4 presents and discusses the results obtained by the classification model built from the proposed scheme. The last section concludes the paper.

2 Literature review

Several genre-based movie classification schemes have been proposed that can also be used to predict the genre of a movie and to use that knowledge to develop recommender systems. The features used for classification have been taken from sources such as the movie’s plot, images, trailers or visuals. Below, we discuss some well-known techniques designed on these features.

Several models have been derived from movie plots. An approach based on Wikipedia movie plots detects the fractions of each genre present in a movie [20]. A text mining technique was used to develop bags of words with frequencies 1, 5 and 15. A corpus for 20 genres was created and the results were produced by training on 540 movie plots, with the best results reported on the refined corpus with word frequency > 15. Topological data analysis was used as a tool to build a movie genre classification system [6]. Persistent homology was used to quantify topological features in movie plot summaries. Term frequency matrices were generated for the top words of a genre and barcodes for the same were created. Movie genres were identified by comparing one-dimensional holes/loops in each barcode. A Jaccard score of 54.8% was reported during performance evaluation on 250 movies and 4 genres.

Images can prove immensely useful for assigning labels in varied classification problems. In [25], the authors propose a CNN-based sketch-based image retrieval (SBIR) system to recommend images that are similar to a sketch. Similarly, [8] proposes an image classification scheme for plant species identification. Hence, the use of images is well established in recommendation systems.

Some approaches have been developed on movie poster images. A deep neural network based model was proposed in [4] that classifies movies into genres based on movie posters. A convolutional neural network was trained to extract a visual representation and then objects were detected in the posters. The approach was tested on 8191 images with 23 genres, and the classifier assigned probabilities for each genre, the thresholds of which were decided by a grid search scheme. In [23], movie posters were used to extract semantic features for movie genre classification. Twelve meaningful features were derived, including theme, layout, emotion and dominant color. After computing these values, classification was done using five multi-label algorithms, Multi-Label kNN (MLkNN), Binary Relevance, Classifier Chains (CC), RAndom k-labELsets (RAkELd) and Label Powerset (LP), combined with three classifiers: Multinomial Naïve Bayes (MNB), C-Support Vector Classification (SVC) and Random Forest Classifier (RF). Multinomial Naïve Bayes with Label Powerset gave a Jaccard score of 41.78% when tested with 18 genres on the MovieLens 100k dataset. An approach for movie genre classification utilizing low-level features such as color and edges from movie posters was proposed in [15]. The extracted features were used to train the classifiers distance ranking, Naïve Bayes and RAKEL. Results were obtained for 1500 posters and 6 movie genres, with an accuracy of 67% for at least one of two correctly detected labels.

Movie trailers, which possess both audio and visual components, are also significant in predicting movie genres. A multi-label movie genre classification approach built on movie trailers was proposed that used deep convolutional neural networks. The method, named Convolution-Through-Time for Multi-label Movie genre Classification (CTT-MMC), uses an ultra-deep neural network with residual connections and a convolutional layer to extract temporal information from image-based features [26]. A scheme derived from movie previews used audio-visual features of previews to classify movies into genres [19]. First, movies were classified into action and non-action based on visual disturbance and average shot length. Then, color, audio and cinematic attributes were used to further classify them into genres like comedy, horror and drama. Features such as light intensity, sudden changes in audio level and motion were used and tested against thresholds for classification.

A meta-heuristic optimization algorithm termed Self-Adaptive Harmony Search (SAHS) was used to extract relevant audio and visual features from movie trailers [12]. The extracted features were then fed to an SVM to assign a genre to the movie. The experiment was conducted on 223 movie trailers; a total of 277 features were determined, of which 25 were used for classification. An accuracy of 91.9% was reported and it was seen that audio features were more relevant than visual features. Scene categorization from movie trailers was used in [28] for movie genre classification. The trailer is fragmented into keyframes using shot boundary analysis. Then, the scene extractor and descriptor schemes GIST, CENTRIST and W-CENTRIST are used to extract features, which are then used to classify the movie genre using a nearest neighbor approach. The scheme was tested on 1239 movie trailers over 4 genres and the best accuracy reported was 74.7%. Probabilistic latent semantic analysis (PLSA) was used in [11] to classify movie genres using movie previews. Audio and visual features were derived from the previews and text was obtained from social tags. Three models were proposed: standard PLSA using only one feature from audio, video and text, double PLSA using two features, and triple PLSA using all three. Experimentation on 140 movie previews with their tags and 4 movie genres shows that the triple PLSA scheme performs best, and the authors also highlight the significance of the text feature in it.

A classification model based on movie scenes was proposed in [14] wherein the authors classify the scene into eight emotional categories. The features were derived using an affective audio-visual words (AAVWs) method that was built upon the tf-idf technique. For labeling the features to an emotion, the authors present a model named latent topic driving model (LTDM) as a combination of topic model based on latent Dirichlet allocation and an emotional model estimating sequence of emotions through past scenes. LTDM uses conditional probability for classification. Their results based on SAR highlight the good performance of the model.

A scheme for identification of starring characters in movie scenes was proposed in [10]. The technique termed DeepStar was designed to detect main characters in a scene by extracting clear faces from the scene, face clustering using robust deep features and selecting the starring characters by generating an occurrence matrix. This work is potentially useful for movie analysis for instance, movie summarization and indexing. Another approach significantly useful in movie summarization is proposed in [24]. The technique has been designed to be able to provide user preferred summarization. The tools deployed include an entropy-based shots segmentation, computation of temporal saliency of shots to improve detection of character faces and facial expression recognition using trained deep CNN model to classify into seven emotions. An attempt on similar lines includes the paper [27], wherein the authors propose a framework for emotion detection in a video using emotion recognition system, emotion attribution and emotion based summarization. The authors use an auxiliary emotional image dataset to improve the performance of the emotion recognition system.

A unique movie genre classification scheme that utilized the movie’s music score was proposed in [1]. The authors analyzed instrumental music using timbral and select rhythm features to classify the genre into Action, Drama, Romance and Horror. Support vector machines were used as the classifier. An investigation on music score of 98 movies showed best results for action while drama and romance were least distinguishable. The scheme could further be improved using more features. A song extractor was proposed in [7] that fragments the movie into musical and non-musical segments and produces the song part. Also the genre of the song is predicted based on audio and video sequences in the song. Three genres were identified: tragic, pop and romance. The classifier is built using SVM. On a dataset of 105 movies, an accuracy of 89.5% was reported.

Some schemes have been devised from a combination of different features. An interesting work has been presented in [18] that predicts the genre of a movie by deriving features from the movie’s synopsis and an image description. The approach uses a measure of thematic intensity created using the synopsis text and color and activity features from images. The method has been tested on 107 animated movies for estimating their drama content and reports a precision of 78%.

A genre classification scheme for Flash movies based on a Bayesian classifier was presented that classified the movie into six genres using 10 features, some of which are movie length, amount of user interaction, number of event sounds and embedded images/videos [5]. The performance of the approach was tested on 2000 Flash movies and an average accuracy of 72.4% was reported. The authors in [22] gave a movie classification scheme based on nine movie type indicators (MTI) such as Fun, Serious and Eye-Catching. The authors used principal component analysis to derive the relevant features from audience reviews and then deployed K-Means clustering to cluster the movies using these MTIs.

Most of the schemes mentioned above utilize the movie’s plot, images, trailers, videos or scenes, or a combination of these, to predict its genre. In this paper, we use the movie’s as yet unutilized subtitles to build our multi-label genre classification technique, described in the next section.

3 Proposed multi-label movie genre classification scheme

In this section, we propose the multi-label movie genre classification algorithm that is developed on the movie subtitles. The model broadly consists of five modules: Data collection and preprocessing, Feature extraction, Feature selection, Building the data set and Training.

In the following subsections, we provide details of the processes involved in each module.

3.1 Data collection and preprocessing

The movie subtitles can be considered a script-aligned description of the movie in terms of its dialogues and actual sequences. Sometimes, they also reflect the background score, character expressions and utterances. First, a set of movie subtitles is collected. These are .srt files that mainly contain two types of information for each scene:

  1. Some text representing the audio and visual content of the scene
  2. Time sequences

For instance, Fig. 1 shows the first three subtitle entries for the movie “Clash of the Titans”. Hence the data comprises words and time values. Before the data can be used for prediction, it needs to be preprocessed. The major filtering done in this module involves:

  1. Removing the time values as they do not play any role in prediction
  2. Converting the text to lower case
  3. Removal of stop words like “and”, “the” as they are irrelevant

Fig. 1: First three subtitles from “Clash of the Titans”
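The three filtering steps above can be illustrated with a minimal sketch. It assumes the common .srt layout (a numeric cue index, a timing line, then text); the stop-word list, the `preprocess_srt` name and the sample text are invented for illustration and are not taken from the paper.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the paper does not say which list was used.
STOP_WORDS = {"and", "the", "a", "an", "of", "to", "in", "is", "it"}

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")

def preprocess_srt(text):
    """Return lower-cased content words from raw .srt text."""
    words = []
    for line in text.splitlines():
        line = line.strip()
        # Step 1: skip blank lines, numeric cue indices and timing lines.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        # Step 2: lower-case, then keep alphabetic tokens only.
        for token in re.findall(r"[a-z']+", line.lower()):
            # Step 3: drop stop words.
            if token not in STOP_WORDS:
                words.append(token)
    return words

# Invented sample, not an excerpt from any real subtitle file.
sample = """1
00:00:01,000 --> 00:00:03,500
The team played the game

2
00:00:04,000 --> 00:00:06,000
and won the Cup!
"""
print(preprocess_srt(sample))  # ['team', 'played', 'game', 'won', 'cup']
```

Punctuation is dropped here as a side effect of the tokenizing pattern; the paper does not state how punctuation was handled.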

3.2 Feature extraction

After the data is preprocessed, relevant features for training the machine learning algorithm need to be identified. The features in this case are those words that can help decide the genre of the movie. For instance, for a movie belonging to the “sports” genre, some of the expected words in the subtitle would be “game”, “play”, “score”, “team”, “victory”, “lost” and many more. Our premise is that the frequency of occurrence of these words would be much higher in a movie belonging to the “sports” genre than in a movie belonging to the “romance” or “horror” category. Hence, our algorithm tries to extract these words to serve as our feature set. Below we describe this process in detail. Suppose there are M subtitle files. We build a dictionary that stores items of the form < wi,fi >, where wi denotes the ith word and fi denotes the combined frequency of the ith word in all the M subtitle files. Initially the dictionary is empty. It is built by the following steps:

  1. Each subtitle file is read and tokenized into words. Say, N words are found in the kth subtitle.
  2. For each word wi, i ∈ [1, N], its frequency of occurrence, fi, is computed.
  3. Now, wi is located in the dictionary. If the word wi is found, say at jth index in the dictionary, the corresponding frequency fj is obtained from the dictionary and updated to fi + fj. If wi is not found, then, a new entry < wi,fi > is made in the dictionary.
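The dictionary-building steps above can be sketched as follows. The function name `build_dictionary` and the toy token lists are illustrative assumptions; `collections.Counter` is used as the dictionary, which is one convenient choice rather than the paper's stated data structure.

```python
from collections import Counter

def build_dictionary(tokenized_files):
    """Merge per-file word counts into one dictionary of combined frequencies."""
    dictionary = Counter()  # initially empty
    for tokens in tokenized_files:           # step 1: one token list per subtitle file
        per_file = Counter(tokens)           # step 2: frequency f_i of each word in this file
        for word, f_i in per_file.items():   # step 3: update f_j to f_i + f_j, or create the entry
            dictionary[word] += f_i
    return dictionary

# Toy input: two already-tokenized subtitle files (invented words).
files = [["game", "team", "game"], ["team", "love"]]
freq = build_dictionary(files)
print(freq["game"], freq["team"], freq["love"])  # 2 2 1
```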

Now, the algorithm extracts the words that are most likely to play a role in deciding the movie’s genre. These words serve as the feature set for the machine learning algorithm. To extract these words, we choose a lower and an upper threshold for the word frequency, termed ThreshLow and ThreshUpp respectively. For a total of L words in the dictionary, a word wi, where 1 ≤ i ≤ L, is retained in the feature set if:

$$ Thresh_{Low} \leq f_{i} \leq Thresh_{Upp} \tag{1} $$

For this, we first rank the words in the dictionary in order of decreasing frequency and then extract the words with corresponding frequencies satisfying (1).
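A minimal sketch of this ranking and filtering step is given below. The function name and the toy frequencies are invented; the bounds 100 and 10000 correspond to one of the settings reported in Section 4.

```python
def select_by_threshold(dictionary, thresh_low, thresh_upp):
    """Rank words by decreasing frequency and keep those satisfying condition (1)."""
    ranked = sorted(dictionary.items(), key=lambda item: item[1], reverse=True)
    return [word for word, f in ranked if thresh_low <= f <= thresh_upp]

# Toy combined frequencies (invented values).
freq = {"you": 12000, "team": 640, "score": 150, "zeppelin": 3}
print(select_by_threshold(freq, 100, 10000))  # ['team', 'score']
```

The upper bound removes very common words that survive stop-word removal, and the lower bound removes rare words that appear in too few subtitles to be informative.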

In the present work, we have not derived a formal way to determine the values of ThreshLow and ThreshUpp. We intend to analyze this factor in depth in future work.

3.3 Feature selection

In this step, a feature selection algorithm is deployed to further extract more relevant features from the domain. Here we aim to choose those features in our data that contribute most to predict the target class. This can help improve the performance of the algorithm especially in case of high dimensional data sets. Feature selection can help reduce overfitting, improve model accuracy and also reduce training time.

In our proposed model, we have used the SelectKBest technique [21] for feature selection that extracts the most relevant features using the univariate statistical chi square test. The technique removes all the features except the K features that score the highest. A chi square test is performed on the sample to retrieve the best features. The function returns the scores obtained and the p-values.

We have tried using different values of K while conducting our performance analysis and some variation in results has been observed. This is an important parameter in our proposed model.
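To make the idea of chi-square based top-K selection concrete, here is a plain-Python sketch for word-count features against a single binary genre label. It is an illustration of the statistic only: it does not reproduce the library routine cited as [21], the function names are hypothetical, and scoring one genre at a time is an assumption, since the text does not say how the multi-label case was handled.

```python
def chi2_scores(X, y):
    """Chi-square score of each non-negative count feature against a binary label (0/1)."""
    n = len(X)
    class_prob = [y.count(0) / n, y.count(1) / n]
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)
        # Observed total count of feature j inside each class.
        observed = [sum(row[j] for row, label in zip(X, y) if label == c) for c in (0, 1)]
        # Expected count if feature j were independent of the label.
        expected = [p * total for p in class_prob]
        scores.append(sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0))
    return scores

def select_k_best(X, y, k):
    """Return the (sorted) column indices of the k highest-scoring features."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Toy data: 4 movies, 3 word-count features; y marks membership in one genre.
X = [[9, 1, 5], [8, 0, 5], [1, 7, 5], [0, 9, 5]]
y = [1, 1, 0, 0]
print(select_k_best(X, y, 2))  # [0, 1]
```

The third feature has the same count in every movie, so its score is zero and it is discarded first.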

3.4 Building the data set

In this step, the data set is created, with one record for each movie. The fields in the data set comprise the feature set built from words in the steps above and the classes (genres in this case). The record for each movie is represented in the data set as follows:

  • For every word wi in the feature set, if wi is present in the movie’s subtitle, the corresponding frequency fi of wi in the movie is recorded in the data set. If wi is not present in the movie, the corresponding feature value is set to “0”.

  • To record the classes the movie belongs to, a “1” is recorded in the column corresponding to each genre the movie belongs to, and a “0” is stored for every other genre. Hence, a binary vector with 1s for all genres that the movie belongs to and 0s for the others is recorded in the data set.

This process is followed for all the movies whose subtitles are available.
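One record of the data set can be sketched as below. The genre order follows the six genres used in Section 4; the function name, the feature words and the example movie are hypothetical.

```python
from collections import Counter

GENRES = ["Action", "Fantasy", "Horror", "Romance", "Sports", "War"]

def build_record(tokens, feature_words, movie_genres):
    """Return (feature vector, binary genre vector) for one movie."""
    counts = Counter(tokens)
    # Frequency of each feature word in this movie; a Counter yields 0 for absent words.
    features = [counts[w] for w in feature_words]
    # 1 for every genre the movie belongs to, 0 otherwise.
    labels = [1 if g in movie_genres else 0 for g in GENRES]
    return features, labels

# Invented example: a movie tagged Action and War.
tokens = ["soldier", "rescue", "soldier", "enemy"]
features, labels = build_record(tokens, ["soldier", "love", "enemy"], {"Action", "War"})
print(features)  # [2, 0, 1]
print(labels)    # [1, 0, 0, 0, 0, 1]
```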

3.5 Training

Once the data set is ready, the model is trained for building a classifier. A training set with a representation from all the genres is used to train the machine learning algorithm. The model can now be used for classification of a new movie into a genre through its subtitles.
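As an illustration of how a trained model might map a feature vector to a set of genres, the sketch below uses a k-nearest-neighbour vote per genre, since kNN is one of the models evaluated in Section 4. Voting independently for each genre and using Euclidean distance are assumptions made here; the paper does not specify how its classifiers produce multi-label output. The data and names are invented.

```python
import math

def knn_predict(train_X, train_Y, x, k):
    """Predict a binary genre vector for x by majority vote among its k nearest neighbours."""
    # Indices of the k training movies closest to x (Euclidean distance).
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    n_labels = len(train_Y[0])
    # A genre is predicted when more than half of the neighbours carry it.
    return [1 if 2 * sum(train_Y[i][g] for i in nearest) > k else 0 for g in range(n_labels)]

# Toy training set: two word-count features, two genres.
train_X = [[5, 0], [4, 1], [0, 5], [1, 4], [3, 3]]
train_Y = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]
print(knn_predict(train_X, train_Y, [4, 0], k=3))  # [1, 0]
```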

Algorithm 1 presents the pseudocode for the proposed classification model and Fig. 2 illustrates the processes involved in each module of the algorithm. A miniature application of the algorithm is shown on a small sample from the subtitles of the movie “The Last Rescue” that belongs to the genres: “Action, Drama and War”. The complete subtitles from this movie along with several other movies would together formulate the training data set used to build the classifier.

Fig. 2: Illustration of the proposed movie genre classification scheme

4 Results and performance analysis

The performance of the proposed algorithm was tested on English subtitles of 964 movies belonging to six genres: Action, Fantasy, Horror, Romance, Sports and War. Each movie belongs to one or more of the six genres considered. The subtitles were sourced from yifysubtitles.com, from which complete .srt files of real movies from across the world were collected. The distribution of movies across genres is as follows:

Action : 223, Fantasy : 223, Horror : 216, Romance : 227, Sports : 185 and War : 219

So, a nearly uniform representation from each genre has been taken. After data preprocessing, we experimented with three different values of ThreshLow, keeping ThreshUpp at 10000, to alter the number of words passed to the feature selection module. The values of ThreshLow were taken as 100, 500 and 1000. After this, feature selection using the Python SelectKBest technique (which uses chi-square to find the most relevant features for the label) was done to select the final set of features to be used for training.

For training, we used several machine learning algorithms to find the one that works well for all the genres considered. The algorithms used are logistic regression, support vector machine, naïve Bayes classifier, decision tree, neural network (multilayer perceptron with three hidden layers, each with 20 nodes) and k-nearest neighbor (kNN) with K = 10.

In order to prevent overfitting and to obtain a near uniform representation of each genre in the training set, the data set was randomly shuffled and k-fold cross validation with k = 10 was used while training each model.
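The shuffled 10-fold split can be sketched with the standard library alone. The function name, the fixed seed and the round-robin assignment of shuffled indices to folds are illustrative choices, not details given in the paper.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds after a random shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# With the 964 movies of this study, every fold holds 96 or 97 movies.
print(sorted({len(test) for _, test in k_fold_indices(964)}))  # [96, 97]
```

Each movie appears in exactly one test fold, so every model is evaluated once on every movie while never being tested on data it was trained on.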

Average classification results for the ten folds are presented in Tables 1 to 5, available in the Appendix section. Table 1 presents the results for ThreshLow = 100. 3223 features were extracted in this case and passed on to the feature selection module, from which the 2000 best features were selected. This case used the maximum number of features and can be considered the upper bound for the performance measure. The best results were obtained using the multi-layer neural network, with the average precision for all genres being 77.4% and the recall being 65.2%. The next best performance was achieved by logistic regression, with an average precision for all genres of 75.6%. In all cases, it was seen that the classifier could most reliably predict correct genres for the movies belonging to Sports and War. Good results were also obtained for “Horror” (except with kNN). The average precision was below 70% for the genres Romance (though 76% with kNN) and Action, which brought down the overall performance.
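For readers who want to reproduce figures of this kind, the sketch below computes precision and recall separately for each genre from binary label vectors and then takes an unweighted mean over genres. That the reported "average precision for all genres" is such an unweighted mean is an assumption; the function name and toy vectors are invented.

```python
def per_genre_precision_recall(y_true, y_pred):
    """Return a list of (precision, recall) pairs, one per genre column."""
    results = []
    for g in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[g] == 1 and p[g] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[g] == 0 and p[g] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[g] == 1 and p[g] == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        results.append((precision, recall))
    return results

# Toy example with 4 movies and 2 genres.
y_true = [[1, 0], [1, 1], [0, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [1, 1], [0, 0]]
results = per_genre_precision_recall(y_true, y_pred)
mean_precision = sum(p for p, _ in results) / len(results)
print(results[0], mean_precision)  # (0.5, 0.5) 0.75
```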

Tables 2 to 4 present the results for ThreshLow = 500. A total of 694 features were extracted in this case. With this threshold, we investigated in depth the behavior of the classifier on varying the final number of features. The number of features was taken as 50, 100, 200, 300 and 500. The best average precision, 76.9%, was achieved by kNN with 300 features. This can be attributed to the fact that the best 300 words from the 694 high-frequency ones would have been used for training. Also noteworthy is the 75.2% precision of kNN with merely 50 features.

The precision in all cases was lower than the values reported for ThreshLow = 100, the exception being kNN. On varying the size of the final feature set, no significant changes were seen in the precision values across models. The precision ranged from 0.688 to 0.704 for logistic regression, 0.593 to 0.603 for naïve Bayes, 0.633 to 0.649 for SVM, 0.527 to 0.533 for decision tree, 0.718 to 0.731 for neural network, and 0.73 to 0.752 for kNN.

It is noteworthy that the neural network and kNN could deliver a precision > 90% for the genres Sports and War in most cases. kNN also reported good results for Romance. Its performance degraded markedly in the Horror case.

In Table 5, the results for ThreshLow = 1000 are shown. The total number of features was 303. The training was done using 150 and 200 features. In this case, kNN gave the best average precision of 77.7% with 200 features (its best and the overall highest). Though kNN reported good results in this case, the performance of the other algorithms showed a minor drop. The neural network gave an average precision of 67.9%, followed by logistic regression at 64.2%, with 200 features.

Figure 3 presents the precision values obtained in the different cases for all six genres on applying the six machine learning models. The average precision for all genres using the different models is also depicted. The figure provides a comparative analysis of the performance of the different models when applied to the six genres. It is noteworthy that kNN performed the best for all genres except Horror and gave the highest average precision for all the genres. It could also give 100% precision for Sports with ThreshLow = 100 and 2000 features. Also, it can be seen that with only 200 features, kNN could provide a precision of nearly 80%. The results are hence quite promising, and we feel that they can be further improved by deriving a mathematical formulation to obtain the values of ThreshLow and ThreshUpp and by working on feature selection.

Fig. 3: Results for precision obtained by different models using the proposed algorithm

5 Conclusion and future work

This paper presents an algorithm for movie genre classification based on the movie’s subtitles. The algorithm retrieves the most relevant words that relate to a movie’s genre. The algorithm was run on 964 movies from six genres: Action, Fantasy, Horror, Romance, Sports and War. Experiments were conducted with six models and varying numbers of features selected through the Python SelectKBest module, which keeps the k highest-scoring features using the chi-square statistic computed between labels and features. 10-fold cross validation was used. An average precision in the range of 70 − 80% was obtained with neural network, logistic regression and kNN, the best being 77.7% for 200 features with kNN. Good classification results were obtained for Sports and War. For the other genres too, precision was in the range of 60 − 70% in most cases. As part of future work, we would like to devise a strategy for choosing the values of ThreshUpp and ThreshLow. We would also work to improve the algorithm to get better results for the other genres, and we envisage extending the analysis to genres not yet considered.