Abstract
The Indian movie industry is the largest in the world by number of movies produced per year, and it is also the most diverse. A recent study found that only a few of these movies achieve success. Revenue uncertainty has put immense pressure on the motion picture industry, and researchers and film producers continually feel the need for expert systems that predict a movie’s probability of success before production with reasonable accuracy. The diversity of the Indian movie industry makes the problem more difficult. Only a few researchers have worked on Indian films, and most of that work either targets prerelease forecasting or has low prediction accuracy. This study focuses on Indian movies and forecasts an upcoming film’s success as soon as the key participants (director, cast) have signed an agreement, which we consider the earliest possible forecast. We retrieved and used the last 30 years of Indian movie information, covering all of India’s regional movies. We judiciously chose some of the movies’ intrinsic features and introduced a set of novel derived features to increase forecasting accuracy. We propose a K-fold Hybrid Deep Ensemble learning Model (KHDEM), which combines Deep Learning models (DLM) and ensemble learning models; the final prediction is made using a Logistic Regression (LR) classifier. We implemented a binary classification model and achieved 96% accuracy, which outperforms all the benchmark models. Introducing our derived features improved the accuracy by 17.62%. This study highlights the potential of predictive and prescriptive data analytics in information systems to support industry decisions.
1 Introduction
The Indian motion picture industry has grown over the last few decades, and the expansion of the movie industry has been a worldwide phenomenon. The global box office is moving fast, reaching a billion dollars in 2020 (https://www.statista.com/topics/2140/film-industry-in-india/). India is not far behind in the global competition. Indian box office revenue is reaching 142 billion Indian rupees (1.91 billion USD) (https://www.statista.com/topics/2140/film-industry-in-india/). Furthermore, box office revenue is projected to reach 190 billion rupees (2.55 billion USD) by 2024 (https://www.statista.com/statistics/316810/box-office-revenue-india/#:~:text=The%20box%20office%20revenue%20for,rupees%20by%20the%20year%202024). According to the statistics, the number of cinema admissions in India went up to 2.02 billion (https://www.statista.com/topics/2140/film-industry-in-india/). The Indian motion picture industry is one of the most diversified movie industries. It includes several regional industries from different parts of India, with different cultures, languages, and economic barriers. Each regional industry has its local stars, regional songs, and its own set of viewers with social expectations. There are several well-known movie industries in India. Bollywood, which produces Hindi movies, is one of the most famous. Kannada cinema, also known as Chandanavana or Sandalwood, produces Kannada-language movies. The Telugu and Bengali cinema industries share the name Tollywood but come from entirely different parts of India. Similarly, Kollywood is famous for its Tamil cinema, and Mollywood is the name of the Malayalam cinema industry. In this way, the Indian motion picture industry currently comprises several regional industries.
The movie industry is enormous; investors and producers are investing several million rupees in it. Top-budget movies include 2.0, with a budget of 5.75 billion rupees (77.2 million USD); SAAHO, with a budget of 3.25 billion rupees (43.7 million USD); and Thugs of Hindostan, with 3.25 billion rupees (43.7 million USD) (https://www.statista.com/statistics/695635/bollywood-production-budget/). The list goes on. Investors put in money in the hope of earning a profit on the investment. Some Indian movie industries are financially powerful and produce high-budget commercial movies, while others make low-budget classic movies. According to the Film Federation of India, in 2019 the Indian movie industry produced a total of 1986 movies across all parts of India (http://www.filmfed.org/downloads/Language-wise-Region-2018-19-26062019.pdf). It delivers about 1500–2000 films in 20 languages every year, nearly double the number of films produced in Hollywood, making India one of the most dynamic film producers globally [42]. A report presented in September 2016 titled “Indywood – The Indian Film Industry” by auditing firm Deloitte predicted that the Indian film industry’s volume would grow at a CAGR of 11% to reach $3.1 billion by 2020 [13, 42]. Unfortunately, many films created in the motion picture industry do not do well at the box office or meet expectations. When a film does not do well, many producers, directors, actors, technicians, exhibitors, and others face difficulties.
A considerable amount of money is invested in every box-office film. However, a large number of movies fail to achieve success. Due to the movie sector’s uncertainty, researchers and film producers continually feel the need for expert systems that forecast a movie’s probability of success before production with reasonable accuracy. Notable research works [10, 18, 29] have been carried out to predict movie popularity. Previously, many works addressed post-production [18, 39] or post-release prediction [10, 29, 41]. Such late-stage forecasts are of no benefit to the decision-maker or investor, whose capital has already been spent on film production. Therefore, prediction at the pre-production or early production stage [3, 23] with significant accuracy is desirable to protect the investment. Unfortunately, most early-stage prediction works focused on Hollywood movies, and the accuracy of the previous works is not significantly good. Only a few researchers have worked on Indian films, mainly Bollywood movies [44], and most of them either targeted prerelease forecasting or achieved low prediction accuracy.
Our dataset contains movies in 15 different languages from different parts of India. We have considered only six basic movie features. In addition, we have used all the historical data and introduced three new types of features. We propose a robust machine learning model called the K-fold Hybrid Deep Ensemble Learning model (K-fold HDEM), which combines conventional machine learning (ML) models and deep learning (DL) models [46]. The proposed model is robust, provides better results in all conditions, and outperforms all the previous works.
The principal contributions of this study are as follows.
1) We have created an exclusive dataset of Indian movies. The dataset includes movies from several regional industries. The dataset is also publicly available for future research works (https://www.kaggle.com/sahusandipan/imdb-indian-movie-dataset).
2) We have proposed a novel K-fold HDEM model by combining conventional ML models and DL models.
3) Mining all the past data, we have introduced several new derived features, which positively enhance the performance.
The rest of this paper is arranged as follows. Section 2 briefly reviews state-of-the-art approaches to film forecasting. Sections 3 and 4 detail the proposed methodology: the data retrieval, the selected intrinsic movie features, the proposed set of novel features, the movie success criterion, and the machine learning model. Sections 5 and 6 present the experimental results together with a comparative study against other statistical models. Finally, Section 7 discusses the research contributions, their limitations, and further research directions.
2 Literature survey
The earliest efforts include the research by Litman [26], which examined movie attributes and their impact on box-office performance (i.e., critics’ scores, genre, cost of production, suppliers, theatre release date, and the award history of actors). The motion picture industry has kept growing since Litman’s study, and forecasting the progress and popularity of films has become an inspiring and growing research field; consequently, extensive studies have been published. Prag and Casavant [36] took a strong interest in characterizing the relationship between the success of a film and attributes such as the cost of marketing campaigns, MPAA ratings, and sequels. Movie hit prediction work can mainly be divided according to the timing of the prediction.
2.1 Timing of prediction
Based on the prediction time, movie success prediction works can be divided into two major groups. 1) Late prediction: many works make predictions just before the movie’s release or after the film’s official release [2, 6, 10, 18, 20, 25, 29, 41, 47, 50]. 2) Early prediction: very few works make predictions at the early stage of movie production [1, 3, 23, 31, 44]. A late forecast can access more features and therefore obtain a more accurate prediction. In early prediction, only a few movie features are available, which makes the task much more difficult, so prediction accuracy is comparatively low. Nevertheless, early-stage prediction is the most helpful to movie makers and investors. Moreover, to the best of our knowledge, not a single work considers the diversity of Indian movies. In our work, we address both of these issues.
2.2 Features selection for prediction
In movie success prediction, the selection and use of attributes is the most critical part of the model. Three types of features have been explored in previous works.
Viewer-based feature
Viewer-based features capture how potential audiences receive a film. As the number of viewers who feel positive about a movie increases, the likelihood of higher revenue increases. Audience response can be gathered from several types of media, such as Twitter [4], trailer remarks [2], blogs [21, 47], crowdsourced social media [11, 41], and movie reviews [29]. Similarly, the authors of [34] used YouTube and Twitter comments to predict potential audiences. Unfortunately, viewer-based features are available only for late prediction.
Release-based features
Release-based features mainly concern the time of release and the movie’s availability. Availability covers the parts of the country in which the film is released and the number of theatres in which it opens [23, 35, 39, 40, 45, 48]. The likelihood of profit is higher if a movie is shown in a larger number of theatres and covers most of the country. Many films are targeted for release at a particular time. For instance, festival releases and release periods and dates (spring, summer, etc.) are commonly used in the forecasting problem [21, 35, 40, 43]. Some works additionally tried to capture the competition at the time of release [21, 35], which could negatively influence popularity. Release-based features are partly available for early prediction.
Movie-based features
Movie-based attributes are directly related to a movie, including cast, director, composer, genre, language, and other features. Both late and early-stage prediction can access these features. In one recent work, the authors try to find the impact of movie names on movie success [5]. The most common characteristic used for cast members is star power, that is, whether the film casts star artists. The star power of actors has been measured by actor incomes [35], prior award nominations [10], actor rankings [40], and the actor’s number of Twitter followers [4]. Some past studies have investigated the director’s success and tried to relate it to the movie’s success [27]. Some earlier research has claimed that movies’ economic performance is not influenced by the presence of star directors [10], and that directors’ power is not as significant as actors’ for movie success [30]. Many researchers used derived features related to key-role performers in addition to intrinsic movie features [3, 6, 24, 44]. Beyond individual key-role members, films have also been investigated from a teamwork viewpoint, asking whether the individuals in a team can perform collectively and generate “team chemistry” [23, 29]. The authors of [12, 16, 28, 32] inspected the relationship between an actor’s star power and the performance of a film.
2.3 Determine movie success
One of the most critical parts of our problem is how we define the success of a film. No benchmark standard exists for determining the success of a movie. Some works have concentrated on box office revenue [2, 3, 4, 21, 31, 35, 40], while others have used the number of admissions [7, 29]. The underlying assumption is that revenue or the number of admissions is the measure of success, since people usually think that a movie that sells well at the box office is also successful. Some past works measured success as profitability, either as a numeric value of revenue [12] or as the return on investment (ROI) [17, 19, 23]. Several works classified movies into two classes (success or not) and treated the task as binary classification; others treated it as a multi-class classification problem and attempted to classify films into multiple discrete classes [35]. Predictions are also made on continuous numerical values of profit metrics [18, 23, 45], with the values of these metrics log-transformed in some works [12, 40, 48].
2.4 Use of machine learning models
Movie hit prediction has relied on machine learning models, since these learning techniques have produced forecast models with moderate accuracy [15, 19, 39]. For example, [22, 38, 39] applied machine learning models such as discriminant analysis, decision trees (DT), logistic regression (LR), and neural networks (MLP) and examined their performance in forecasting a movie’s success. The authors of [8, 49] introduced a multi-layer backpropagation network that improves on the neural network model proposed in [39]. The authors of [33] collected movie data from online websites like IMDb and Rotten Tomatoes and implemented machine learning approaches like linear and logistic regression and support vector machines (SVM). The authors of [25, 37] proposed the Cinema Ensemble Model (CEM) to enhance prediction precision; it is composed of seven machine learning models and concentrates on attribute selection. The studies [3, 9] proposed some new features to forecast movie box-office success and used a voting system that predicts by averaging the output from different machine learning classifiers.
3 Material and methods
Our research problem is to predict whether an upcoming movie will be a hit or not. Since we predict the class of a movie, our problem is a classification problem under supervised learning. Our objective is to classify movie popularity at the early stage of movie production. Classification at the early production stage poses challenges in selecting movie attributes, because only some specific attributes are finalized at that stage.
3.1 Framework
In this study, we first extracted the raw movie data from the IMDb website by web scraping. We then cleaned and preprocessed the stored data. Next, we incorporated some novel derived features. Finally, all the data with the feature set are fed into our proposed K-fold HDEM model for movie classification.
The framework for predicting movie popularity has four steps, which are listed below (Fig. 1).
1. Extract raw data from the IMDb website by scraping.
2. Data cleaning and preprocessing.
3. Select relevant features and incorporate derived features to quantify the impact of actors, directors, producers, and composers.
4. Use all relevant features to train the K-fold hybrid deep ensemble learning model (K-fold HDEM).
3.2 Data extraction and preprocessing
This study focuses strictly on Indian movies, and we have included movies from every part of India, covering all regional industries. In India, different regions have different languages and distinct cultures. The datasets used in our study were collected from the IMDb website. A Python script was used for scraping, with the Python library “Beautiful Soup” extracting movie information from the IMDb website. We considered 30 years of Indian movies, which amounts to almost 5000 movies. We use a search string in our Python script to reach each movie link in the IMDb web portal, which maintains an individual web page for each listed movie. After retrieving the URL of a movie web page, the script sends an HTTP request and parses the response using the Beautiful Soup package. The script then extracts all the relevant movie information from the returned page and stores it in a CSV file. Each row of the CSV file contains the attributes of a single movie, so the number of rows equals the number of movies retrieved. From each individual movie page, 19 different attributes have been extracted, some numerical and some nominal. The downloaded data cover all the different parts of India and span the years 1990 to 2019. A list of all directors, composers, producers, and cast members present in the dataset was then created. We used a Python script to visit each listed member’s official Facebook page and retrieve its number of likes.
3.3 Data cleaning and preprocessing
The data obtained from the IMDb website need to be cleaned, as the downloaded data are prone to noise. Some attribute values are missing because they were not present on the website. Some numerical values are also not in uniform units. For example, gross income is a numerical attribute that is sometimes given in US dollars and sometimes in Indian rupees; in that case, we have to convert all the values into a single unit.
3.3.1 Removing the noise
We downloaded information on almost 5000 movies from the IMDb database, each described by several features. For some movies, most attribute values are not present in the downloaded database. We sorted the records to identify the movies with the largest number of null attribute values and then deleted all records of such movies.
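The noise-removal step can be sketched as follows. This is a minimal illustration, not the authors’ script: the field names and the `max_missing` threshold are assumptions, since the text does not state the exact cut-off used to discard records.

```python
def count_missing(record):
    """Count attributes whose value is None or an empty string."""
    return sum(1 for value in record.values() if value is None or value == "")


def drop_sparse_records(records, max_missing):
    """Keep only records with at most `max_missing` missing attributes."""
    return [r for r in records if count_missing(r) <= max_missing]


# Hypothetical records; the real dataset has 19 attributes per movie.
movies = [
    {"title": "A", "budget": 1000, "runtime": 120, "language": "Hindi"},
    {"title": "B", "budget": None, "runtime": None, "language": ""},
    {"title": "C", "budget": None, "runtime": 95, "language": "Tamil"},
]

# max_missing=1 is an assumed threshold chosen only for this example.
cleaned = drop_sparse_records(movies, max_missing=1)
print([m["title"] for m in cleaned])
```

Records that survive this filter can still contain isolated null values, which are handled separately by imputation.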
3.3.2 Data transformation
Some numerical parameters like “budget” and “gross income” contain values in both Indian rupees and dollars. All attributes that contain multi-unit data are converted into a single unit. Some nominal features in our dataset are also multivalued, and we convert them into single-valued features. Our database comprises two types of attributes, nominal and numerical, and all the nominal attributes (e.g., genre, language) are converted into numerical values.
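A minimal sketch of the unit conversion, assuming each monetary value is stored together with a currency code. The exchange rate below is an illustrative assumption, not a value taken from the study, and the choice of USD as the common unit is likewise arbitrary.

```python
# Assumed exchange rate, for illustration only.
INR_PER_USD = 74.5


def to_usd(amount, currency):
    """Convert an amount in INR or USD to USD."""
    if currency == "USD":
        return float(amount)
    if currency == "INR":
        return amount / INR_PER_USD
    raise ValueError(f"unsupported currency: {currency}")


print(to_usd(100, "USD"))
print(to_usd(745, "INR"))
```

Raising an error on an unknown currency code, rather than passing the value through, keeps mixed units from silently entering the dataset.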
3.3.3 Missing attribute values
After removing the movies that had the largest number of null values, some movies remain in our dataset with null values in some numerical attributes. At first, all these null values are converted to zeros; in the next step of the data preprocessing, we replace all the zero values with the median value.
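The two-step imputation can be sketched as below. One detail is an assumption: the median here is computed over the observed non-zero values only, because the text does not say whether the placeholder zeros are included when the median is taken.

```python
from statistics import median


def impute_with_median(values):
    """Replace missing values by the median of the observed values."""
    # Step 1: convert nulls to zero, as described in the text.
    filled = [0 if v is None else v for v in values]
    # Step 2: replace zeros by the median of the non-zero values (assumption).
    observed = [v for v in filled if v != 0]
    if not observed:
        return filled
    med = median(observed)
    return [med if v == 0 else v for v in filled]


budgets = [120, None, 80, 0, 200]
print(impute_with_median(budgets))
```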
3.4 Feature selection
Our problem is to predict a movie’s success at the early production stage. Under this problem statement, we cannot use attributes that are generally determined late in the production phase. Only a few attributes are usually confirmed before or at the early stage of production, so our selection of attributes has to be made carefully. This selection process is of great importance in our research problem. The limited access to movie-related information makes our movie hit prediction task both difficult and unique. After elimination, we are left with only four numerical and eight nominal features to make the actual prediction. The selected features are shown in Table 1.
All the selected features are of a general type, and it is difficult to infer quantitative measurements about a movie from them, which makes our task more challenging. Some of the features in the list above are nominal (categorical), and we have converted them into numerical values. Features like “release_month” have 12 different possible values. Features having more than two possible values are converted into n binary features, where n is the number of possible values.
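The n-binary conversion corresponds to what is usually called one-hot encoding. A minimal sketch for “release_month”, assuming months are stored as integers 1 to 12:

```python
def one_hot(value, categories):
    """Encode a single-valued categorical feature as n binary features."""
    return [1 if value == c else 0 for c in categories]


MONTHS = list(range(1, 13))  # assumed representation of release_month

# A March release sets only the third of the 12 binary features.
print(one_hot(3, MONTHS))
```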
3.4.1 Feature description
Genre
Genre is one of the most commonly used attributes in predicting movie success. In our study, we used twenty different genre categories. However, not all categories are significant, and most movies belong to more than one category. The eight genres most commonly used in Indian movies are action, drama, comedy, crime, romance, family, horror, and thriller. Genre is a multivalued attribute. Therefore, we convert the attribute into n binary features.
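Because a movie can belong to several genres at once, the binary encoding of genre can set more than one feature per movie. The sketch below uses the eight common genres named above as the vocabulary; the `|` separator and the lower-casing are assumptions about how the raw field is stored.

```python
GENRES = ["action", "drama", "comedy", "crime",
          "romance", "family", "horror", "thriller"]


def multi_hot(genre_field, vocabulary=GENRES, sep="|"):
    """Encode a multivalued genre string as one binary feature per genre."""
    present = {g.strip().lower() for g in genre_field.split(sep) if g.strip()}
    return [1 if g in present else 0 for g in vocabulary]


print(multi_hot("Action|Crime|Thriller"))
```

Genres outside the vocabulary are ignored in this sketch; a full implementation would cover all twenty categories or add an explicit “other” feature.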
Figure 2 shows the box plot of the relationship between the IMDb rating and genre category. Here we have taken only some of the major categories and kept all the remaining categories under the name ‘other’. The plot shows that the relationship is not strong, since some anomalies are present in the genre data.
Some genres, like horror and fantasy, have lower average IMDb ratings. On the other hand, movies from genres like crime, adventure, family, and mystery generally receive an excellent response from the audience, so their average rating is higher than that of the others. Movies from the action, comedy, drama, and romance genres span a broad range of IMDb ratings. From the preceding analysis, it is clear that the movie genre affects IMDb ratings. For some genres the effect is prominent, while for others it has no notable effect on the movie rating.
Release month
The timing of the movie release and the genre of the movie are two significant factors. In Indian culture, different festivals are celebrated by different people. A producer may think that festival times are the most appropriate time to release a movie, although not all types of movies are suitable at such times. Release month is a multivalued attribute, and we have converted it into n binary features. Figure 3 shows the box plot of the relationship between the IMDb rating and release month. From Fig. 3, it is evident that release month does not directly affect the movie’s popularity.
Release month is not directly related to IMDb rating. However, it has been observed that the type of movie released is strongly influenced by the different cultural activities celebrated at different times of the year. We observed earlier that the IMDb rating and the movie genre are related, which establishes an indirect connection between release month and IMDb rating.
Movie language
Movie language is one of the essential features of Indian movies. Alongside the Hindi movie industry, there exist several regional movie industries. In particular, several popular movie industries are present in South India, each operating in its regional language, and the Bengali movie industry is quite popular in the eastern part of India. In our study, we have considered 15 different languages. Considering movies from different parts of India makes our task much more challenging. Movie language is a multivalued attribute, and we have converted it into n binary features. Movies are also often released in multiple languages. Therefore, we have made some modifications to prioritize the language of origin of a particular movie.
Figure 4 presents the box plot of the relationship between the IMDb rating and movie language. The figure shows that the median rating for some languages, such as Gujarati, Kannada, Malayalam, and Marathi, is generally high, whereas movies in the Oriya, English, and Hindi languages have low IMDb ratings. Movie language therefore plays a significant role in the IMDb rating.
Movie duration
Movie duration is one of the numerical attributes we have selected as crucial for this work. The runtime of the majority of movies falls between 90 and 180 minutes. In Fig. 5, we have plotted movie runtime versus IMDb rating. Figure 5 shows a clear relationship: as the movie duration becomes longer, the IMDb rating also increases.
Movie budget
The movie budget is one of the essential numerical attributes. It is strongly correlated with a movie’s box office revenue (Deniz and Hasbrouck, 2012 [14]; Litman, 1983 [26]). In our work, we have considered a long period, from 1990 to 2019. Over these 30 years, the movie budget has increased significantly in the Indian movie industry (Fig. 6). We also considered different regional movie industries, whose budgets are not comparable with one another. The budgets of Bollywood movies and South Indian films are comparatively higher than those of the other regional movies.
Figure 7 shows the relationship between the movie budget and IMDb rating. Higher-budget movies have a general tendency to receive a higher rating. However, the connection between the movie budget and IMDb rating is very faint.
Released year
In our work, we have considered Indian movies of the last three decades. Depending on the movie’s release year, the impact of other movie-related parameters may also vary. In Fig. 6, we have presented how the movie budget increases over the period.
3.5 Derived features
In the movie industry, the presence of successful star performers always makes a difference. Each individual plays an essential role in the movie. The hero, heroine, music composer, and director are the most influential performers. Popular and trending heroes, heroines, composers, and directors always have a positive influence on the movie’s success.
Facebook page likes
As described earlier, the number of likes on the official Facebook page of an actor or actress indicates their popularity on social media. Facebook promotion also has a significant impact on the success of the movie. In our work, we have introduced the number of Facebook likes as an attribute.
Popularity rank (PR)
As we have discussed, the popularity and acceptance of a star is a significant factor in whether a movie becomes a hit. We propose a way to compute the popularity rank of each of the three main actors, the director, the composer, and the producer. The popularity rank of each star performer \( P_k \) lies between 0 and 10. For each \( P_k \), we consider every movie \( M_j \) in which they performed. The popularity rank \( PR\left(P_k\right) \) of a performer \( P_k \) is calculated using formula (1). Consider a performer \( P_k \) who has performed in a set of movies \( {M}^k={\left\{{M}_j^k\right\}}_{j=1}^{N^k} \), where \( N^k \) is the number of movies in which \( P_k \) participated. The Popularity Factor (PF) and Year Coefficient (YC) are defined next.
Popularity factor(PF)
We consider the IMDb rating (R), gross income (\( G_i \)), global gross income (\( G_g \)), and the number of votes (V) as the parameters for estimating the popularity or success of a movie \( M_j \). The popularity factor \( PF\left(M_j\right) \) measures the success of a movie \( M_j \). We have defined PF as follows (Table 2):
We have considered four parameters. Based on our observations, we have set the priority sequence R, \( G_i \), \( G_g \), V. Rating (R) has the highest priority. The number of votes (V) has the lowest priority, since it only shows how many viewers posted a rating. The Popularity Factor (PF) values have been set according to the following rules.
1. Total number of conditions satisfied.
2. Priority of the condition.
Because of a few movies, the values of \( G_i \), \( G_g \), and V increase abnormally at the upper end, which pulls the mean upward. We therefore selected the median as the cut-off for each parameter, so that an equal number of samples lies on either side of the cut-off.
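The choice of the median over the mean can be illustrated with a short sketch. The field names and the sample values are hypothetical, and the strict comparison against the cut-off is an assumption; the actual PF values assigned to each combination of satisfied conditions come from Table 2 and are not reproduced here.

```python
from statistics import mean, median

# Hypothetical field names standing in for Gi, Gg and V.
PARAMS = ("gross_india", "gross_global", "votes")


def median_cutoffs(movies, params=PARAMS):
    """Median of each parameter across all movies, used as its cut-off."""
    return {p: median(m[p] for m in movies) for p in params}


def satisfied_conditions(movie, cutoffs):
    """Parameters whose value lies above the cut-off ('>' is an assumption)."""
    return [p for p, cut in cutoffs.items() if movie[p] > cut]


# Invented sample: the third movie is an outlier on every parameter.
movies = [
    {"gross_india": 10, "gross_global": 12, "votes": 100},
    {"gross_india": 50, "gross_global": 70, "votes": 657},
    {"gross_india": 900, "gross_global": 5000, "votes": 90000},
]

cutoffs = median_cutoffs(movies)
print(cutoffs)
print(mean(m["votes"] for m in movies))  # the mean is dominated by the outlier
print(satisfied_conditions(movies[2], cutoffs))
```

With a single outlier, the mean number of votes exceeds the value for two of the three movies, while the median still splits the sample evenly.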
Year coefficient(YC)
The year coefficient regulates the credit depending on how recent the movie is. The popularity points for an old hit movie and a recent hit movie are not the same. Therefore, more credit is given to a recent movie than to an old one (Table 3).
The year coefficient is necessary to prioritize a movie’s success according to how recent the movie is. For all movies released before 2010, the \( YC\left(M_j\right) \) value is 0.50, which means that such a movie contributes only half of its popularity factor \( PF\left(M_j\right) \). YC is a vital component, since a recent hit movie always gives more publicity to an actor, director, or composer than a hit movie produced 3 to 4 years earlier.
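A minimal sketch of how a year coefficient scales the popularity factor. Only the pre-2010 value of 0.50 is taken from the text; the tiers for later years are placeholders standing in for Table 3, and the function names are hypothetical.

```python
# (first year of tier, coefficient), newest tier first.
# HYPOTHETICAL values: they stand in for Table 3 and are not from the paper.
YEAR_TIERS = [(2016, 1.00), (2010, 0.75)]

PRE_2010_COEFFICIENT = 0.50  # stated in the text


def year_coefficient(year):
    """Return YC for a release year under the assumed tiers."""
    for first_year, coefficient in YEAR_TIERS:
        if year >= first_year:
            return coefficient
    return PRE_2010_COEFFICIENT


def weighted_credit(popularity_factor, year):
    """Credit contributed by one movie: PF scaled by YC."""
    return popularity_factor * year_coefficient(year)


# A pre-2010 movie contributes half of its popularity factor.
print(weighted_credit(8, 2005))
```

How the per-movie credits are aggregated into the popularity rank is given by formula (1), which this sketch does not attempt to reproduce.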
In our work, we determined the popularity rank of the three principal artists in the movie (pop_rank_act1, pop_rank_act2, pop_rank_act3), the director of the movie (pop_rank_dir), the composer of the film (pop_rank_com), and the producer of the movie (pop_rank_prod) (Tables 4 and 5). Figure 8 presents the relationship between the popularity rank of actor one (the main cast member) and the movie’s IMDb rating. Popularity rank has a patterned relationship with the movie rating: as the popularity rank increases, the movie’s rating also increases. Figure 9 displays the relationship between the director’s popularity rank and the IMDb rating of the movie. Here too, popularity rank has a steady relationship with the movie rating, and one can observe that the rating increases with the popularity rank.
Star experience
In the movie industry, the experience and popularity of star performers influence potential viewers. Every senior star actor, composer, or director has their own fan following, and the fame of a star boosts the demand for the movie. In our work, we have measured the experience of each key-role player. For each performer, we count the total number of movies in which they performed between 1990 and 2019. Star_Exp reflects how experienced the actor or actress is. Figure 10 demonstrates the relationship between the IMDb rating and the experience of the actor. It shows that the number of low-rated movies decreases as the experience of the actor increases. The film’s rating is highly unpredictable for actors who are newcomers or have little experience in the industry. For an experienced actor, however, the probability of producing a highly rated movie is significantly higher.
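The star experience count can be sketched with a simple tally over the dataset. The record layout, the role keys, and the performer names below are hypothetical; the same function can be called with the director, composer, or producer column.

```python
from collections import Counter


def star_experience(movies, role_keys=("actor_1", "actor_2", "actor_3"),
                    first_year=1990, last_year=2019):
    """Count, per performer, the movies they appear in within the period."""
    counts = Counter()
    for movie in movies:
        if not first_year <= movie["year"] <= last_year:
            continue
        # A set ensures a performer is counted once per movie.
        names = {movie.get(key) for key in role_keys if movie.get(key)}
        counts.update(names)
    return counts


# Hypothetical records with placeholder performer names.
movies = [
    {"year": 1995, "actor_1": "P1", "actor_2": "P2", "actor_3": None},
    {"year": 2004, "actor_1": "P1", "actor_2": "P3", "actor_3": "P2"},
    {"year": 2018, "actor_1": "P3", "actor_2": "P1", "actor_3": None},
    {"year": 1985, "actor_1": "P1", "actor_2": None, "actor_3": None},
]

experience = star_experience(movies)
print(experience["P1"], experience["P2"], experience["P3"])
```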
3.6 Movie data labelling (hit/flop)
The motion picture is a field of art, and its success is not always tied to the box office. We believe different parameters exist to measure the success of a film, and box office revenue cannot be the only one. In our work, we have considered motion pictures from different parts of India. The budget and the revenue of a movie vary with the movie industry it belongs to. Movie budgets and film income in the Bollywood industry or the Kannada cinema industry are generally higher than in other regional movie industries. Therefore, defining a revenue range [3] to classify movie success does not apply to Indian movies. Fixing a profit margin [23] would not solve the classification problem either, because the scale of the profit margin is lower for low-budget films and higher for high-budget films.
The IMDb rating is one of the significant criteria for movie success prediction, and it is universally accepted. Figure 11 presents the histogram of the IMDb ratings of Indian movies. It shows that the rating is close to a normal distribution, which contributes to the robustness of the model’s predictions. We have used the IMDb rating as one of the primary criteria to determine a movie’s success. In addition, the total gross income and the number of votes have been used for movie classification. Figure 12 presents the histogram of the votes received on the IMDb web portal. Figure 13 describes the consistent relationship between the IMDb rating and the total number of votes received by a movie. The figure shows that a popular movie, which receives a high number of votes, is also likely to have a higher IMDb rating.
We have judiciously chosen the IMDb rating as our primary criterion for movie classification. In binary classification, selecting a single point value as the criterion makes the system very strict, so we have used the total number of votes as a supporting parameter. The arithmetic median value (amv) of the total votes has been taken as the threshold, so that an equal number of samples lies on each side of it. In the given database, the amv of the total votes is 657 votes. Gross box office revenue could be another criterion for classifying a movie as a hit. However, in a data set covering 30 years of movies, a single cut-off value for box office revenue is not logical, because both movie budgets and box office revenues have increased over time. We have fixed the criteria for a successful movie in our work as shown in Fig. 14, and we classify each movie into the successful or the unsuccessful class accordingly.
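The following sketch illustrates how a rating criterion and a median vote threshold can be combined into a binary label. The exact rule is the one given in Fig. 14 and is not restated in the text, so the rating cut-off (`rating_cut`), the requirement that both conditions hold, and the sample values are all assumptions made for illustration.

```python
from statistics import median

def vote_threshold(votes):
    """Median of the vote counts: half of the samples lie on each side."""
    return median(votes)

def label_movie(rating, votes, rating_cut, vote_cut):
    """Return 1 (hit) or 0 (flop).

    ASSUMPTION: a movie is a hit only if BOTH the rating and the vote count
    reach their cut-offs; the actual rule is the one shown in Fig. 14.
    """
    return int(rating >= rating_cut and votes >= vote_cut)

# Hypothetical vote counts and a hypothetical rating cut-off of 6.5.
votes = [120, 400, 657, 900, 15000]
amv = vote_threshold(votes)
labels = [label_movie(r, v, rating_cut=6.5, vote_cut=amv)
          for r, v in [(7.2, 15000), (7.8, 120), (5.1, 900)]]
```

In this toy example, the highly rated movie with only 120 votes is not labelled a hit, which shows how the vote count acts as a supporting parameter next to the rating.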
3.7 Dataset description
We have n samples with feature vectors \( X={\left\{{x}_i\right\}}_{i=1}^n \) and class labels \( Y={\left\{{y}_i\right\}}_{i=1}^n \).
Our data set is represented as \( D=\left\{\left({x}_1,{y}_1\right),\left({x}_2,{y}_2\right),\dots ,\left({x}_n,{y}_n\right)\right\}={\left\{\left({x}_i,{y}_i\right)\right\}}_{i=1}^n \).
Each feature vector is represented as \( {x}_k=\left({x}_{k,1},{x}_{k,2},{x}_{k,3},\dots ,{x}_{k,d}\right) \)
Where d is the number of selected features of each sample; here d = 26. Of these features, only 6 are movie intrinsic features and 20 are derived features. The whole data set has been labelled according to the rule mentioned earlier. The total number of samples in the data set is n = 4043, of which 1688 are positive and 2355 are negative. The proposed model is learnt using these n samples. At the initial stage, the data set is split into a training set and a test set, D = Dtrain ∪ Dtest, with 80% of the data used for training and 20% for testing. The Dtrain data is fed into the proposed K-fold cross-validation model for training purposes. Dtrain is further divided into training and validation parts for the K-fold model: it is split in a validation-to-training ratio of 20:80, and 5-fold cross-validation training and validation sets are generated. Data splitting is done in an entirely random manner. All the supervised classifiers are trained using the 5-fold cross-validation data sets. Each model receives 2587 training samples and 647 validation samples (Table 6).
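The split sizes quoted above can be checked with a short sketch. The helper names, the rounding rule, and the way the remainder is spread over the folds are assumptions made for illustration; the original study may have used a library routine with slightly different rounding.

```python
import random

def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle range(n) and split it into train and test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * (1 - test_fraction))
    return indices[:n_train], indices[n_train:]

def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for each of the k folds."""
    base, extra = divmod(len(indices), k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

train_idx, test_idx = train_test_split_indices(4043)
fold_sizes = [(len(tr), len(va)) for tr, va in k_fold_indices(train_idx)]
```

With n = 4043, this gives 3234 training and 809 test samples. Four of the five folds have 2587 training and 647 validation samples, as reported; because 3234 is not divisible by 5, the last fold has 2588 and 646.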
4 Proposed work
4.1 K-fold hybrid deep ensemble model (KHDEM)
Our objective is to build a machine learning model that can accept all the movie-related features xi in our research problem. Some of these are movie intrinsic features mi, some are derived features related to the movie cast and other vital performers di, and some are social media features of the principal performers si. Our feature set is therefore xi = {mi, di, si}. Our job is to learn from the feature set and accurately predict the outcome for the feature set of an unknown new movie. To accurately predict an upcoming movie’s success using only pre-production data, we propose a K-fold Hybrid Deep Ensemble learning Model (KHDEM). The proposed model uses different types of classification models. We build an ensemble of deep learning models, such as CNN and ANN, together with conventional aggregate classifiers such as Random Forest (RF) (Gini, Entropy), ExtraTreesClassifier (ET) (Gini, Entropy), and Gradient Tree Boosting (GB).
Where EM = {CNN, ANN, KNN, RF_G, RF_E, ET_G, ET_E, GB}
Our data set contains 4043 movie records.
We converted all the nominal and categorical attributes related to the movie intrinsic features. After that, we use 72 parameter values for each movie:
According to our research problem, our model is a binary output model. At the output, it predicts hit or flop: yi ∈ {1, 0}.
In the beginning, we split our data set D into a training set and a test set: D = Dtrain ∪ Dtest
In our proposed model, we train each of the individual ensemble models using K-fold cross-validation.
Where \( {\hat{\mathrm{Y}}}_{val}^i \) is the predicted probability value of model i after K-fold cross-validation, with \( {\hat{\mathrm{Y}}}_{val}^{i,j} \) denoting the prediction of model i on fold j:
\( {\hat{\mathrm{Y}}}_{val}^i=\frac{1}{k}{\sum}_{j=1}^k{\hat{\mathrm{Y}}}_{val}^{i,j} \)
From each ensemble model EMi, we receive a prediction value for the training set Xtrain. These prediction values \( {\hat{\mathrm{Y}}}_{val}^i \) (i = 1, …, m) are used as input parameters, and Ytrain is used as the class label of each input.
We make the final classification using a traditional Logistic Regression (LR) model, which serves as our blending model. Initially, the LR model is trained on the data set generated from each ensemble model’s prediction values on Xval. The final prediction is then made on the data set formed from each ensemble model’s predictions on Xtest.
The data set for the LR model is: BD = BDtrain ∪ BDtest
Where BDtrain = (BXtrain, BYtrain) and BDtest = (BXtest, BYtest)
The prediction values of the m ensemble models, obtained through K-fold cross-validation, are used as the training data of the LR model.
Where \( {BX}_{train}={\hat{\mathrm{Y}}}_{val}=\left\{{\hat{\mathrm{Y}}}_{val}^1,{\hat{\mathrm{Y}}}_{val}^2,\dots ,{\hat{\mathrm{Y}}}_{val}^m\right\}={\left\{{\hat{\mathrm{Y}}}_{val}^i\right\}}_{i=1}^m \) and BYtrain = Ytrain,
and the final prediction is made on the data set \( {BX}_{test}={\cup}_{i=1}^m{\hat{\mathrm{Y}}}_{test}^i \) with BYtest = Ytest.
Final prediction: \( {\widehat{BY}}_{test}=\mathrm{LR}\left({BX}_{train},{BY}_{train},{BX}_{test}\right) \)
Figure 15 presents the structure of the proposed K-fold HDEM model. First, we cleaned the raw input data and selected the functional features. Next, we introduced our derived features. The final data set, with its corresponding class labels, is then divided into training and test sets. The training set is again divided into K subsets, each containing training data and validation data. Our proposed HDEM model is trained using these K subsets. Finally, the prediction is made on the test data using the LR model.
4.2 Proposed algorithm
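The procedure of Section 4.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors’ listing: it uses scikit-learn, synthetic data with 72 columns in place of the movie features, only the classical members of EM (the CNN and ANN are left out for brevity), out-of-fold predictions as BXtrain, and test predictions averaged over the k fold models as BXtest. The function name `stack_with_lr` and all hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

def stack_with_lr(X_train, y_train, X_test, factories, k=5, seed=0):
    """K-fold stacking: base-model probabilities feed an LR blender."""
    folds = KFold(n_splits=k, shuffle=True, random_state=seed)
    bx_train = np.zeros((len(X_train), len(factories)))  # out-of-fold predictions
    bx_test = np.zeros((len(X_test), len(factories)))    # fold-averaged test predictions
    for i, make_model in enumerate(factories):
        for fit_idx, val_idx in folds.split(X_train):
            model = make_model().fit(X_train[fit_idx], y_train[fit_idx])
            bx_train[val_idx, i] = model.predict_proba(X_train[val_idx])[:, 1]
            # ASSUMPTION: test predictions of the k fold models are averaged.
            bx_test[:, i] += model.predict_proba(X_test)[:, 1] / k
    blender = LogisticRegression().fit(bx_train, y_train)
    return blender.predict(bx_test), bx_train, bx_test

# Synthetic stand-in data: 72 columns to mirror the input width in the text.
X, y = make_classification(n_samples=400, n_features=72, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Classical members of EM only; hyperparameters are illustrative.
factories = [
    lambda: RandomForestClassifier(n_estimators=50, criterion="gini", random_state=0),
    lambda: RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0),
    lambda: ExtraTreesClassifier(n_estimators=50, criterion="gini", random_state=0),
    lambda: ExtraTreesClassifier(n_estimators=50, criterion="entropy", random_state=0),
    lambda: GradientBoostingClassifier(random_state=0),
    lambda: KNeighborsClassifier(),
]
y_pred, bx_train, bx_test = stack_with_lr(X_tr, y_tr, X_te, factories)
accuracy = float(np.mean(y_pred == y_te))
```

Each column of `bx_train` holds one base model’s validation-fold probabilities, so the blender never sees a prediction made on data that the base model was fitted on. That separation is the main reason for generating the blender’s training data through K-fold cross-validation.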
4.3 Proposed CNN model
The proposed CNN model is a 1D-CNN architecture. It consists of four convolutional layers and one dense layer. The structure of the 1D-CNN was selected experimentally by a trial-and-error approach. The first layer of the CNN takes a 72 × 1 input array comprising all features. In the first layer, we used 128 filters with a kernel size of 4, the ReLU activation function, a dropout of 0.1, and a max-pooling size of 2. The next layer keeps 128 filters, the same ReLU activation function, and a dropout of 0.1, and this layer is repeated three times. Then a flatten layer is used. The last layer is a dense layer of size two, corresponding to the 2 predicted classes. We experimented with multiple Keras optimizers, such as Adam, SGD, and RMSprop, and finally selected the RMSprop optimizer because it gave better accuracy. Figure 16 depicts the topology of the proposed 1D-CNN.
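To see how the 72 × 1 input shrinks through the network, the sequence length after each block can be traced by hand. The sketch below assumes no padding, a stride of 1, a max-pooling step of size 2 after every convolutional layer, and 128 filters with kernel size 4 in all four layers; none of these details beyond the first layer is stated explicitly in the text, and the helper names are hypothetical.

```python
def conv1d_len(length, kernel, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel) // stride + 1

def pool1d_len(length, pool):
    """Output length of non-overlapping max pooling (remainder dropped)."""
    return length // pool

def trace_cnn(input_len=72, blocks=4, kernel=4, pool=2):
    """Return the sequence length after each conv + pool block."""
    lengths = []
    length = input_len
    for _ in range(blocks):
        length = pool1d_len(conv1d_len(length, kernel), pool)
        lengths.append(length)
    return lengths

lengths = trace_cnn()
flat_features = lengths[-1] * 128  # 128 filters in the last conv layer
```

Under these assumptions the lengths are 34, 15, 6, and 1, so the flatten layer would pass 128 values to the final dense layer, and a fifth block of the same kind would not fit.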
4.4 Proposed ANN model
The proposed ANN model architecture consists of one input layer, five hidden layers, and one dense output layer. The structure of the ANN was selected experimentally by a trial-and-error approach. The input layer has 72 input neurons comprising all features. We adopted the ReLU activation function and a dropout of 0.5. The next hidden layer has 256 neurons with the same ReLU activation function and a dropout of 0.5, and this layer is repeated four times. Finally, the output layer has 2 predicted classes and uses the sigmoid activation function. We experimented with multiple Keras optimizers, such as Adam, SGD, and RMSprop, and finally selected the SGD optimizer because it gave better accuracy.
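The size of this network can be estimated with a short parameter count. The sketch assumes that all five hidden layers have 256 units (the width of the first hidden layer is not stated explicitly) and that every layer is fully connected with a bias term; the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def mlp_params(layer_sizes):
    """Total trainable parameters of a dense stack given its layer widths."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

# 72 inputs, five hidden layers of 256 units (assumed equal), 2 outputs.
layer_sizes = [72] + [256] * 5 + [2]
total = mlp_params(layer_sizes)
```

Under these assumptions the network has 282,370 trainable parameters, most of them in the four 256 × 256 connections between hidden layers. Dropout layers add no parameters.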
5 Experimental results and analysis
To classify movie success, we implemented a variety of individual machine learning models, including Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF), as well as Deep Learning (DL) models such as ANN and Convolutional Neural Network (CNN). After implementing all these conventional ML and DL models, we proposed our KHDEM model, which outperforms all of them. The performance of each model is measured using the following four metrics; higher values indicate better performance for the first three, whereas a lower loss is better: (1) Classification accuracy, which is the percentage of correctly predicted instances. (2) Precision, which is the number of instances correctly classified as successful divided by the total number of instances predicted as successful. (3) Recall, which is the number of instances correctly classified as successful divided by the total number of instances that are actually successful. (4) Loss, which measures the distance between the predicted output and the expected output.
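The four metrics can be written out directly for the binary case. The sketch below is illustrative only: the text does not name the loss function, so binary cross-entropy is used here as an assumption, and the labels and probabilities are made-up values.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 = successful."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if tp + fp else 0.0

def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if tp + fn else 0.0

def log_loss(y_true, probs, eps=1e-12):
    """Binary cross-entropy (ASSUMPTION: the paper does not name its loss)."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

# Made-up labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
y_pred = [int(p >= 0.5) for p in probs]
```

For these values there are two true positives, one false positive, one false negative, and two true negatives, so accuracy, precision, and recall all equal 2/3.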
5.1 Performance without derived features
First, we compared our proposed KHDEM model with four machine learning models without incorporating the new derived features. These include two Deep Learning (DL) models and two conventional machine learning (ML) models. The DL models comprise a multi-layer perceptron (ANN) and a CNN, and the traditional ML models are Random Forest (RF) and Logistic Regression (LR). Table 7 shows the performance of each model without the new derived features. The table makes clear that our proposed model performs better than all the other ML models.
5.2 Performance with derived features
Next, we introduced all our new derived features into the feature set and ran the previous four ML and DL models and the proposed KHDEM again. Table 8 presents the performance of each model with the new derived features. The introduction of the new derived features improves the performance of each model on all measuring criteria. Here, too, the proposed KHDEM model outperforms all the ML and DL models.
Table 9 compares the accuracy with and without the new derived features. For example, the proposed KHDEM model’s accuracy improved by 17.62%, and the CNN model’s accuracy increased by 18.82%. Similarly, the accuracy of the ANN, Random Forest (RF), and Logistic Regression (LR) models increased by 23.33%, 22.1%, and 20%, respectively. Table 9 makes clear how significant our newly introduced features are. Figure 17 presents the comparison of model accuracy with and without the new derived features.
5.3 Performance of KHDEM model
Our proposed KHDEM model is a hybrid combination of DL models with conventional simple ensemble models. Unlike a voting classifier or a plain ensemble learning model, we used Logistic Regression (LR) to make the final classification in our HDEM model. Initially, the LR model is trained on the data set generated from each ensemble model’s prediction values on Xval. The final prediction is made on the data set formed from each ensemble model’s predictions on Xtest. Table 10 shows the accuracy of each model in the K-fold HDEM, together with the final prediction accuracy of our proposed model. The final prediction output comes from the LR model. All the internal models report accuracy on Xval, whereas the final prediction is based on the overall test data Xtest. Figure 18 compares the performance of each internal model with the overall performance of the K-fold HDEM model.
5.4 Comparative analysis
In this section, we present a comparative analysis of our proposed model with four established models. 1) Ahmed et al. (2020) [3] used a voting system to make early-production-stage predictions. In that study, they proposed new features to increase prediction accuracy. They framed movie prediction as a problem with eight different classes and achieved 85% accuracy. 2) Abidi et al. (2020) [1] inspected each movie’s attributes and selected features relevant to the early stage of movie prediction. They implemented five different machine learning models with binary classification and achieved a maximum of 76.6% accuracy with the generalized linear model (GLM). 3) Lash and Zhao (2016) [23] proposed novel features and predicted movie success at an early stage of movie production. They evaluated machine learning models multiple times with different success criteria and achieved a maximum of 90.4% accuracy using binary classification with the Random Forest model. 4) Verma and Verma (2020) [44] proposed “music rating” as one of the most important features to predict movie success. They achieved 87.0% accuracy with the Random Forest model using binary classification. Our proposed model outperforms all of these models. Table 11 presents the performance of each discussed model. The proposed KHDEM model delivered 96% accuracy, a 5.5% improvement over the nearest model, that of Lash and Zhao (2016) [23]. Figure 19 presents the performance comparison of the HDEM model with the other models.
6 Discussion
Our article focuses exclusively on Indian movies, an industry that is a combination of several regional movie industries. As a result of this combination, the Indian movie industry produces the largest number of movies every year. Hollywood is the most famous and most significant movie industry in the world. Movies are a popular form of art, and almost every country has its own movie industry, many of which are famous worldwide. In terms of significance and popularity, we can name the industries of countries such as China, the United Kingdom, Japan, France, South Korea, and Germany.
In our work, we have used the IMDb dataset and considered its movie rating. Besides IMDb, multiple movie rating and information sites are available, such as Rotten Tomatoes, Douban, Metacritic, and many more. Rotten Tomatoes is one of the most popular American movie rating websites. It collects critics’ reviews from newspapers and news articles. Douban is a Chinese website dedicated to Chinese movies. Finally, Metacritic, created in 1999, is a relatively new website that also aggregates movie information. It uses a colour of green, yellow, or red to summarize the critics’ recommendations.
7 Conclusions and future works
A prediction made as soon as the cast is finalized would assist the investor in making a financial decision. In this work, we have proposed a K-fold HDEM model to predict movie success at an early stage of movie production. Our work is exclusively on Indian movies. We collected the last 30 years (1990–2019) of Indian movie information, including all the regional movies. The prepared data set is now publicly available (https://www.kaggle.com/sahusandipan/imdb-indian-movie-dataset). We analyzed the historical data in our data set and proposed several new features to improve performance. Because of the enormous diversity of the industry, determining success criteria is a challenging job. This work establishes success criteria that are best suited to Indian movies. We performed our study on more than 4000 movies, and our proposed machine learning model delivered 96% accuracy, which outperforms all the previous works.
Moreover, data mining on IMDb is challenging because a film has many features of various scopes. Only a few attributes can be used for early-stage prediction, so we need to judge which features are available at the beginning of movie production. In our work, we carefully picked only six movie attributes, which makes the prediction job much more challenging.
In future work, we plan to classify movie popularity into multiple classes. Furthermore, we plan to consider age-wise rating information and predict the popularity of an upcoming movie among viewers of different age groups.
References
Abidi SMR, Xu Y, Ni J, Wang X, Zhang W (2020) Popularity prediction of movies: from statistical modeling to machine learning techniques. Multimed Tools Appl 79(47):35583–35617
Ahmad IS, Bakar AA, Yaakub MR (2020) Movie revenue prediction based on purchase intention mining using YouTube trailer reviews. Inf Process Manag 57(5):102278
Ahmed U, Waqas H, Afzal MT (2020) Pre-production box-office success quotient forecasting. Soft Comput 24(9):6635–6653
Apala KR, Jose M, Motnam S, Chan CC, Liszka KJ, de Gregorio F (2013) Prediction of movies box office performance using social media. In: 2013 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM 2013). IEEE, pp 1209–1214
Bae G, Kim HJ (2019) The impact of movie titles on box office success. J Bus Res 103:100–109
Bai W, Zhang Y, Huang W, Zhou Y, Wu D, Liu G, Xiao L (2020) DeepFusion: predicting movie popularity via cross-platform feature fusion. Multimed Tools Appl 79:19289–19306
Baimbridge M (1997) Movie admissions and rental income: the case of James bond. Appl Econ Lett 4(1):57–61
Basha SM, Rajput DS (2018) A supervised aspect level sentiment model to predict overall sentiment on tweeter documents. Int J Metadata Semant Ontol 13(1):33–41
Basha SM, Rajput DS (2019) Aspects of deep learning: hyper-parameter tuning, regularization, and normalization. In: Intelligent systems. Apple Academic Press, pp 171–186
Boccardelli P, Brunetta F, Vicentini F (2008) What is critical to success in the movie industry? A study on key success factors in the Italian motion picture industry
Castillo A, Benitez J, Llorens J, Luo XR (2021) Social media-driven customer engagement and movie performance: theory and empirical evidence. Decis Support Syst 145:113516
De Vany A, Walls WD (1999) Uncertainty in the movie industry: does star power reduce the terror of the box office? J Cult Econ 23(4):285–318
Deloitte. (2016). A report on, Indywood the Indian film industry. https://www2.deloitte.com/content/dam/Deloitte/in/Documents/technology-media-telecommunications/in-tmt-indywood-film-festival-noexp.pdf. Accessed Nov 2018
Deniz B, Hasbrouck RB (2012) What determines box office success of a movie in the United States, vol 757. Proceedings for the Northeast Region Decision Sciences Institute, pp 447
Du J, Xu H, Huang X (2014) Box office prediction based on microblog. Expert Syst Appl 41(4):1680–1689
Elberse A (2005) The power of stars: creative talent and the success of entertainment products. Paper presented at the INFORMS Marketing Science Conference, Emory University. https://www.hbs.edu/faculty/Pages/item.aspx?num=19356
Elberse A (2007) The power of stars: do star actors drive the success of movies? J Mark 71(4):102–120
Eliashberg J, Jonker JJ, Sawhney MS, Wierenga B (2000) MOVIEMOD: An implementable decision-support system for prerelease market evaluation of motion pictures. Mark Sci 19(3):226–243
Eliashberg J, Hui SK, Zhang ZJ (2007) From story line to box office: a new approach for green-lighting movie scripts. Manag Sci 53(6):881–893
Galvão M, Henriques R (2018) Forecasting movie box office profitability. J Inf Syst Eng Manage 3(3):1–9
Gopinath S, Chintagunta PK, Venkataraman S (2013) Blogs, advertising, and local-market movie box office performance. Manag Sci 59(12):2635–2654
Kolisetty VV, Rajput DS (2020) A review on the significance of machine learning for data analysis in big data. Jordanian J Comput Inf Technol (JJCIT) 6(01):1
Lash MT, Zhao K (2016) Early predictions of movie success: the who, what, and when of profitability. J Manag Inf Syst 33(3):874–903
Latif MH, Afzal H (2016) Prediction of movies popularity using machine learning techniques. Int J Comput Sci Netw Secur (IJCSNS) 16(8):127
Lee K, Park J, Kim I, Choi Y (2018) Predicting movie success with machine learning techniques: ways to improve accuracy. Inf Syst Front 20(3):577–588
Litman BR (1983) Predicting success of theatrical movies: An empirical study. J Pop Cult 16(4):159–175
Lutter M (2014) Creative success and network embeddedness: explaining critical recognition of film directors in Hollywood, 1900–2010
Meenakshi K, Maragatham G, Agarwal N, Ghosh I (2018) A data mining technique for analyzing and predicting the success of movie. J Phys Conf Ser 1000(1):012100. IOP Publishing
Meiseberg B, Ehrmann T (2013) Diversity in teams and the success of cultural products. J Cult Econ 37(1):61–86
Meiseberg B, Ehrmann T, Dormann J (2008) We don’t need another hero—implications from network structure and resource commitment for movie performance. Schmalenbach Bus Rev 60(1):74–98
Mestyán M, Yasseri T, Kertész J (2013) Early prediction of movie box office success based on Wikipedia activity big data. PLoS One 8(8):e71226
Nelson RA, Glotfelty R (2012) Movie stars and box office revenues: an empirical analysis. J Cult Econ 36(2):141–166
Nithin VR, Pranav M, Sarath B, Lijiya A (2014) Predicting movie success based on IMDB data. Int J Data Mining Tech Appl 3:365–368
Oghina A, Breuss M, Tsagkias M, De Rijke M (2012) Predicting imdb movie ratings using social media. In: European conference on information retrieval. Springer, Berlin, Heidelberg, pp 503–507
Parimi R, Caragea D (2013) Pre-release box-office success prediction for motion pictures. In: International workshop on machine learning and data Mining in Pattern Recognition. Springer, Berlin, Heidelberg, pp 571–585
Prag J, Casavant J (1994) An empirical study of the determinants of revenues and marketing expenditures in the motion picture industry. J Cult Econ 18(3):217–235
Rajput DS, Thakur RS, Thakur GS (2012) Fuzzy association rule mining based frequent pattern extraction from uncertain data. In: 2012 world congress on information and communication technologies. IEEE, pp 709–714
Rajput DS, Reddy TSK, Raju DN (2020) Investigation on deep learning approach for big data: applications and challenges. In: Deep learning and neural networks: concepts, methodologies, tools, and applications. IGI global, pp 1016–1029
Sharda R, Delen D (2006) Predicting box-office success of motion pictures with neural networks. Expert Syst Appl 30(2):243–254
Simonoff JS, Sparrow IR (2000) Predicting movie grosses: winners and losers, blockbusters and sleepers. Chance 13(3):15–24
Singh J, Goyal G (2019) Anticipating movie success through crowdsourced social media videos. Comput Hum Behav 101:484–494
Taylor DG, Levin M (2014) Predicting mobile app usage for purchasing and information-sharing. Int J Retail Distrib Manag 42(8):759–774. https://doi.org/10.1108/IJRDM-11-2012-0108
Thirty BB (n.d.) The determinants of box office revenue: a case based study
Verma H, Verma G (2020) Prediction model for bollywood movie success: a comparative analysis of performance of supervised machine learning algorithms. Rev Socionetw Strateg 14(1):1–17
Walls WD (2005) Modeling movie success when ‘nobody knows anything’: conditional stable-distribution analysis of film returns. J Cult Econ 29(3):177–190
Wang H, Zhang H (2018, January) Movie genre preference prediction using machine learning for customer-based information. In: 2018 IEEE 8th annual computing and communication workshop and conference (CCWC). IEEE, pp 110–116
Yu X, Liu Y, Huang X, An A (2010) Mining online reviews for predicting sales performance: a case study in the movie domain. IEEE Trans Knowl Data Eng 24(4):720–734
Zaheer A, Soda G (2009) Network evolution: the origins of structural holes. Adm Sci Q 54(1):1–31
Zhang L, Luo J, Yang S (2009) Forecasting box office revenue of movies with BP neural network. Expert Syst Appl 36(3):6580–6587
Zhou Y, Zhang L, Yi Z (2019) Predicting movie box-office revenues using deep neural networks. Neural Comput & Applic 31(6):1855–1865
Funding
This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05–2020.11.
Ethics declarations
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Cite this article
Sahu, S., Kumar, R., Long, H.V. et al. Early-production stage prediction of movies success using K-fold hybrid deep ensemble learning model. Multimed Tools Appl 82, 4031–4061 (2023). https://doi.org/10.1007/s11042-022-13448-0
DOI: https://doi.org/10.1007/s11042-022-13448-0