1 Introduction

The Indian motion picture industry has grown considerably over the last few decades, in line with the worldwide expansion of the movie industry. The global box office is growing fast, reaching a billion dollars in 2020 (https://www.statista.com/topics/2140/film-industry-in-india/). India is not far behind in the global competition. Indian box office revenue stands at about 142 billion Indian rupees (1.91 billion USD) (https://www.statista.com/topics/2140/film-industry-in-india/). Furthermore, box office revenue is projected to reach 190 billion rupees (2.55 billion USD) by 2024 (https://www.statista.com/statistics/316810/box-office-revenue-india/#:~:text=The%20box%20office%20revenue%20for,rupees%20by%20the%20year%202024). According to the same statistics, the number of cinema admissions in India rose to 2.02 billion (https://www.statista.com/topics/2140/film-industry-in-india/). The Indian motion picture industry is one of the most diversified movie industries. It comprises several regional industries from different parts of India, with different cultures, languages, and economic barriers. Each regional industry has its local stars, regional songs, and its own set of viewers with their own social expectations. Bollywood, which produces Hindi movies, is the best known of these industries. Kannada cinema is also known as Chandanavana or Sandalwood. The Telugu and Bengali cinema industries share the name Tollywood but are based in entirely different parts of India. Similarly, Kollywood is known for Tamil cinema, and Mollywood is the name of the Malayalam cinema industry. Many such regional industries together make up the Indian motion picture industry.

The movie industry is enormous, and investors and producers put several million rupees into it. Top-budget movies include 2.0, with a budget of 5.75 billion rupees (77.2 million USD); SAAHO, with 3.25 billion rupees (43.7 million USD); and Thugs of Hindostan, with 3.25 billion rupees (43.7 million USD) (https://www.statista.com/statistics/695635/bollywood-production-budget/). Investors put in money in the hope of earning a profit on the investment. Some Indian movie industries are financially powerful and produce high-budget commercial movies, while others make low-budget classic movies. According to the Film Federation of India, the Indian movie industry produced a total of 1986 movies in 2019 across all parts of India (http://www.filmfed.org/downloads/Language-wise-Region-2018-19-26062019.pdf). It delivers about 1500–2000 films in 20 languages every year, nearly double the number of films produced in Hollywood, making India one of the most dynamic film producers globally [42]. A report titled “Indywood – The Indian Film Industry”, presented in September 2016 by the auditing firm Deloitte, predicted that the Indian film industry would grow at a CAGR of 11% to reach $3.1 billion by 2020 [13, 42]. Unfortunately, many films do not do well at the box office or meet expectations, which creates difficulties for the producers, directors, actors, technicians, exhibitors, and others involved.

A considerable investment goes into every box-office film, yet a large number of movies fail to achieve success. Because of this uncertainty, researchers and film producers see a need for expert systems that forecast a movie’s probability of success before its production with reasonable accuracy. Notable research works [10, 18, 29] have been carried out to predict movie popularity. Many earlier works addressed post-production [18, 39] or post-release prediction [10, 29, 41]. Such late-stage forecasts are of little benefit to the decision-maker or investor, whose capital has already been spent on film production. Therefore, prediction at the pre-production or early production stage [3, 23] with good accuracy is desirable to protect the investment. Unfortunately, most early-stage prediction works focused on Hollywood movies, and their accuracy is not high. Only a few researchers worked on Indian films, mainly Bollywood movies [44], and most of them targeted pre-release forecasting or achieved low prediction accuracy.

Our dataset contains movies in 15 different languages from different parts of India. We have considered only six basic movie features. In addition, we have used all the historical data to introduce three new types of features. We propose a machine learning model called the K-fold Hybrid Deep Ensemble Learning model (HDEM), which combines conventional machine learning (ML) models and deep learning (DL) models [46]. In our experiments, the proposed model is robust and outperforms the previous works we compare against.

The principal contributions of this study are as follows.

  1) We have created an exclusive data set of Indian movies. The dataset includes movies from several regional industries. The dataset is also publicly available for future research works (https://www.kaggle.com/sahusandipan/imdb-indian-movie-dataset).

  2) We have proposed a novel K-fold HDEM model by combining conventional ML models and DL models.

  3) Mining all the past data, we have introduced several new derived features, which positively enhance the performance.

The rest of this paper is arranged as follows. Section 2 briefly reviews state-of-the-art approaches to film success forecasting. Sections 3 and 4 detail the proposed methodology: the data retrieval, the movie intrinsic features, the proposed set of novel features, the movie hit criterion, and the machine learning model. Sections 5 and 6 present the experimental results together with a comparative study against other statistical models. Finally, Section 7 summarises the research contributions, their limitations, and further research directions.

2 Literature survey

The earliest efforts include the research by Litman [26], which examined a set of attributes (critics’ scores, genre, cost of production, suppliers, theatre release date, and the award history of actors) and their impact on box-office performance. The motion picture industry has kept growing since Litman’s study, and forecasting the success and popularity of films has become an active and growing research field; consequently, extensive studies have been published. Prag and Casavant [36] examined the relationship between the success of a film and attributes such as marketing costs, MPAA ratings, and sequels. Movie hit prediction work can mainly be divided according to the timing of the prediction.

2.1 Timing of prediction

Based on the prediction time, movie success prediction works can be divided into two major groups. 1) Late prediction: many works make predictions just before or after the film’s official release [2, 6, 10, 18, 20, 25, 29, 41, 47, 50]. 2) Early prediction: very few works make predictions at the early stage of movie production [1, 3, 23, 31, 44]. A late forecast can use more features and obtain a more accurate prediction. In early prediction, only a few movie features are available, which makes the task much more difficult and the prediction accuracy comparatively low, even though early-stage prediction is the most helpful to movie makers and investors. Moreover, to the best of our knowledge, no prior work considers the diversity of Indian movies. Our work addresses both of these issues.

2.2 Features selection for prediction

In movie success prediction, the selection and use of attributes is the most critical part of the model. Three types of features have been explored in previous works.

Viewer-based features

Viewer-based features capture the acceptance of a film by its potential audience. The more viewers hold a positive opinion of a movie, the more likely it is to earn higher revenues. Audience response can be collected from several types of media, such as Twitter [4], trailer comments [2], blogs [21, 47], crowdsourced social media [11, 41], and movie reviews [29]. Similarly, the authors of [34] used YouTube and Twitter comments to estimate the potential audience. Unfortunately, viewer-based features are available only for late prediction.

Release-based features

Release-based features mainly concern the time of release and the movie’s availability. Availability is related to the parts of the country in which the film is released and the number of theatres in which it opens [23, 35, 39, 40, 45, 48]. The likelihood of profit is higher if a movie is shown in more theatres and covers a larger part of the country. Many films are targeted for release at a particular time. For instance, festival releases and release periods and dates (spring, summer, etc.) are commonly used in the forecasting problem [21, 35, 40, 43]. Some works additionally tried to capture the competition at release time [21, 35], which could negatively influence popularity. Release-based features are partly available for early prediction.

Movie-based features

Movie-based attributes are directly related to a movie, including cast, director, composer, genre, language, and other features. Both late and early-stage prediction can access these features. In one recent work, the authors study the impact of movie names on movie success [5]. The most common characteristic used for cast members is star power, that is, whether the film casts star artists. The star power of actors has been measured by actor incomes [35], prior award nominations [10], actor rankings [40], and the actor’s number of Twitter followers [4]. Some past studies have investigated the director’s success and tried to relate it to the movie’s success [27]. Some earlier research has claimed that movies’ economic performance is not influenced by the presence of star directors [10], and that directors’ power is not as significant as actors’ for movie success [30]. Many researchers used derived features related to key-role performers in addition to intrinsic movie features [3, 6, 24, 44]. Beyond individual key-role members, films have also been investigated from a teamwork viewpoint: whether individuals in a team can perform collectively and generate “team chemistry” [23, 29]. The authors of [12, 16, 28, 32] inspected the relationship between the actor’s star power and the performance of a film.

2.3 Determine movie success

One of the most critical parts of our problem is how to define the success of a film. No benchmark standard exists for this. Some works have concentrated on box office revenue [2, 3, 4, 21, 31, 35, 40], while others have used the number of admissions [7, 29]. The underlying assumption is that a movie that sells well at the box office is also successful. Some past works measured success as profitability, either as a numeric value of revenue [12] or as the return on investment (ROI) [17, 19, 23]. Several works classified movies into two classes (success or not) as a binary classification problem; others treated the prediction as a multi-class classification problem and attempted to classify films into multiple discrete classes [35]. Predictions are also made on continuous numerical values of profit metrics [18, 23, 45], with these metrics log-transformed in some works [12, 40, 48].

2.4 Use of machine learning models

Movie hit prediction has relied on machine learning models, since these techniques have produced forecast models with moderate accuracy [15, 19, 39]. For example, [22, 38, 39] applied machine learning models such as discriminant analysis, decision trees (DT), logistic regression (LR), and neural networks (MLP) and examined their performance in forecasting a movie’s success. The authors of [8, 49] introduced multi-layer backpropagation to improve on the neural network model proposed by [39]. The authors of [33] collected movie data from online websites such as IMDb and Rotten Tomatoes and implemented machine learning approaches such as linear regression, logistic regression, and support vector machines (SVM). The authors of [25, 37] proposed the Cinema Ensemble Model (CEM), which is composed of seven machine learning models and concentrates on attribute selection, to enhance prediction precision. The studies [3, 9] proposed some new features to forecast movie box-office success and used a voting system that predicts by averaging the output of different machine learning classifiers.

3 Material and methods

Our research problem is to predict whether an upcoming movie will be a hit or not. Since we predict the class of a movie, our problem is a classification problem under supervised learning. Our objective is to classify movie popularity at the early stage of movie production. Classification at this stage constrains the choice of movie attributes, because only some specific attributes are finalized at the early production stage.

3.1 Framework

In this study, we first extract the raw movie data from the IMDb website by web scraping. We then clean and preprocess the stored data. Next, we incorporate some novel derived features. Finally, all the data with the feature set are fed into our proposed K-fold HDEM model for movie classification.

The framework for predicting movie popularity has four steps, which are listed below (Fig. 1).

  1. Extract raw data from the IMDb website by scraping.

  2. Data cleaning and preprocessing.

  3. Select relevant features and incorporate derived features to quantify the impact of actors, directors, producers, and composers.

  4. Use all relevant features to train the K-fold hybrid deep ensemble learning model (k-HDEM).

Fig. 1 Framework workflow

3.2 Data extraction and preprocessing

This study focuses strictly on Indian movies, and we have included movies from every part of India, covering all regional industries. In India, different regions have different languages and different cultures. The data sets used in our study were collected from the IMDb website. A Python script was used for scraping, with the Python library Beautiful Soup used to extract movie information from the IMDb pages. We considered 30 years of Indian movies, a list of almost 5000 movies. We use a search string in our Python script to reach each movie link in the IMDb web portal, which maintains an individual web page for each listed movie. After retrieving the URL of a movie web page, the script sends an HTTP request and parses the response using the Beautiful Soup package. The script extracts all the relevant movie information from the returned page and stores it in a CSV file. Each row of the CSV file contains the attributes of a single movie, so the number of rows equals the number of movies retrieved. From each individual movie page, 19 different attributes have been extracted, some numerical and some nominal. The downloaded data cover all the different parts of India and the time range from 1990 to 2019. A list of all directors, composers, producers, and cast members present in the dataset has been created. We have used a Python script to visit each listed member’s official Facebook page and retrieve its number of likes.

3.3 Data cleaning and preprocessing

The data obtained from the IMDb website need to be cleaned, as the downloaded data are prone to noise. Some attribute values are missing because they were not present on the website. Some numerical values are not in uniform units. For example, gross income is sometimes given in US dollars and sometimes in Indian rupees; in that case, we have to convert all the values into a single unit.

3.3.1 Removing the noise

We have downloaded information on almost 5000 movies from the IMDb database, each with several features. For some movies, most attribute values are missing in the downloaded data. We sorted the movies by their number of null attribute values and deleted the records with the most nulls.

3.3.2 Data transformation

Some numerical parameters, such as “budget” and “gross income”, have values in both Indian rupees and dollars. All attributes with multi-unit data are converted into a single unit. Some nominal features in our dataset are multivalued; these have to be converted into single-valued features. Our database comprises two types of attributes, nominal and numerical, and all the nominal attributes (e.g., genre, language) are converted into numerical values.
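The unit conversion described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the exchange rate, the record format, and the function name `to_inr` are assumptions made for the example, since the text does not state the rate that was applied.

```python
# Hypothetical exchange rate; the paper does not state the rate it used.
INR_PER_USD = 74.0


def to_inr(amount: float, currency: str, inr_per_usd: float = INR_PER_USD) -> float:
    """Convert an amount given in INR or USD to Indian rupees."""
    if currency == "INR":
        return amount
    if currency == "USD":
        return amount * inr_per_usd
    raise ValueError(f"unsupported currency: {currency}")


# Illustrative records, not taken from the dataset.
budgets = [(50_000_000.0, "INR"), (1_000_000.0, "USD")]
budgets_inr = [to_inr(amount, currency) for amount, currency in budgets]
```

Raising an error for any other currency code keeps unexpected units from silently entering the single-unit column.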

3.3.3 Missing attribute values

After removing the movies with the most null values, some movies remaining in our dataset still have null values in some numerical attributes. These null values are first converted to zeros; in the next preprocessing step, all the zero values are replaced with the median value.
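A minimal sketch of this two-step imputation is given below. It assumes that the median is taken over the observed (non-zero) values of the same attribute, which the text does not state explicitly; the function name and sample values are illustrative.

```python
from statistics import median


def impute_zeros_with_median(values):
    """Replace missing entries (None or 0) with the median of the observed values."""
    # Step 1: nulls are mapped to zero, as described in the text.
    zero_filled = [0 if v is None else v for v in values]
    # Assumption: the median is computed over the non-zero (observed) values only.
    observed = [v for v in zero_filled if v != 0]
    if not observed:
        return zero_filled
    fill = median(observed)
    # Step 2: every zero is replaced with the median.
    return [fill if v == 0 else v for v in zero_filled]


durations = [120, None, 150, 0, 135]  # illustrative runtimes in minutes
cleaned = impute_zeros_with_median(durations)
```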

3.4 Feature selection

Our problem is to predict a movie’s success at the early production stage. We therefore cannot use attributes that are determined late in the production phase; only a few attributes are usually confirmed before or at the early stage of production. The selection of attributes has to be made carefully and is of great importance in our research problem. This limited access to movie-related information makes our movie hit prediction task difficult and distinctive. After elimination, we are left with only four numerical and eight nominal features for the actual prediction. The selected features are shown in Table 1.

Table 1 Movie intrinsic features (nominal & numerical)

All the selected features are general in nature, and it is difficult to infer quantitative measurements about the movie from them, which makes our task more challenging. Some of the listed features are nominal or categorical, and we have converted them into numerical values. A feature such as “release_month” has 12 different possible values. Features with more than two possible values are converted into n-binary features, where n is the number of possible values.
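The n-binary conversion can be illustrated with a short sketch. The helper names and the shortened genre list are assumptions for the example; a single-valued feature such as release month yields exactly one active bit, while a multivalued feature such as genre may yield several.

```python
def one_hot(value, categories):
    """Encode a single-valued nominal feature as len(categories) binary features."""
    return [1 if value == c else 0 for c in categories]


def multi_hot(values, categories):
    """Encode a multivalued nominal feature (e.g. several genres) as binary features."""
    present = set(values)
    return [1 if c in present else 0 for c in categories]


months = list(range(1, 13))             # release_month has 12 possible values
genres = ["action", "comedy", "drama"]  # shortened, illustrative genre list

month_vec = one_hot(3, months)
genre_vec = multi_hot(["drama", "action"], genres)
```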

3.4.1 Feature description

Genre

Genre is one of the most commonly used attributes in predicting movie success. In our study, we used twenty different genre categories. However, not all categories are significant. Most movies belong to more than one category. The most common categories in Indian movies are action, drama, comedy, crime, romance, family, horror, and thriller. Genre is a multivalued attribute; therefore, we convert it into an n-binary feature.

Figure 2 shows the box plot of the relationship between the IMDb rating and genre category. Here we have taken only some of the major categories and kept all the remaining categories under the name ‘other’. Figure 2 shows that the relationship is not strong, since some anomalies are present in the genre data.

Fig. 2 Shows the relationship between movie genre and IMDb rating

Some genres, such as horror and fantasy, have lower average IMDb ratings. On the other hand, movies from genres such as crime, adventure, family, and mystery generally receive a good response from the audience, so their average rating is higher. Movies from the action, comedy, drama, and romance genres show a broad spread in IMDb rating. This analysis indicates that the movie genre affects IMDb ratings: for some genres the effect is prominent, while for others it is not notable.

Release month

The timing of the movie release and the genre of the movie are two significant factors. In Indian culture, different festivals are celebrated by different communities. A producer may consider festival times the most appropriate time to release a movie, although not all types of movies suit such periods. Release month is a multivalued attribute, which we have converted into an n-binary feature. Figure 3 shows the box plot of the relationship between the IMDb rating and release month. From Fig. 3, release month does not appear to affect the movie’s popularity directly.

Fig. 3 A boxplot shows the relationship between movie release month and IMDb rating

Release month is not directly related to IMDb rating. However, it has been observed that the type of movie released is strongly influenced by the cultural events celebrated at different times of the year. Since we observed earlier that IMDb rating and movie genre are related, this establishes an indirect connection between release month and IMDb rating.

Movie language

Language is one of the essential features of Indian movies. Alongside the Hindi movie industry, there exist several regional movie industries. In particular, several popular movie industries are present in South India, each operating in its regional language. The Bengali movie industry is also quite popular in the eastern part of India. In our study, we have considered 15 different languages. Covering movies from different parts of India makes our task much more challenging. Movie language is a multivalued attribute, which we have converted into an n-binary feature. Movies are often released in multiple languages; therefore, we have made some modifications to prioritize the language of origin of a particular movie.

Figure 4 presents the box plot of the relationship between the IMDb rating and movie language. The figure shows that the median rating of movies in some languages, such as Gujarati, Kannada, Malayalam, and Marathi, is generally high. In contrast, movies in Oriya, English, and Hindi have low IMDb ratings. Thus, movie language plays a significant role in IMDb rating.

Fig. 4 Boxplot presents the relationship between IMDb rating and movie language

Movie duration

Movie duration is one of the numerical attributes we have selected. The runtime of the majority of movies falls between 90 and 180 minutes. In Fig. 5, we have plotted movie runtime against IMDb rating. Figure 5 shows a clear relationship: as the movie duration becomes longer, the IMDb rating also increases.

Fig. 5 A scatter plot presents the relationship between the movie duration and IMDb rating

Movie budget

The movie budget is one of the essential numerical attributes. It is strongly correlated with box office revenue (Deniz and Hasbrouck, 2012 [14]; Litman, 1983 [26]). In our work, we have considered a long period, from 1990 to 2019. Over these 30 years, movie budgets have increased significantly in the Indian movie industry (Fig. 6). We also considered different regional movie industries, whose budgets are not comparable with one another. The budgets of Bollywood and South Indian films are comparatively higher than those of other regional movies.

Fig. 6 Represents the relationship between the movie budget and production year

Figure 7 shows the relationship between the movie budget and IMDb rating. Higher-budget movies have a general tendency to receive a higher rating. However, the connection between the movie budget and IMDb rating is very weak.

Fig. 7 A scatter plot represents the relationship between the movie budget and IMDb rating

Released year

In our work, we have considered Indian movies of the last three decades. Depending on the film’s release year, the impact of other movie-related parameters may also vary. Figure 6 shows how the movie budget increases over the period.

3.5 Derived features

In the movie industry, the presence of successful star performers always makes a difference. Each individual plays an essential role in the movie. The hero, heroine, music composer, and director are the most influential performers. Popular and trending heroes, heroines, composers, and directors have a positive influence on the movie’s success.

Facebook page likes

As described earlier, the number of likes on the official Facebook page of an actor or actress indicates their popularity on social media. Facebook promotion also has a significant impact on the success of the movie. In our work, we have introduced the number of Facebook likes as an attribute.

Popularity rank (PR)

As discussed, the popularity and acceptance of a star is a significant factor in whether a movie becomes a hit. We propose a way to compute the popularity rank of each of the three main actors, the director, the composer, and the producer. The popularity rank of each star performer \( P_k \) lies between 0 and 10. For each \( P_k \), we consider every movie \( M_j \) in which they performed. Suppose a performer \( P_k \) has performed in the set of movies \( M^k={\left\{M_j^k\right\}}_{j=1}^{N^k} \), where \( N^k \) is the number of movies in which \( P_k \) participated. The popularity rank \( \mathrm{PR}(P_k) \) of the performer is then calculated using formula (1). The Popularity Factor (PF) and Year Coefficient (YC) are defined next.

$$ \mathrm{PR}\left({P}_k\right)=\frac{\sum_{\mathrm{j}=1}^{N^k} YC\left({M}_j^k\right)\ast PF\left({M}_j^k\right)}{N^k} $$
(1)

Popularity factor(PF)

We consider the IMDb rating (R), gross income (Gi), global gross income (Gg), and the number of votes (V) as the parameters for estimating the popularity or success of a movie (Mj). The popularity factor PF(Mj) measures the success of a movie Mj. We have defined PF as follows (Table 2):

Table 2 Popularity factor PF(Mj) of a movie Mj and satisfied condition

We have considered four parameters. According to our observation, we have set the following priority sequence: R, Gi, Gg, and V. Among these, the rating (R) has the highest priority. The number of votes (V) has the lowest priority, since it only shows how many viewers posted a rating. The Popularity Factor (PF) values have been set according to the following rules.

  1. Total number of conditions satisfied.

  2. Priority of the condition.

For a few movies, the values of Gi, Gg, and V are abnormally high, which inflates the mean. We therefore selected the median as the cut-off for each parameter, so that an equal number of samples lies on each side of the cut-off.
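The effect of a few extreme values on the cut-off can be seen in a small example. The vote counts below are made up for illustration and are not drawn from the dataset; the sketch only shows why a median cut-off splits the sample evenly while a mean cut-off does not.

```python
from statistics import mean, median

# Illustrative vote counts with one blockbuster outlier (not from the dataset).
votes = [120, 300, 450, 700, 900, 250_000]

mean_cutoff = mean(votes)      # pulled far upwards by the single outlier
median_cutoff = median(votes)  # stays in the middle of the sample

# Number of movies that would satisfy the "above cut-off" condition.
above_mean = sum(v > mean_cutoff for v in votes)
above_median = sum(v > median_cutoff for v in votes)
```

With the mean as cut-off, only the outlier itself passes the condition, whereas the median leaves half of the movies on each side.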

Year coefficient(YC)

The year coefficient regulates the credit depending on how recent the movie is. An old hit movie and a recent hit movie should not earn the same popularity points. Therefore, more credit is given for a recent movie than for an old one (Table 3).

Table 3 Year coefficient values by year of release of the movie

The year coefficient is necessary to weight the movie’s success according to how recent the movie is. For all movies before 2010, the YC(Mj) value is 0.50, which means they receive half of the popularity factor PF(Mj). YC is a vital component, since a recent hit movie gives more publicity to an actor, director, or composer than a hit movie produced 3 to 4 years earlier.
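Equation (1) can be sketched in a few lines. Only the pre-2010 coefficient of 0.50 is stated in the text; the coefficients for later years and the PF values in the example are placeholders standing in for Tables 3 and 2, and the function names are hypothetical.

```python
def year_coefficient(year: int) -> float:
    """Year coefficient YC. Only the pre-2010 value (0.50) is stated in the text;
    the values for later years are placeholders standing in for Table 3."""
    if year < 2010:
        return 0.50
    if year < 2015:
        return 0.75  # hypothetical
    return 1.00      # hypothetical


def popularity_rank(movies) -> float:
    """Equation (1): mean of YC(M) * PF(M) over the performer's movies.

    `movies` is a list of (release_year, popularity_factor) pairs.
    """
    if not movies:
        return 0.0
    return sum(year_coefficient(y) * pf for y, pf in movies) / len(movies)


# Illustrative filmography with made-up PF values.
filmography = [(2005, 8.0), (2012, 6.0), (2018, 9.0)]
rank = popularity_rank(filmography)
```

Because the sum is divided by the number of movies, a performer with many older hits can still rank below one with fewer recent hits.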

In our work, we determined the popularity rank of the three main artists in the movie (pop_rank_act1, pop_rank_act2, pop_rank_act3), of the director (pop_rank_dir), of the composer (pop_rank_com), and of the producer (pop_rank_prod) (Tables 4 and 5). Figure 8 presents the relationship between the popularity rank of actor one (main cast) and the movie’s IMDb rating. Popularity rank shows a clear pattern: as it increases, the movie’s rating also increases. Figure 9 displays the relationship between the director’s popularity rank and the IMDb rating of the movie. Here too, the movie rating increases steadily with the popularity rank.

Table 4 Derived features
Table 5 New derived features sample values
Fig. 8 Represents the relationship between actor one popularity rank and IMDb rating

Fig. 9 Represents the relationship between director popularity rank and IMDb rating

Star experience

In the movie industry, the experience and popularity of star performers influence potential viewers. Every senior star actor, composer, or director has their own fan following. The fame of a star boosts the demand for the movie. In our work, we have measured the experience of each key-role player. For each performer, we count the total number of movies in which they performed between 1990 and 2019. Star_Exp indicates how experienced the actor or actress is. Figure 10 shows the relationship between the IMDb rating and the experience of the actor. The number of low-rated movies decreases as the experience of the actor increases. The film’s rating is highly unpredictable for actors who are newcomers or have little experience in the industry, whereas for an experienced actor the probability of a highly rated movie is significantly higher.
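The experience count can be sketched with a simple tally. The movie titles and performer names below are placeholders, and the dictionary layout is an assumption for the example rather than the structure of the actual dataset.

```python
from collections import Counter

# Illustrative (movie, lead actor) credits; names are placeholders.
credits = [
    ("Movie A", "Actor X"),
    ("Movie B", "Actor X"),
    ("Movie C", "Actor Y"),
    ("Movie D", "Actor X"),
]

# Star_Exp: number of movies in the dataset in which each performer appears.
star_exp = Counter(actor for _, actor in credits)

# Attach the experience of the lead actor to each movie as a feature.
features = {movie: star_exp[actor] for movie, actor in credits}
```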

Fig. 10 Represents the relationship between actor one experience and IMDb rating

3.6 Movie data labelling (hit/flop)

Motion pictures are a form of art, and their success is not always reflected at the box office. We believe different parameters exist to measure the success of a film, and box office revenue cannot be the only one. In our work, we have considered motion pictures from different parts of India. The budget and the revenue of movies vary and depend on the movie industry. The budget and income of a film in the Bollywood or Kannada cinema industry are generally higher than in other regional movie industries. Therefore, defining a revenue range [3] to classify movie success is not applicable to Indian movies. Fixing a profit margin [23] would not solve the classification problem either, since the scale of profit is lower for low-budget films and higher for high-budget films.

IMDb rating is one of the significant criteria for movie success prediction and is widely accepted. Figure 11 presents the histogram of the IMDb ratings of Indian movies. It shows that the rating is close to normally distributed, which contributes to the robustness of the model’s predictions. We have used IMDb rating as one of the primary criteria to determine the movie’s success, together with total gross income and the number of votes. Figure 12 presents the histogram of votes received on the IMDb web portal. Figure 13 shows a consistent relationship between the IMDb rating and the total number of votes received by the movie: a popular movie that receives a high number of votes is also likely to have a higher IMDb rating.

Fig. 11 Distribution of the number of movies against the IMDb rating

Fig. 12 Distribution of the number of movies against the number of votes

Fig. 13 Relationship between the IMDb rating and the number of votes

We have judiciously chosen the IMDb rating as our primary criterion for movie classification. In binary classification, selecting a single point value as the criterion makes the system very strict, so we have used the total vote count as a supporting parameter. The arithmetic median value of the total votes has been taken as the threshold, so that an equal number of samples lies on either side of it. In the given database, the arithmetic median of the total votes (amv) is 657 votes. Gross box office revenue could be another criterion for hit classification. However, for a data set spanning 30 years, a single cut-off value for box office revenue is not logical, because both movie budgets and box office revenues have increased over time. The criteria we have fixed for a successful movie are shown in Fig. 14. According to these criteria, each movie is classified as successful or unsuccessful.

Fig. 14 Flow diagram of movie labelling (Hit/Flop)
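The role of the median vote threshold can be illustrated with a short Python sketch. The ratings, vote counts, the rating cut-off of 7.0, and the combined rule in `label_movie` are all hypothetical placeholders; the actual decision rule is the one given in the flow diagram of Fig. 14, which is not reproduced here.

```python
from statistics import median

# Hypothetical (rating, votes) pairs; the real data set is much larger.
movies = [(7.8, 12000), (6.1, 300), (8.2, 650), (4.9, 45), (7.1, 2100), (5.5, 980)]

# The median splits the samples evenly: as many movies below as above.
vote_threshold = median(votes for _rating, votes in movies)
below = sum(1 for _r, v in movies if v < vote_threshold)
above = sum(1 for _r, v in movies if v > vote_threshold)

def label_movie(rating, votes, rating_cut, vote_cut):
    """Illustrative rule only: hit (1) when both criteria are met, else flop (0)."""
    return 1 if rating >= rating_cut and votes >= vote_cut else 0

RATING_CUT = 7.0  # hypothetical value, not taken from the paper
labels = [label_movie(r, v, RATING_CUT, vote_threshold) for r, v in movies]
print(vote_threshold, below, above, labels)
```

Using the vote count as a second criterion means a highly rated movie with very few votes is not automatically treated as a hit, which is the softening effect the supporting parameter is meant to provide.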

3.7 Dataset description

We have n samples with feature vectors \( X={\left\{{x}_i\right\}}_{i=1}^n \) and class labels \( Y={\left\{{y}_i\right\}}_{i=1}^n \).

Our data set is represented as \( D=\left\{\left({x}_1,{y}_1\right),\left({x}_2,{y}_2\right),\dots,\left({x}_n,{y}_n\right)\right\}={\left\{\left({x}_i,{y}_i\right)\right\}}_{i=1}^{n} \).

Each feature vector is represented as \( {x}_k=\left({x}_{k,1},{x}_{k,2},{x}_{k,3},\dots,{x}_{k,d}\right) \)

where d is the number of selected features of each sample; here d = 26. Of these, only 6 are movie-intrinsic features and 20 are derived features. The whole data set has been labelled according to the rule described earlier. The total number of samples is n = 4043, of which 1688 are positive and 2355 are negative. The proposed model is trained using these n samples. Initially, the data set is split into training and test sets, D = Dtrain ∪ Dtest, with 80% of the data used for training and 20% for testing. Dtrain is fed into the proposed K-fold cross-validation model for training, where it is further divided into training and validation parts in an 80:20 ratio to generate the 5-fold cross-validation training and validation sets. Data splitting is done in an entirely random manner. All the supervised classifiers are trained using the 5-fold cross-validation data sets. Each model receives 2587 training samples and 647 validation samples (Table 6).

Table 6 Data sample description
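The sample counts quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the 80% share is rounded down and that the folds are made as equal in size as possible; both are assumptions about the tooling rather than details stated in the text.

```python
def holdout_sizes(n, train_fraction=0.8):
    """Split n samples into (train, test) sizes, rounding the train share down."""
    n_train = int(n * train_fraction)
    return n_train, n - n_train

def kfold_sizes(n, k=5):
    """Return the (training, validation) sizes of each of the k folds."""
    base, extra = divmod(n, k)
    val_sizes = [base + 1 if i < extra else base for i in range(k)]
    return [(n - v, v) for v in val_sizes]

n_total = 1688 + 2355                      # positive + negative samples
n_train, n_test = holdout_sizes(n_total)   # 80:20 hold-out split
folds = kfold_sizes(n_train)               # 5-fold split of Dtrain

print(n_total, n_train, n_test)  # 4043 3234 809
print(folds[0])                  # (2587, 647)
```

Because 3234 is not divisible by 5, one of the five folds ends up with 2588 training and 646 validation samples under this assumption; the other four match the 2587 and 647 figures reported above.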

4 Proposed work

4.1 K-fold hybrid deep ensemble model (KHDEM)

Our objective is to build a machine learning model that can accept all the movie-related features xi of our research problem. Some are movie-intrinsic features mi, some are derived features related to the movie cast and other key performers di, and some are social media features of the principal performers si. Our feature set is therefore xi = {mi, di, si}. The model must learn from this feature set and accurately predict the outcome for a new, unseen movie. To predict an upcoming movie's success using only pre-production data, we propose a K-fold Hybrid Deep Ensemble learning Model (KHDEM). The model combines different types of classifiers: an ensemble of deep learning models such as CNN and ANN, together with conventional aggregate classifiers such as Random Forest (RF) (Gini, Entropy), ExtraTreesClassifier (ET) (Gini, Entropy), and Gradient Tree Boosting (GB).

$$ \mathrm{EM}={\bigcup}_{\mathrm{i}=1}^{\mathrm{m}}{ME}_i, $$

where EM = {CNN, ANN, KNN, RF_G, RF_E, ET_G, ET_E, GB}.

Our data set contains 4043 movie records.

$$ D=\left(\mathrm{X},\mathrm{Y}\right)={\left\{\left({x}_i,{y}_i\right)\right\}}_{i=1}^{4043} $$

We converted all the nominal and categorical attributes among the movie-intrinsic features into numerical form. After this conversion, each movie is described by 72 parameter values:

$$ {x}_k=\left({x}_{k,1},{x}_{k,2},{x}_{k,3},\dots \dots .,{x}_{k,72}\right). $$

According to our research problem, our model is a binary output model. At the output, it predicts hit or flop: \( {y}_i\in \left\{1,0\right\} \).

In the beginning, we split our data set D into training and test sets: D = Dtrain ∪ Dtest

$$ \mathrm{Dtrain}=\left(\mathrm{Xtrain},\mathrm{Ytrain}\right),\mathrm{Dtest}=\left(\mathrm{Xtest},\mathrm{Ytest}\right) $$

In our proposed model, we train each of the individual ensemble models using K-fold cross-validation.

$$ {\hat{\mathrm{Y}}}_{val}^i={ME}_i\left(\mathrm{Xtrain},\mathrm{Ytrain},\mathrm{Xval}\right) $$

where \( {\hat{\mathrm{Y}}}_{val}^i \) is the predicted probability value of model i after K-fold cross-validation, obtained by averaging over the k folds:

$$ {\hat{\mathrm{Y}}}_{val}^i=\frac{1}{k}{\sum}_{j=1}^k{\hat{\mathrm{Y}}}_{val}^{i,j} $$

Here \( {\hat{\mathrm{Y}}}_{val}^{i,j} \) denotes the prediction of model i in fold j.

From each ensemble member MEi, we obtain predicted probability values on the training set Xtrain. The collected predictions

$$ {\hat{\mathrm{Y}}}_{val}=\left\{{\hat{\mathrm{Y}}}_{val}^1,\kern0.5em {\hat{\mathrm{Y}}}_{val}^2,\dots ,{\hat{\mathrm{Y}}}_{val}^m\right\}={\cup}_{i=1}^m{\hat{\mathrm{Y}}}_{val}^i $$

are used as the input parameters of the final model, and Ytrain is used as the class label of each input sample.

We make the final classification using a traditional Logistic Regression (LR) model, which serves as our blending model. Initially, the LR model is trained on the data set formed by each ensemble member's prediction values on Xval. The final prediction is then made on the data set formed by each ensemble member's predictions on Xtest.

The dataset for the LR model is: BD = BDtrain ∪ BDtest

where BDtrain = (BXtrain, BYtrain) and BDtest = (BXtest, BYtest).

The prediction values of the m K-fold ensemble models are used as the training data of the LR model,

where \( {BX}_{train}={\hat{\mathrm{Y}}}_{val}=\left\{{\hat{\mathrm{Y}}}_{val}^1,\kern0.5em {\hat{\mathrm{Y}}}_{val}^2,\dots ,{\hat{\mathrm{Y}}}_{val}^m\right\}={\left\{{\hat{\mathrm{Y}}}_{val}^i\right\}}_{i=1}^m \) and BYtrain = Ytrain,

and the final prediction is made on the data set \( {BX}_{test}={\cup}_{i=1}^m{\hat{Y}}_{test}^i \) with BYtest = Ytest.

Final prediction: \( {\hat{BY}}_{test} \)= LR(BXtrain, BYtrain, BXtest)

Figure 15 presents the structure of the proposed K-fold HDEM model. First, we cleaned the raw input data and selected the relevant features. Next, we introduced our derived features. The final dataset, with its class labels, is then divided into training and test sets. The training set is further divided into K subsets, each containing training data and validation data. Our proposed HDEM model is trained using these K subsets. Finally, the prediction on the test data is made using the LR model.

Fig. 15 Proposed HDEM model structure
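The blending scheme described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the toy `OneFeatureLearner` stands in for the real base models (CNN, ANN, RF, and so on), the data are synthetic, and the sketch follows one common reading of the procedure in which each training row receives its prediction from the fold where it was held out (giving BXtrain) while the k fold models are averaged on the test set (giving BXtest). All function and class names are hypothetical.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well inside float range
    return 1.0 / (1.0 + math.exp(-z))

class OneFeatureLearner:
    """Toy stand-in for a base model: logistic score on a single feature."""
    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y):
        pos = [x[self.feature] for x, t in zip(X, y) if t == 1]
        neg = [x[self.feature] for x, t in zip(X, y) if t == 0]
        mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        self.mid = (mean_pos + mean_neg) / 2
        self.sign = 1.0 if mean_pos >= mean_neg else -1.0
        return self

    def predict_proba(self, X):
        return [sigmoid(self.sign * (x[self.feature] - self.mid)) for x in X]

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint validation folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]

def blend_features(factories, X_train, y_train, X_test, k=5):
    """Build BX_train (out-of-fold predictions) and BX_test (fold-averaged)."""
    n, m = len(X_train), len(factories)
    bx_train = [[0.0] * m for _ in range(n)]
    bx_test = [[0.0] * m for _ in range(len(X_test))]
    for i, make_model in enumerate(factories):
        for val_idx in kfold_indices(n, k):
            held_out = set(val_idx)
            fit_idx = [r for r in range(n) if r not in held_out]
            model = make_model().fit([X_train[r] for r in fit_idx],
                                     [y_train[r] for r in fit_idx])
            val_probs = model.predict_proba([X_train[r] for r in val_idx])
            for r, p in zip(val_idx, val_probs):
                bx_train[r][i] = p      # each training row is predicted once
            for r, p in enumerate(model.predict_proba(X_test)):
                bx_test[r][i] += p / k  # average the k fold models
    return bx_train, bx_test

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Plain batch gradient descent for the LR blending model."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical toy data: two features, alternating class labels.
rng = random.Random(1)
X = [[rng.random() + (i % 2), rng.random() - 0.5 * (i % 2)] for i in range(50)]
y = [i % 2 for i in range(50)]
X_train, y_train, X_test, y_test = X[:40], y[:40], X[40:], y[40:]

factories = [lambda: OneFeatureLearner(0), lambda: OneFeatureLearner(1)]
bx_train, bx_test = blend_features(factories, X_train, y_train, X_test)
w, b = fit_logistic(bx_train, y_train)
final = [1 if sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) >= 0.5 else 0
         for row in bx_test]
print(final, y_test)
```

Because every row of BXtrain comes from a model that did not see that row during fitting, the LR blender is trained on predictions that resemble what the base models will produce on unseen test data.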

4.2 Proposed algorithm

figure a

4.3 Proposed CNN model

The proposed CNN model is a 1D-CNN architecture consisting of four convolutional layers and one dense layer. Its structure was selected experimentally by trial and error. The first layer takes a 72 × 1 input array comprising all features and uses 128 filters with kernel size 4, the ReLU activation function, a dropout rate of 0.1, and max-pooling with a pool size of 2. The next layer keeps the 128 filters, the ReLU activation function, and the dropout rate of 0.1, and this configuration is repeated three times. A flatten layer follows. The last layer is a dense layer of size two, one unit for each of the 2 predictive classes. We experimented with multiple Keras optimizers, including Adam, SGD, and RMSprop, and selected the RMSprop optimizer because it gave better accuracy. Figure 16 depicts the topology of the proposed 1D-CNN.

Fig. 16 Topology of the proposed 1D-CNN
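To make the layer description more concrete, the following sketch traces how the length of the 72 × 1 input changes through four convolution and pooling blocks. Several details are assumptions, since the text does not state them: a stride of 1, max-pooling of size 2 after every convolutional layer, and the padding mode, for which both common options ("valid" and "same") are shown. The helper names are hypothetical.

```python
def conv1d_len(n, kernel, padding):
    """Output length of a stride-1 1D convolution."""
    return n if padding == "same" else n - kernel + 1

def maxpool1d_len(n, pool):
    """Output length of non-overlapping max-pooling (stride equal to pool size)."""
    return n // pool

def trace_cnn(input_len=72, blocks=4, filters=128, kernel=4, pool=2, padding="valid"):
    """Return the (length, channels) shape after every conv + pool block."""
    shapes = [(input_len, 1)]
    n = input_len
    for _ in range(blocks):
        n = maxpool1d_len(conv1d_len(n, kernel, padding), pool)
        shapes.append((n, filters))
    return shapes

for padding in ("valid", "same"):
    shapes = trace_cnn(padding=padding)
    flat = shapes[-1][0] * shapes[-1][1]
    print(padding, shapes, "flatten ->", flat, "dense -> 2")
```

Under these assumptions the flatten layer feeds 128 values into the final dense layer with "valid" padding and 512 values with "same" padding; dropout does not change any shape.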

4.4 Proposed ANN model

The proposed ANN architecture consists of one input layer, five hidden layers, and one dense output layer. Its structure was selected experimentally by trial and error. The input layer has 72 input neurons comprising all features. We adopted the ReLU activation function and a dropout rate of 0.5. The next hidden layer keeps 256 neurons, the same ReLU activation function, and the dropout rate of 0.5, and this configuration is repeated four times. Finally, the output layer covers the 2 predictive classes and uses the sigmoid activation function. We experimented with multiple Keras optimizers, including Adam, SGD, and RMSprop, and selected the SGD optimizer because it gave better accuracy.
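As a rough size check, the number of trainable parameters of such a network can be computed directly from the layer widths. The sketch assumes that all five hidden layers have 256 neurons and that every dense layer carries a bias term; the text states the width explicitly only for the repeated hidden layers, so the first hidden layer's width is an assumption.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# 72 inputs, five hidden layers of 256 units (assumed), 2 output units.
layer_sizes = [72] + [256] * 5 + [2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [18688, 65792, 65792, 65792, 65792, 514]
print(total)      # 282370
```

Dropout layers add no parameters, so they do not appear in this count.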

5 Experimental results and analysis

To classify movie success, we implemented a variety of individual machine learning models, including Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF), as well as Deep Learning (DL) models such as ANN and Convolutional Neural Network (CNN). After implementing these conventional ML and DL models, we propose our KHDEM model, which outperforms all of them. The performance of each model is measured using the following four metrics: (1) Classification accuracy, the percentage of correctly predicted instances. (2) Precision, the number of instances correctly classified as successful divided by the total number of instances classified as successful. (3) Recall, the number of instances correctly classified as positive divided by the number of instances that are actually positive. (4) Loss, which measures the distance between the predicted output and the expected output. Higher values indicate better performance for accuracy, precision, and recall, whereas a lower loss is better.
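The four metrics can be written out as a short, self-contained Python sketch. The labels and probabilities below are made-up illustration values, and binary cross-entropy is assumed as the loss function, since the text does not name the specific loss.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) with 1 = hit (positive) and 0 = flop."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def log_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy (assumed loss); lower values are better."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)

# Hypothetical labels and predicted hit probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(accuracy(tp, fp, fn, tn), precision(tp, fp), recall(tp, fn))  # 0.75 0.75 0.75
print(log_loss(y_true, y_prob))
```

In this toy example one hit is missed (a false negative) and one flop is predicted as a hit (a false positive), so the three ratio metrics happen to coincide at 0.75.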

5.1 Performance without derived features

In our work, we first compared our proposed KHDEM model with four machine learning models without incorporating the new derived features. These comprise two Deep Learning (DL) models, a multi-layer perceptron (ANN) and a CNN, and two conventional machine learning (ML) models, Random Forest (RF) and Logistic Regression (LR). Table 7 shows the performance of each model without the new derived features. The table makes clear that our proposed model performs better than all the other ML models.

Table 7 Performance of the machine learning models without the new derived features

5.2 Performance with derived features

Next, we introduced all our new derived features into the feature set and ran the previous four ML and DL models and the proposed KHDEM again. Table 8 presents the performance of each model with the new derived features. Introducing the new derived features improves the performance of every model on all measured criteria. Here, too, the proposed KHDEM model outperforms all the ML and DL models.

Table 8 Performance of the machine learning models with the new derived features

Table 9 compares the accuracy with and without the new derived features. For example, the proposed KHDEM model's accuracy improved by 17.62%, and the CNN model's accuracy increased by 18.82%. Similarly, the accuracy of the ANN, Random Forest (RF), and Logistic Regression (LR) models increased by 23.33%, 22.1%, and 20%, respectively. Table 9 makes clear how significant our newly introduced features are. Figure 17 compares model accuracy with and without the new derived features.

Table 9 Accuracy of each model with and without the new features, and the accuracy improvement

Fig. 17 Comparison of model accuracy with and without the new derived features

5.3 Performance of KHDEM model

Our proposed KHDEM model is a hybrid combination of DL models with conventional simple ensemble models. Unlike a voting classifier or a plain ensemble learning model, our HDEM model uses Logistic Regression (LR) to make the final classification. Initially, the LR model is trained on the data set formed by each ensemble member's prediction values on Xval. The final prediction is made on the data set formed by each ensemble member's predictions on Xtest. Table 10 shows the accuracy of each model in the K-fold HDEM, along with the final prediction accuracy of our proposed model, which is the output of the LR model. The internal models report accuracy on Xval, whereas the final prediction is based on the overall test data Xtest. Figure 18 compares the performance of each internal model with the overall K-fold HDEM model performance.

Table 10 Detailed performance results of the proposed HDEM model

Fig. 18 Performance of each internal model and overall K-fold HDEM model performance

5.4 Comparative analysis

In this section, we present a comparative analysis of our proposed model with four established models. 1) Ahmed et al. (2019) [3] used a voting system to make predictions at the early production stage. They proposed new features to increase prediction accuracy, framed movie prediction as an eight-class problem, and achieved 85% accuracy. 2) Abidi et al. (2020) [1] inspected each movie’s attributes and selected features relevant to early-stage movie prediction. They implemented five different machine learning models with binary classification and achieved a maximum of 76.6% accuracy with the Generalized Linear Model (GLM). 3) Michael and Kang (2016) [23] proposed novel features and predicted movie success at an early stage of movie production. They evaluated machine learning models multiple times with different success criteria and achieved a maximum of 90.4% accuracy using binary classification with the Random Forest model. 4) Verma and Garima (2019) [44] proposed “music rating” as one of the most important features for predicting movie success. They achieved 87.0% accuracy with the Random Forest model using binary classification. Our proposed model outperforms all of these models. Table 11 presents the performance of each model discussed. The proposed KHDEM model delivered 96% accuracy, a 5.5% improvement over the nearest model, that of Michael and Kang (2016) [23]. Figure 19 compares the performance of the HDEM model with the other models.

Table 11 Comparison of the HDEM model with other models

Fig. 19 Performance comparison of the HDEM model with other models

6 Discussion

Our article focuses on Indian movies, an industry that combines several regional movie industries. As a result of this combination, the Indian movie industry produces the largest number of movies every year. Hollywood is the most famous and most significant movie industry in the world. Movies are a popular form of art, and almost every country has its own movie industry, many of which are famous worldwide. Based on the significance and popularity of their industries, we can name countries such as China, the United Kingdom, Japan, France, South Korea, and Germany.

In our work, we have used the IMDb dataset and considered its movie ratings. Besides IMDb, multiple movie rating and information sites are available, such as Rotten Tomatoes, Douban, Metacritic, and many more. Rotten Tomatoes is one of the most popular American movie rating websites; it collects critics’ reviews from newspapers and news articles. Douban is a Chinese website dedicated to Chinese movies. Metacritic is a relatively new website, created in 1999, that also aggregates movie information; it summarizes the critics’ recommendations with a colour of green, yellow, or red.

7 Conclusions and future works

A prediction made as soon as the cast is finalized would assist investors in making financial decisions. In this work, we have proposed a K-fold HDEM model to predict movie success at an early stage of movie production. Our work focuses exclusively on Indian movies. We collected 30 years (1990–2019) of Indian movie information, including all the regional movies. The prepared data set is publicly available (https://www.kaggle.com/sahusandipan/imdb-indian-movie-dataset). We analyzed the historical data in our dataset and proposed several new features to improve performance. Because of the enormous diversity of Indian cinema, determining success criteria is a challenging job. This work establishes success criteria that are best suited to Indian movies. We performed our study on more than 4000 movies, and our proposed machine learning model delivered 96% accuracy, outperforming the previous works considered in our comparison.

Moreover, data mining on IMDb is challenging because a film has many features of various scopes, and only a few attributes can be used for early-stage prediction. We need to judge whether each feature can be available at the beginning of movie production. In our work, we carefully picked only six movie attributes, which makes the prediction task much more challenging.

In future work, we plan to classify movie popularity into multiple classes. We also plan to consider age-wise rating information and predict the popularity of an upcoming movie among viewers of different age groups.