1 Introduction

The movie industry has been expanding across the world for a long time. Movies are a source of entertainment that people have built an interest in and a desire to learn from as well as enjoy. In earlier times, television was the only medium through which people could entertain themselves, but over time the movie industry emerged and introduced another platform for happiness and entertainment. The movie industry also generates jobs, revenue, and infrastructure development in its locations, and its impact on the worldwide economy has increased exponentially over time.

For making a movie, various attributes are taken into consideration, such as genre, cast, writer, director, and producers. The movie industry supports and presents every kind of content, from comedy and thriller to inspirational and devotional material. Every year, thousands of movies are released, and each of them is declared either a hit or a flop [13]. The hit/flop label is decided by the net profit, which is net income (movie rights + tickets sold) minus net cost (making cost + promotions). Hence, if the net profit is greater than zero, the movie is a hit; otherwise, it is a flop. The Internet Movie Database (IMDb) is a platform that maintains a complete database of movies, including all of their attributes. Data mining techniques can be applied to this data to study its variations and thus predict the hit/flop outcome of a movie even before its release [2]. Like IMDb, there is another platform named Box Office India (BOI), where similar information is available but is restricted to Bollywood movies only. Thus, the data used in this study has been extracted from Box Office India (BOI) and the Internet Movie Database (IMDb). These two platforms provided all the necessary attributes required for movie hit/flop prediction, such as genre, budget, gross, cast, directors, and writers.
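As a minimal sketch of this labeling rule (all variable names are hypothetical, not taken from the original dataset):

```python
# Minimal illustration of the hit/flop rule: a movie is a hit when
# net income (movie rights + tickets sold) exceeds net cost
# (making cost + promotions). Variable names are hypothetical.
def label_movie(movie_rights, tickets_sold, making_cost, promotions):
    net_income = movie_rights + tickets_sold
    net_cost = making_cost + promotions
    return "hit" if net_income - net_cost > 0 else "flop"

print(label_movie(40.0, 120.0, 90.0, 20.0))  # -> "hit" (net profit = 50 > 0)
```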

This manuscript presents research on the following question: can we predict the success or failure of an upcoming or new movie by analyzing the given attributes? In this study, the dataset was created with all the desired attributes, extracted and merged from two platforms, IMDb and BOI. Various algorithms were implemented on the desired attributes. Baseline models such as the support vector machine (SVM) and k-nearest neighbors (KNN) were first trained on the dataset. After evaluating the baseline models, different permutations and combinations of them were assembled into ensembles in order to improve the accuracy. The ensembles gave comparatively better results than the baseline models. Along with these combinational ensembles of baseline models, some pre-defined ensembles were also tested; the paper concludes that the pre-defined ensembles outperformed all the previously obtained results and gave the highest accuracy. Thus, the prepared dataset can yield the best accuracy using different methodologies, namely the baseline models and the ensembles (ref. to Sect. 5). In the future, we would like to obtain more efficient algorithms with improved accuracy on this dataset.

Therefore, of all the research and implementation in our preliminary tests, the most successful and accurate algorithms are explained in this study. The layout of our study is as follows: Sect. 2 defines the motivation and contributions of the paper; Sect. 3 contains the related works; Sect. 4 contains the material, followed by Sect. 5 describing the methods used in this study. Section 6 explains the results and analysis of the methodologies adopted. Section 7 discusses the conclusion and future scope of the work.

2 Motivation and contribution outline

The motivation behind the work performed in this paper lies in the need to predict the success/failure of a movie beforehand to enhance the growth of the film industry. Predicting the movie rating would help the film industry target the audience accordingly and manage costs. The contributions of the paper are as follows:


1. The paper proposes various ensembles to predict the success/failure of a movie, which outperform the existing work.

2. It also presents an analytical view of the attributes that affect the movie rating.

3. The paper also shows a comparison with the existing work presented by various authors and highlights the shortcomings of that work which this study addresses.

3 Related works

In the literature, studies have been done on movie prediction using various methodologies, with different recorded accuracies and model implementation techniques. Table 1 lists the different studies done on movie success prediction. Doshi et al. [7] explained movie prediction using sentiment analysis and attained an average accuracy of 80%. Quader et al. [24] obtained the highest accuracy of 89.27% using a neural network for analyzing movie box office success rate. The authors in [31] also studied and implemented various machine learning models and achieved an accuracy of 96.81% on test analysis of reviews gathered from different resources. In [9], the authors used a PT-NT ratio as their technique and gained an accuracy that is quite low in comparison with others using the sentiment analysis methodology. Raj et al. [5] studied movie prediction, implemented various machine learning models such as random forest and linear regression, and attained the highest accuracy of 92.08% with the random forest algorithm.

Table 1 Existing contributions in movie rating analytics

One of the well-known studies on movie success rate prediction was done by Latif et al. [14]. They gained the highest accuracy of 84.34% after implementing various machine learning algorithms. The authors in [10] also studied various algorithms to predict the movie success rate and obtained 68.8% accuracy on applying the SVM algorithm. Later came the contribution by Verma et al. [27], who implemented various machine learning models and reported accuracies ranging from 80 to 90%, with random forest at the top of all the other models applied. The authors in [23] also implemented some baseline models for movie success prediction. They compared different decision tree algorithms and ranked J48 as the highest among all the algorithms in this category, with an accuracy of 84.67%.

Later, the authors in [17] worked on movie success rate prediction. They used different classifiers to study the success rate through sentiment analysis, i.e., from reviews and other comments from different sections, and gained a rounded accuracy of 82.99% as their best score.

Table 1 shows all such studies along with the methodology and algorithms used. Existing studies [21] have also made attempts to analyze the sentiment of IMDb reviews and determined the audience's viewpoints on various aspects of movies [22]. Across all of this research, a number of gaps remain to be filled by considering a greater number of parameters and attributes in order to attain better results.

4 Material

4.1 Ensemble learning algorithms

Ensemble learning algorithms use a very basic idea: they combine the decisions from multiple models to improve the overall performance [6]. Ensembling is a well-known technique used to improve the performance of a model, and there are various techniques for doing so.

In this study, two popular ensemble techniques, max voting and boosting, were used. Based on these techniques, the following algorithms were studied:

1. Voting classifier

2. Gradient boosting

3. AdaBoost

4. XGBoost (XGBM)


4.1.1 Voting classifier

In this study, three ensembles were made, one of which applied max voting, a general ensemble technique [1]. Ensemble 1 comprises a support vector machine (SVM), K-nearest neighbors (KNN), and Naïve Bayes at the inner layer, with a voting classifier at the outer layer for training on the dataset. This study uses 10-fold cross-validation, as depicted in Fig. 1; a code sketch is given below.

Fig. 1 Ensemble 1 using max voting technique

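A minimal sketch of Ensemble 1 in scikit-learn, assuming X and y hold the prepared features and hit/flop labels (these names, and the default hyperparameters of the inner models, are assumptions):

```python
# Sketch of Ensemble 1: SVM, KNN, and Naive Bayes at the inner layer
# under a max-voting (majority-vote) classifier, evaluated with
# 10-fold cross-validation as in Fig. 1. X, y are assumed prepared.
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

ensemble_1 = VotingClassifier(
    estimators=[
        ("svm", SVC()),                  # inner layer
        ("knn", KNeighborsClassifier()),
        ("nb", GaussianNB()),
    ],
    voting="hard",                       # max voting: majority class wins
)
scores = cross_val_score(ensemble_1, X, y, cv=10)  # 10-fold validation
print(f"Mean CV accuracy: {scores.mean():.4f}")
```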

4.1.2 Gradient boosting

Gradient boosting is a machine learning technique used for regression and classification problems [29]. It forms an ensemble by combining many weak models to obtain an accurate result.

Ensemble 2 was made using gradient boosting, one of the advanced ensemble techniques. The ensemble consists of a support vector machine (SVM) and K-nearest neighbors (KNN) at the inner layer, with a gradient boosting classifier at the outer layer. The ensemble uses 1000 estimators with a learning rate of 0.005. Figure 2 shows the inner and outer layers of ensemble 2, and a code sketch is given below.

Fig. 2 Ensemble 2 using gradient boosting

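One way to realize this layered design is a stacking arrangement in which SVM and KNN feed a gradient boosting classifier at the outer layer; the sketch below rests on that assumption, and the X_train/X_test split names are hypothetical:

```python
# Sketch of Ensemble 2: SVM and KNN at the inner layer, gradient
# boosting at the outer layer (1000 estimators, learning rate 0.005).
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier

ensemble_2 = StackingClassifier(
    estimators=[("svm", SVC()), ("knn", KNeighborsClassifier())],
    final_estimator=GradientBoostingClassifier(
        n_estimators=1000, learning_rate=0.005
    ),
)
ensemble_2.fit(X_train, y_train)          # X_train, y_train assumed
print(ensemble_2.score(X_test, y_test))   # held-out accuracy
```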

4.1.3 AdaBoost

AdaBoost is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases [16]. This is considered to be one of the most accurate and precise algorithms among all the others.

Ensemble 3 was made using another boosting technique, named AdaBoost. This ensemble consists of a support vector machine (SVM) and K-nearest neighbors (KNN) at the inner layer, with AdaBoost at the outer layer. The model has been trained using a learning rate of 0.005 with 1000 estimators. As shown in Fig. 3, the model uses SVM and KNN as the base and AdaBoost as the training classifier; a code sketch is given below.

Fig. 3 Ensemble 3 using AdaBoost classifier

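As with Ensemble 2, a stacking arrangement is one way to place SVM and KNN at the base with AdaBoost as the outer training classifier; this is a sketch under that assumption, with hypothetical split names:

```python
# Sketch of Ensemble 3: SVM and KNN as the base, AdaBoost as the
# outer training classifier (1000 estimators, learning rate 0.005).
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import StackingClassifier, AdaBoostClassifier

ensemble_3 = StackingClassifier(
    estimators=[("svm", SVC()), ("knn", KNeighborsClassifier())],
    final_estimator=AdaBoostClassifier(n_estimators=1000, learning_rate=0.005),
)
ensemble_3.fit(X_train, y_train)          # X_train, y_train assumed
print(ensemble_3.score(X_test, y_test))   # held-out accuracy
```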

4.1.4 XGBoost

XGBoost is an optimized version of gradient boosting that is highly efficient, flexible, and portable [3]. It helps in implementing machine learning algorithms under the gradient boosting framework and is a relatively fast and accurate way to solve various problems. The attributes of the dataset are listed in Table 2.

Table 2 List of attributes of the dataset obtained
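A minimal sketch of training and testing XGBoost on the dataset (the hyperparameters shown are assumptions, since the study does not state them for this model; the xgboost package is assumed installed):

```python
# Sketch of the XGBoost classifier used to train and test the dataset.
# Hyperparameters are illustrative; y labels assumed encoded as 0/1.
from xgboost import XGBClassifier

xgb_model = XGBClassifier(n_estimators=1000, learning_rate=0.005)
xgb_model.fit(X_train, y_train)          # X_train, y_train assumed
print(xgb_model.score(X_test, y_test))   # held-out accuracy
```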

4.2 Supervised learning models

Supervised learning models are machine learning models that learn a mapping function from inputs to outputs by training on labeled data and then testing on various inputs. The machine learning models used in this study are as follows:

1. Support vector machines (SVM)

2. Naïve Bayes

3. K-nearest neighbors (KNN)

4. AdaBoost

5. Random forest

6. Ensemble learning algorithms

7. Neural network algorithms

In this study, these supervised learning algorithms were implemented to train on the dataset and were then tested to predict the success rate.

4.2.1 Support vector machines (SVM)

The support vector machine is a supervised learning model with an associated learning algorithm that analyzes data for classification and regression analysis [25]. This algorithm is used to solve both classification and regression problems.

4.2.2 Naïve Bayes

Naïve Bayes is a classifier algorithm based on Bayes' theorem with an assumption of independence among predictors [26]. It assumes that the presence of a feature in a class is unrelated to the presence of any other feature.

4.2.3 K-nearest neighbors (KNN)

K-nearest neighbors is another algorithm used to solve both classification and regression problems. The KNN algorithm assumes that similar things exist in close proximity [8].

4.2.4 Random forest

Random forest is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [15]. It is a classifier that builds many classification trees, each constructed using a random subset of the features, and aggregates their predictions. In this study, 100 trees are used in the random forest algorithm; a minimal setup is sketched below.
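A minimal sketch of the random forest setup with 100 trees, assuming the same hypothetical train/test split names as above:

```python
# Sketch of the random forest used in this study (100 trees).
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100)  # 100 trees in the forest
rf_model.fit(X_train, y_train)
print(rf_model.score(X_test, y_test))                # held-out accuracy
```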

4.3 Neural network algorithms

Neural networks are one of the learning algorithms used within machine learning. They consist of different layers to predict and analyze the data [12]. The multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN) [30] with an input layer, one or more hidden layers, and an output layer. In this study, four hidden layers with 30 neurons each were used to predict the movie success rate, as sketched below.
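A minimal sketch of this MLP configuration in scikit-learn (max_iter is an assumption for convergence; split names hypothetical):

```python
# Sketch of the MLP with four hidden layers of 30 neurons each.
from sklearn.neural_network import MLPClassifier

mlp_model = MLPClassifier(hidden_layer_sizes=(30, 30, 30, 30), max_iter=1000)
mlp_model.fit(X_train, y_train)
print(mlp_model.score(X_test, y_test))  # held-out accuracy
```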

5 Methodology

In order to predict the success rate of a movie, a well-prepared dataset is required to obtain more accurate and precise results. The dataset used in this study was taken from the Internet Movie Database (IMDb) and Box Office India (BOI) and contains attributes such as the title, actor, director, writer, genre, month of release, budget, IMDb rating, and duration. To obtain results, the dataset first needed to be properly transformed and then used for implementation. Below is an overview of how the dataset was made, followed by the methods used in this study.

5.1 Dataset description

The dataset initially contained 81,273 records. The data was prepared in three stages:

1. Data extraction

2. Data processing

3. Feature engineering

Figure 4 depicts the sequence followed for the formation of the dataset: the data was first extracted, then processed to retrieve the useful information, and finally transformed so that it contained only those attributes used for the implementation of the various models. Initially, the dataset contained many missing values and other challenges (discussed in Sect. 5.1.3), which posed a significant obstacle to movie prediction accuracy. These obstacles were resolved using various data transformation techniques, which helped prepare the final dataset.

Fig. 4 Dataset flowchart

5.1.1 Data extraction

At the initial stage, the dataset was extracted from the IMDb and BOI websites [28]. The dataset contained different attributes such as the title, IMDb id, genre, duration, date of release, month of release, actor/actress, directors, writers, producers, co-actors, IMDb rating, BOI rating, and budget. The individual datasets from IMDb and BOI were then consolidated into one dataset. The extracted dataset comprised 81,273 movies with 22 features, covering Bollywood, Hollywood, and the South Indian film industry.

5.1.2 Data processing

Once the dataset is analyzed, it is preprocessed in order to remove redundant data [4]. The dataset contained both Hollywood and Bollywood movies; as this study is based entirely on Bollywood movie success rate prediction, all Hollywood records were removed. The dataset then comprised 5826 movies, out of which only movies released after 2000 were taken into consideration. After removing all the duplicate entries, the dataset still contained various non-common attributes such as the BOI rating and co-actors, which were removed to eliminate inconsistency. Thus, the merged dataset was processed to remove all duplicate records as well as the columns that were not common to both sources. This gave an improved dataset containing data on 1951 movies.

5.1.3 Feature engineering

After processing the dataset, various transformations were made: some attributes such as the date of release, producers, and writers were removed so that the study could concentrate on the key features that play a significant role in predicting movie success [32]. After this, the final dataset consisted of 1951 movies with 14 different features.

The next challenge was to fill in the missing values in the dataset in order to maximize the success rate. To overcome this challenge, the dataset was first visualized for all the missing values by plotting graphs between the features and the budget of the movies. The missing values were then divided into two classes: some were filled with the mean calculated from the complete dataset, while the remaining missing values were filled with the mode of the respective attributes, as sketched below.
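A minimal sketch of this two-class imputation with pandas, assuming the merged data sits in a DataFrame df (the column names used here are hypothetical):

```python
# Sketch of the imputation step: numeric gaps filled with the dataset
# mean, categorical gaps with the per-attribute mode.
import pandas as pd

df["budget"] = df["budget"].fillna(df["budget"].mean())       # mean fill
df["genre"] = df["genre"].fillna(df["genre"].mode().iloc[0])  # mode fill
```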

Further, one-hot encoding was performed on the dataset, and scaling was done after studying the variation of the different attributes. The next step was to find the key attributes that affect the outcome of the models, which was done with the help of a correlational heat map. The heat map depicts the relationships among the attributes with the highest correlation among themselves; the final features that were considered are shown in the heat map along with their correlations. At the final stage, five attributes were considered: genre, month of release, IMDb rating, duration, and budget. The final dataset thus obtained contained 1951 movies with 5 key attributes.
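A minimal sketch of the encoding, scaling, and heat map steps (column names hypothetical; seaborn and matplotlib assumed available):

```python
# Sketch of one-hot encoding, scaling, and the correlational heat map.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

df = pd.get_dummies(df, columns=["genre", "month_of_release"])  # one-hot
num_cols = ["imdb_rating", "duration", "budget"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])     # scaling
sns.heatmap(df.corr(numeric_only=True), annot=True)             # heat map
plt.show()
```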

After obtaining the final dataset, the next task was to implement the different algorithms on it to obtain results for movie prediction. Figure 5 shows the overall flowchart of the complete study and gives an idea of how the various machine learning algorithms and ensemble models were implemented on the dataset in sequential order.

Fig. 5 Overview flowchart

6 Results and discussion

Initially, the prepared dataset was engineered down to the best features, as shown in Figs. 6, 7, 8, 9, and 10. The analysis drawing various insights into the impact of the chosen factors on the success of a movie is discussed below.

Fig. 6 Plot on the basis of rating

Fig. 7 Plot on the basis of month

Fig. 8 Plot on the basis of genre

Fig. 9 Plot on the basis of duration

Fig. 10 Correlational heat map of the attributes

Figure 6 depicts the variation in the IMDb ratings of movies released between 2000 and 2019. Rating is one of the most important attributes in predicting the movie success rate, as audience reviews are essential for training the models before using them for prediction. The variation in rating was used in scaling this attribute of the dataset, making it one of the most important attributes in predicting the success of a movie.

Figure 7 shows the variation in the number of movies released in different months; the most movies are released in September and the fewest in December. This indicates that the month of release can be one attribute used to predict the movie success rate, as a greater number of movies released in a particular month, and releases near festivals, attract a large audience. The audience therefore plays an important role in movie success prediction.

Figure 8 shows the number of movies and the genres associated with them; multiple genres can be associated with a single movie. Genre is the key attribute for describing any movie in the simplest manner and helps generate interest and excitement among the audience. Hence, this attribute also plays a major role in movie success prediction and has been taken into consideration.

Figure 9 represents the duration of movies with respect to the number of movies in the dataset. Duration was indeed one of the major factors affecting the accuracy of the applied machine learning models and ensembles; therefore, it has been considered another major attribute for success prediction.

After considering all the attributes, such as month of release, duration, IMDb rating, genre, budget, actors, and directors, the correlational heat map shown in Fig. 10 was constructed. It showed the best output and combination from the complete dataset using the six major attributes: month of release, IMDb rating, duration, genre, budget, and hit/flop.

With the help of these plots and the correlational heat map, the task of scaling was done easily. A correlation matrix was then created showing the relationships among the different machine learning algorithms, as shown in Fig. 11. This correlation matrix was the basis for choosing the best-suited algorithms for the different ensemble techniques; the best combinations were studied and then implemented (Fig. 12).

Fig. 11 Correlational matrix for different models

Fig. 12 Structure of confusion matrix

In this study, various algorithms were applied. The dataset was divided into training and test sets for the implementation of the various algorithms; the proposed ensembles' training time varied between 3 and 5 s. The results provided in this section deal with the confusion matrix of each model along with the ROC curve obtained after implementing it. The confusion matrix is a table that gives the statistical outcome of a model: the diagonal elements represent the true-positive (TP) and true-negative (TN) values, while the other two represent the false-positive (FP) and false-negative (FN) values [18]. Thus, the accuracy score can be calculated by the formula marked as Eq. (1).

The receiver operating characteristic (ROC) curve [11] is a graphical analysis of two values: the true-positive rate, also referred to as "sensitivity", which can be calculated from Eq. (2), and the false-positive rate, referred to as "1 - specificity", as shown in Eq. (3). Specificity and sensitivity have an inverse relationship, i.e., one increases as the other decreases and vice versa. The area under the curve (AUC) is a characteristic of the ROC curve that depicts the precision of the applied model: the more the curve shifts toward the top-left corner, the higher the AUC value, indicating more correct predictions and higher model accuracy.

$${\text{Accuracy}}\,{\text{Score}} = \frac{{{\text{Correct}}\,{\text{Predictions}}}}{{{\text{All}}\,{\text{Predictions}}}} = \frac{{TP + TN}}{{TP + TN + FP + FN}}$$
(1)
$$TPR = \frac{{True\,Positive}}{{All\,Actual\,Positive}} = \frac{{TP}}{{TP + FN}}$$
(2)
$$FPR = ~\frac{{False~\,Positive}}{{All~\,Actual\,~Negative}} = ~\frac{{FP}}{{TN + FP}}$$
(3)

The error approximation in the applied models has been calculated in three forms: mean absolute error (MAE), mean square error (MSE) [19], and root mean square error (RMSE). The MAE depicts the average error between the predicted values and the original values, taking the magnitude of the difference, as given in Eq. (4).

$$MAE = ~\frac{1}{n}~\sum | ~y - \hat{y}|$$
(4)

The MSE or mean square error is the error calculated by taking the mean of the square of difference of predicted and actual value and can be calculated from Eq. (5).

$$MSE = ~\frac{1}{n}~\sum {(~y~ - ~\hat{y}~)^{2} }$$
(5)

The RMSE or root mean square error [20] is preferred over MAE and MSE in the case of large-scale errors. As the name suggests, RMSE is calculated by taking the square root of MSE, as shown in Eq. (6) and expanded in Eq. (7). The RMSE gives a higher weight to large errors as it squares the differences between predicted and actual values. Values closer to 0 are better, as all three are negatively oriented error measures (lower is better).

$$RMSE = ~\sqrt {MSE}$$
(6)
$$RMSE = ~\sqrt {\frac{1}{n}~\sum {(~y~ - ~\hat{y}~)^{2} } }$$
(7)
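These three error measures can be computed directly; a minimal sketch with scikit-learn, assuming y_test and y_pred hold the numeric actual and predicted labels (names hypothetical):

```python
# Sketch of the error metrics from Eqs. (4)-(6).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, y_pred)   # Eq. (4)
mse = mean_squared_error(y_test, y_pred)    # Eq. (5)
rmse = np.sqrt(mse)                         # Eq. (6)
print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}")
```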

K-fold cross-validation is a popular technique used to build a comparatively less biased model. It ensures that every subset of the dataset gets a chance to take part in both the training and the testing. K-fold plays an important role in terms of accuracy because it splits the dataset into k sections, each of which is used for both training and testing. This results in equal contributions from every section of the dataset, which helps to obtain better accuracy; a sketch is given below.
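A minimal sketch of the 10-fold validation used throughout this study ('model' stands for any of the classifiers above; shuffle and random_state are assumptions):

```python
# Sketch of 10-fold cross-validation: every section of the dataset
# serves in both training and testing, reducing evaluation bias.
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)
print(f"10-fold mean accuracy: {scores.mean():.4f}")
```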

Figures 13 and 14 represent the confusion matrix and ROC curve of the SVM model. The diagonal elements represent the true-positive and true-negative values, while the other elements show the false predictions. The ROC curve shows that the prediction output is merely satisfactory.

Fig. 13 Confusion matrix of SVM

Fig. 14 ROC curve of SVM

Figures 15 and 16 represent the confusion matrix and ROC curve of the KNN model applied in the study. The confusion matrix was created from the KNN model's predictions on a subset of the dataset. The ROC curve shows that the model is not satisfactory: the more the curve moves toward the top-left corner, the more accurate the predictions are.

Fig. 15 Confusion matrix of KNN

Fig. 16 ROC curve of KNN

Figures 17 and 18 represent the confusion matrix and ROC curve of the Naïve Bayes model applied in the study. The true-positive predicted values obtained with this model are quite better than the others, and the ROC curve shows moderate sensitivity and specificity.

Fig. 17 Confusion matrix of Naïve Bayes

Fig. 18 ROC curve of Naïve Bayes

Figures 19 and 20 show the confusion matrix and ROC curve of the random forest model. The true-positive and true-negative values were significantly good, and the resulting ROC curve shows better output than the other supervised learning algorithms.

Fig. 19 Confusion matrix of random forest

Fig. 20 ROC curve of random forest

The baseline models such as KNN, SVM, and the various other models applied in the study gave precise results, but ensembles were then applied to increase the accuracy further. The accuracy of the ensembles came out to be far better than that of the baseline models; an increase in AUC value along with improved ROC curves was also observed in the results shown below.

Figures 21 and 22 represent the confusion matrix and ROC curve of the gradient boosting model. The confusion matrix shows the best results, with the highest percentage of true values in comparison with the other models, and the ROC curve depicts the high accuracy of the model with high sensitivity and specificity.

Fig. 21 Confusion matrix of GBM

Fig. 22 ROC curve of GBM

Figures 23 and 24 depict the confusion matrix and ROC curve of the AdaBoost model. The figures illustrate that the results are much better than those of the other models, with a high AUC value as interpreted from the ROC curve, which leads to improved accuracy.

Fig. 23 Confusion matrix of AdaBoost

Fig. 24 ROC curve of AdaBoost

The increased true-positive values in the confusion matrix result in increased sensitivity in the ROC curve and a higher cut-off value. The cut-off value in the ROC curve is the point where "sensitivity + specificity - 1" is maximum.

Figures 25 and 26 show the confusion matrix and ROC curve of the XGBoost model. Its confusion matrix came out to be similar to that of AdaBoost, therefore showing better predictions than the other machine learning models. The ROC curve was also similar to that of the AdaBoost model, with a high cut-off value.

Fig. 25 Confusion matrix of XGBoost

Fig. 26 ROC curve of XGBoost

Figures 27 and 28 show the confusion matrix and ROC curve of the multilayer perceptron neural network (MLP-NN) model. The values obtained in the confusion matrix were satisfactory, giving an average ROC curve with moderate sensitivity and, in turn, a small increase in specificity.

Fig. 27 Confusion matrix of MLP-NN

Fig. 28 ROC curve of MLP-NN

Figures 29 and 30 represent the confusion matrix and ROC curve of ensemble 1, using the voting classifier with three machine learning models, i.e., SVM, Naïve Bayes, and KNN. The ROC curve depicts the moderate accuracy of the model applied.

Fig. 29 Confusion matrix of Ensemble 1

Fig. 30 ROC curve of Ensemble 1

Figures 31 and 32 show the confusion matrix and ROC curve of ensemble 2, combining SVM and KNN using the gradient boosting classifier. The ROC curve has less area under the curve, showing average results.

Fig. 31 Confusion matrix of Ensemble 2

Fig. 32 ROC curve of Ensemble 2

Figures 33 and 34 show the confusion matrix and ROC curve of ensemble 3, combining the SVM and KNN models using the AdaBoost classifier. The results obtained from the confusion matrix and ROC curve are middling, with moderate accuracy.

Fig. 33 Confusion matrix of Ensemble 3

Fig. 34 ROC curve of Ensemble 3

Thus, after the confusion matrices and ROC curves of all the algorithms were studied, the ROC curves were combined into a single plot, as shown in Fig. 35.

Fig. 35 Combined ROC curve of all the algorithms

After implementing these algorithms on our dataset and studying the ROC curves, the accuracy of and comparison among all the algorithms are given in Table 3.

Table 3 Results obtained from various models/algorithms

In recent times, various studies have been proposed on movie prediction, each predicting the success of movies using a different set of attributes best suited to the applied models. The datasets varied across these works, but our study achieved better outcomes working with the dataset of 1951 movies. A comparative analysis with previous works on the same subject is shown in Table 4.

Table 4 Table of comparison with existing work

In this study, gradient boosting gave the best results. Results were also obtained using the other ensemble learning techniques: XGBoost, for example, was used to train and test the dataset and gave an accuracy of 83.54% after applying 10-fold validation. Table 3 also shows the accuracy and AUC of the different algorithms with and without folds; both accuracies were calculated and compared for a better assessment.

The maximum accuracy was obtained using gradient boosting, i.e., 84.1297% without folds and 83.6518% after applying 10-fold validation, as shown in Table 3.

7 Conclusion

The cinema industry is one of the world's fastest-growing industries. Predicting a film's success at the box office before it is released is a crucial step that will aid the film industry's growth and development. A film's popularity depends not only on film-related factors but also on audience opinions. In this research, we used different machine learning algorithms and ensembles to construct various models to predict a film's success rate. Initially, the data was extracted and processed, creating a final dataset of 1951 movies with various attributes for prediction. KNN is the least successful algorithm for prediction, with an accuracy of 80.5461%. Gradient boosting, AdaBoost, and XGBoost show strong results, with gradient boosting being the best of all at an accuracy of 84.1297% and an AUC of 0.815. AdaBoost and XGBoost each have an accuracy of 82.9352%, with AUCs of 0.808 and 0.812, respectively.

For future improvements, more features such as user reviews, actors, and directors, which would also have a great impact on predicting the success of a movie, can be considered. The prepared dataset can also be enlarged so that better results can be obtained from the models. New ensembles will also be implemented in future work to check for further improvement in the success rate.