1 Introduction

The Internet is today's richest source of information, and nearly every field is uploading data to it at a rapidly growing pace. The movie industry likewise publishes extensive data on stars, directors, studios, critics' scores, ratings, and much more, which enables researchers to mine this big data and trace the hidden patterns it contains [23]. Producing a successful movie is not an easy task for moviemakers. Focusing on only a few factors, such as genre, casting, and starring roles, is not enough, given the varied tastes of audiences. Conventional factors (such as director, famous actors and actresses, genre, and budget) as well as non-conventional factors (such as trailer views on YouTube, likes on Facebook, and follower counts on Twitter) contribute to making a movie successful [29].

The Motion Picture Association of America (MPAA) reports that the growth of the movie industry is a global phenomenon. Because of the industry's influence on national economies, many studies have applied machine learning techniques to predict box-office success, and this predictive perspective gives such research an important role in the industry [24]. The industry releases thousands of movies each year. In the United States, it generates profits of up to 10 billion dollars, and a typical movie costs about 100 million dollars; yet despite these stakes, it remains uncertain whether a given movie will do business or not [19].

From a business perspective, the movie industry is one of the highest revenue-generating businesses. A single successful movie can earn a studio millions of dollars in profit, so moviemakers are keenly interested in early predictions, since a movie that becomes popular with the public also generates gross revenue from it [48]. Many people watch movies as a hobby and are passionate about them: movies are a major source of entertainment, and people love watching them in theaters as well. To serve this diverse audience, the industry produces thousands of movies of various genres (such as Action & Adventure, Mystery & Suspense, Science Fiction & Fantasy, Comedy, and Documentary) every year [23].

Hollywood is a land of intuition: a bulk of movies on varied interests and topics is released every year in the United States. It remains unclear to a studio whether a movie will be successful or not, which motivates predicting movie success before release [19, 47]. Cizmeci and Oguducu show that exposing the significant factors before a movie's release could aid box-office success: the producer and other filmmaking personnel can make better decisions to make the movie a hit. For example, if the movie becomes successful, more audiences will watch it in theaters, and revenue will surely increase [10].

Several studies have been conducted to predict the popularity of movies. Most rely on user rating scores, while others use social media such as Facebook, Twitter, and YouTube. In contrast, limited work has considered movie features such as release dates, Oscar-winning stars, director, studio, and runtime [23]. Tang and colleagues evaluated movies of the same genre from IMDb and DouBan (a Chinese social networking service). Their initial findings could not produce solid evidence that foreign language influences a movie's popularity, owing to data limitations. They later found that positive and negative sentiments serve as a robust indication of recommendation and could help in predicting a movie's popularity [51].

According to Wang and Zhang, movie genres play a vital role in popularity, because the industry bases its decisions on what types of movies customers of different ethnic groups like, rate, and favor. Producing a revenue-generating movie is the eventual goal of every studio, which depends on various market segments and customer preferences [53]. As for predicting the most wanted movies, Netflix's algorithm is the best example of the power of data analytics and mining in the movie industry, as it accurately predicts which movie an individual customer wants to watch next [16].

On the other hand, the availability of abundant movie data on the web invites knowledge discovery, data mining, and machine learning. Studios and film producers are unsure whether a movie will gain fame or do business in the future. They constantly consider how to market the movie, which target market to focus on, when to release it, and how to publicize it. This is why predicting a movie's performance before its release is of the utmost significance to the film industry [5, 19].

Lee and colleagues proposed a model that reduces the ambiguity in predicting the performance of a movie. They observed that past research using machine learning techniques reports a similarly high level of prediction accuracy, but that discovering prominent features may be more substantial for anticipating a movie's success. The predictive power of their model remained inadequate because they only varied the algorithms rather than concentrating on feature selection and extraction [24].

Quader and colleagues tested and compared seven machine learning algorithms to predict the box-office success of a movie, predicting the profit value from both pre- and post-release features. They also highlighted further features, such as audience size, the economic condition of a country, the law-and-order situation, total annual tickets sold, and Gross Domestic Product (GDP), that could improve prediction of a movie's box-office success [41].

The prime objective of past researchers was to introduce new machine learning algorithms and test their performance; although these efforts have contributed substantially to prediction accuracy, many other factors and perspectives could improve it further. For instance, it is worthwhile to explore hidden, unfamiliar, and unknown features. Feature selection and feature extraction from existing features are among the most commonly used methods to advance the accuracy and interpretability of machine learning algorithms [24].

Motivated by these previous studies, we aim to extract the relevant features from IMDb data to better understand the popularity of a movie, focusing on a feature-oriented approach to improving prediction accuracy. We further investigate statistical and machine learning modeling and compare them to identify which best fits the regression problem. The models identify different patterns in the data, which can be read as essential factors of prediction, and can quantify which predictors matter most for predicting a movie's popularity. Moreover, given the massive volume of movie data, it is possible to gather more features by fine-tuning the input parameters and criteria.

The rest of this article is arranged as follows. Section 2 reviews past studies on predicting the success of a movie. Section 3 describes the dataset collection and preprocessing used in this paper. Section 4 presents the proposed statistical methods and modeling, and Section 5 defines the machine learning regression models. Section 6 evaluates the results of the prediction models, discusses the various performance metrics, and analyzes predictive performance. Finally, Section 7 offers concluding thoughts and future work.

2 Related works

Early work in this area was steered by B.R. Litman [27], who explored attributes and their effects on box-office performance, examining critics' scores, genre, production cost, distributors, theater release date, and the award histories of actors. The movie industry has kept growing since Litman's study, and the success and popularity of movies has remained an exciting and emerging research area in which numerous articles have been published. Prag and Casavant [39] showed a keen interest in classifying the association between features such as marketing-campaign costs, MPAA ratings, sequels, and the success of a movie.

The authors of [7, 20, 38, 44, 54] identified two well-known problems in collaborative filtering: sparsity and cold-start. The sparsity issue occurs when insufficient user ratings and customer data are available; the accuracy of recommendations built on survey results from a limited number of users is lower than that obtained from a large number of examples. The cold-start problem arises when new movies or new customers do not yet have adequate information available in the recommendation system [25, 45].

Basuroy et al. [6] examined how critical reviews affect a movie's success, accounting for star power and budgets. The authors of [12, 14, 35] observed the association between an actor's star power and the performance of a movie. Many researchers have applied machine learning methods to content-based filtering, e.g., K-means, Neural Networks (NNET), and Naïve Bayes (NB). For instance, the NB classifier aims to identify whether an item is desirable by inspecting its attribute information [50, 57].

Predictions of the success, popularity, and business of movies have relied on machine learning techniques, which have produced prediction models with modest levels of accuracy [13, 15, 47]. For example, [47] implemented several machine learning algorithms, such as discriminant analysis, DT, logistic regression, and NNET, and inspected their performance in predicting a movie's success. The predictors used were the actor's star value, genre, MPAA rating, special effects, sequel status, competition level, and the number of screens on the opening day. Their best-performing model predicted nine output classes with 36.9% accuracy.

Zhang et al. [59] proposed a multi-layer backpropagation NNET that improved on the neural network model of [47]; their model correctly categorized six output classes with 47.9% accuracy. Eliashberg et al. [15] predicted a movie's return on investment from its script information alone using a DT. Zhang and Skiena [58] used electronic media articles to predict movie gross. Asur and Huberman used social media data and sentiment analysis to predict the future box-office revenue of movies [3].

Anticipating which movies are highly likely to succeed is one such research direction: [4] considered social media data, i.e., Twitter, to forecast movie success, and [33] utilized blog information to predict movie sales. Asad et al. [2] used data from IMDb and Box Office Mojo and, for prediction, implemented PART and C4.5 together with a correlation coefficient matrix as a measure. They formed two datasets, of pre- and post-release movies, and experimented with both.

Parallel work was carried out in [37], which focused on YouTube and Twitter comments for a similar objective. Mestyán et al. [32] used Wikipedia articles to predict the popularity of a movie, demonstrating that such articles can yield near-future outputs. They used Pearson's correlation coefficient and linear regression, took movie features such as genre, release date, stars, and director from Metacritic, and also used financial box-office data (i.e., opening revenue and budget).

Babu [5] used movie data from two online sources, IMDb and Rotten Tomatoes, plus Wikipedia, and implemented machine learning algorithms such as linear and logistic regression and the support vector machine (SVM). Du et al. [13] predicted box-office achievement by comparing the performance of three machine learning algorithms, i.e., linear regression, SVM, and NNET, on the sentiments and opinions of texts posted on Tencent Microblog.

Many researchers have drawn on different sources for predicting movie success, for example, social media, blogs, electronic media, print media, and publications, but research on the features of a movie itself remains limited [23]. Building on prior studies, some researchers have begun to conduct research of a predictive nature; most past studies of the movie industry have been descriptive and illustrative, inspecting aspects or features that disturb the box-office performance of movies [24].

Kim and colleagues applied lexicon-based sentiment classification and machine learning methods to predict the success of a movie. They built a sentiment dictionary using feature extraction and polarity assignment. Their findings showed a strong positive relationship between audience sentiment and box-office success; the relationship was also significant and improved prediction accuracy when using a linear regression model [22].

Wang and Zhang [53] used two approaches in their research: collaborative filtering and content-based filtering. The collaborative approach recommends items of interest to a specific user based on similarity to prior rating history, while the content-based approach is constructed from details of items and user preferences: it matches the user's likings with descriptions and features of new items.

2.1 Literature review

The related terminology used in past research for predicting the success of a movie is listed in Table 1.

Table 1 Summary of past techniques

3 Dataset collection and preprocessing

In this section, we explain the steps involved in data collection and preprocessing, an essential stage before applying machine learning methods, as shown in Fig. 1.

Fig. 1 General proposal

3.1 Data collection & extraction

The dataset used in our study was collected from the IMDb webpage and includes movies released from 1972 to 2014. To make predictions more accurate, we selected only English-language movies released in the United States that appear on Wikipedia's list of years in film pages, and excluded the rest. We also removed movies without any box-office information.

The resulting dataset consists of 651 randomly sampled movies that were produced and released. Because the data were randomly sampled, we can assume the generalizability of our conclusions. As this is observational rather than experimental data, there is no random assignment, so we cannot assume any causal relationship between the explanatory and response variables.

3.2 Data preprocessing

The data were obtained from a publicly accessible online database, IMDb, and needed to be cleaned, since data from such massive public sources are highly prone to noise and missing values [17]. Initially, our dataset consisted of 651 rows with the movie-related features listed in Section 3.3, Table 2. We cleaned missing values by ignoring incomplete observations: records with missing information, represented by "N/A" or left blank, were wholly deleted from the dataset to avoid skewing the results. This initial round of cleaning left 632 complete records.

Table 2 Proposed movie popularity features

3.3 Data integration & transformation

The next step is the integration and transformation of the data into one database, since the data come from heterogeneous sources; this allows us to run the statistical analysis and regression process more efficiently and quickly. Our dataset comprises both nominal and numeric attributes. For regression, all features need to be numerical, and we used the statistical programming language R (https://cran.r-project.org/) to accomplish this task. The list of anticipated features is shown in Table 2.

3.4 Discretization of the movie popularity

In this study, we define the prediction of movie popularity as a regression problem, an approach applied in a few earlier studies, e.g., [5]. We discretize the dependent variable (i.e., imdb_rating), which has continuous numeric values.

4 Research methodology

In this section, we describe the methodology behind the experiments we performed.

4.1 Exploratory data analysis

4.1.1 Selection of predictors

Having set a research question, we now choose which variables to include in our model and drop those that are not useful. Table 3 gives the reasons for rejecting the other predictors.

Table 3 Excluded predictors

After elimination, we are left with five nominal and six numerical features out of the 25 features shown in Table 2. Some of the selected features are ones widely used in past studies; we also nominated features that are statistically corroborated and sufficient to predict the popularity of the movie successfully [24].

We used R and RStudio (https://www.rstudio.com/) to convert the categorical/nominal features into numeric values by encoding them as binary features. A variable with more than two possible values is converted into n binary features, where n represents the number of values. For instance, genre, one of the features in this study, has eleven possible values, including Action & Adventure, Animation, and Art House & International [24].
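As a minimal sketch of this encoding (using a hypothetical toy data frame standing in for the real dataset), base R's model.matrix expands a nominal column into one binary indicator per level:

```r
# Hypothetical toy data: a nominal `genre` column plus a numeric rating
movies <- data.frame(genre  = c("Comedy", "Drama", "Horror"),
                     rating = c(6.1, 7.3, 5.4))

# One binary (0/1) indicator column per genre level, no intercept term
genre_dummies <- model.matrix(~ genre - 1, data = movies)
movies_numeric <- cbind(movies["rating"], genre_dummies)
```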

The following sub-sections describe the nominal features included in this study.

Genre

Genre is one of the simplest and most frequently used variables in predicting a movie's success [47]. In this study, we used the following eleven categories: ACTION & ADVENTURE, ANIMATION, ART HOUSE & INTERNATIONAL, COMEDY, DOCUMENTARY, DRAMA, HORROR, MUSICAL & PERFORMING ARTS, MYSTERY & SUSPENSE, OTHER, and SCIENCE FICTION & FANTASY. Information on movie genres is available on the IMDb webpage.

MPAA_rating

The MPAA rating is assigned to each movie under the film rating system used in the United States; it signifies the violence, sexual content, and language in a movie. There are six categories: G, NC-17, PG, PG-13, R, and Unrated [23].

Studio & director

Data about the studio and director of a movie could be useful in modeling, but the corresponding variables take too many values, e.g., WARNER BROTHERS PICTURES, 20TH CENTURY FOX, COLUMBIA PICTURES, DISNEY, HBO, and PARAMOUNT STUDIOS. Instead of using them directly, we divide directors and studios into four ranks, numbered 0 to 3. If the average rating of a studio's or director's movies falls into the first quartile of the distribution of imdb_rating, we assign "Rank 0"; "Rank 1" corresponds to the second quartile, and so forth. We need a function to determine the quartile of a value, and since the distribution is not normal, we cannot use the theoretical method of determining quartiles. Instead, we use the "ecdf" function in the R language.
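A minimal sketch of this ranking step, assuming a data frame named movies with studio and imdb_rating columns (the same logic applies to directors):

```r
# Empirical CDF of all ratings; the distribution is not normal,
# so quartiles come from the data rather than a theoretical model
rating_ecdf <- ecdf(movies$imdb_rating)

# Map an average rating to Rank 0-3 via its empirical quantile
quartile_rank <- function(x) pmin(floor(rating_ecdf(x) * 4), 3)

# Average rating per studio, then a rank for each movie's studio
studio_means <- tapply(movies$imdb_rating, movies$studio, mean)
movies$studio_rank <- quartile_rank(studio_means[as.character(movies$studio)])
```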

Best_pic_nom

This variable takes two possible values, "Yes" and "No." Table 4 details the nominal predictors used in this study.

Table 4 Summary of the nominal predictors

4.1.2 Selection of (predicted) response variable

We are interested in learning which attributes make a movie popular, so we have a few variables to choose from. The popularity-related variables are continuous numerical, and for the regression model we considered two candidate response variables:

  1. imdb_rating: Rating on IMDb

  2. imdb_num_votes: Number of votes on IMDb

Both of these look like legitimate measures of popularity, so we choose our response variable based on their distributions alone. We used the "ggplot2" library in R to draw the plots shown in Fig. 2; a code sketch follows.
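The plots in Fig. 2 can be reproduced along these lines, assuming the cleaned movies data frame from Section 3:

```r
library(ggplot2)

# Distribution of the two candidate response variables
ggplot(movies, aes(x = imdb_rating)) +
  geom_histogram(binwidth = 0.5) +
  labs(title = "Distribution of imdb_rating")

ggplot(movies, aes(x = imdb_num_votes)) +
  geom_histogram(bins = 30) +
  labs(title = "Distribution of imdb_num_votes")
```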

Fig. 2 Statistical distribution of the response variable

Figure 2 shows that imdb_rating is closest to a normal distribution, which should contribute to the robustness of the model, so it is our response variable.

4.2 Investigation and feature selection

Many criteria are available for feature selection (such as Backward Elimination, Forward Selection, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Deviance Information Criterion (DIC), Bayes factors, and Mallow's Cp). In this study, we used Backward Elimination with the adjusted R2 method to construct our model, as it is a common approach [52].

In this technique, we start with the full model and eliminate one variable at a time until a parsimonious model is reached [43]. In the end, features that are redundant or weakly correlated with the response variable are eliminated. The main steps of Backward Elimination using the adjusted R2 method [5] are shown in Fig. 3; a code sketch follows the figure.

Fig. 3 Feature selection using a statistical technique
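A minimal sketch of this elimination loop, assuming movies holds the response imdb_rating plus the candidate predictors:

```r
predictors <- setdiff(names(movies), "imdb_rating")

repeat {
  current <- lm(reformulate(predictors, "imdb_rating"), data = movies)
  current_adj_r2 <- summary(current)$adj.r.squared
  if (length(predictors) == 1) break

  # Refit once per predictor, leaving that predictor out
  trial <- sapply(predictors, function(p) {
    fit <- lm(reformulate(setdiff(predictors, p), "imdb_rating"),
              data = movies)
    summary(fit)$adj.r.squared
  })

  # Stop when no single removal improves adjusted R^2
  if (max(trial) <= current_adj_r2) break
  predictors <- setdiff(predictors, names(which.max(trial)))
}
```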

4.2.1 Explorations

After eliminating unwanted variables and choosing our response variable, we get a feel for the data using some summaries and the plots shown in Figs. 4, 5, 6, and 7. Figure 4 reveals that the relationship is not very strong, because some anomalies can be seen in the genre data: the black dots show outliers, which pull the mean of imdb_rating above 6.

Fig. 4 A box plot showing the relationship between genre and imdb_rating

Fig. 5 A scatter plot showing the relationship between runtime and imdb_rating

Fig. 6 A scatter plot showing the relationship between thtr_rel_year and imdb_rating

Fig. 7 A box plot showing the relationship between thtr_rel_month and imdb_rating

Figure 5 suggests a reasonable functional relationship: as the runtime of a movie grows longer, its imdb_rating tends to be higher.

Figure 6 shows the time factors, of which we have two, i.e., thtr_rel_year and thtr_rel_month. There appears to be a fan-shaped trend over the years, with variability growing slightly as years go along, but no apparent trend within the months. Figure 7 shows the months separately, using a box plot with outliers at or below an imdb_rating of 4; removing these outliers could shift the mean value of imdb_rating. We observed some differences, but they do not seem large enough to be significant without formal tests, which we perform in Section 4.2.3, Model diagnostics.

4.2.2 Statistical modeling

After this straightforward elimination of predictors, we ended up with a list of features and applied a multiple linear regression model to achieve a model with a high adjusted R2 value. The technique starts with the set of all features; at each iteration it checks the adjusted R2 value, and if removing one of the collinear predictor variables yields even a slightly greater value, that predictor is removed from the set.

In the end, this gives a robust model with the assurance that all remaining predictors are correlated with the response variable and that the redundant predictors have been eliminated. Table 5 gives a summary of the final model.

Table 5 Final multiple linear regression model interpretation

Regarding inference for the model, the p-value of the model's F-statistic indicates that the model as a whole is significant. Note that not all predictors have a significant p-value, as the model was developed by maximizing adjusted R2.

Interpreting the model coefficients, the coefficient for director_rank shows that each unit increase in its value raises imdb_rating by approximately 6%, with a very low p-value; similarly, each unit increase in studio_rank raises imdb_rating by approximately 1%, with a very low p-value, as shown in Table 6. We might prefer to look at an ANOVA table too:

Table 6 Analysis of variance

Here, we can see that all independent variables are significant predictors based on their p-values.
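A sketch of fitting and inspecting the final model in R; the retained predictor list is an assumption based on Tables 5, 6, and 8:

```r
final_model <- lm(imdb_rating ~ genre + runtime + mpaa_rating +
                    imdb_num_votes + dvd_rel_month +
                    director_rank + studio_rank,
                  data = movies)

summary(final_model)  # coefficients, adjusted R^2, F-statistic (Table 5)
anova(final_model)    # analysis of variance (Table 6)
```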

4.2.3 Model diagnostics

Validity

For the multiple regression model to be valid, it must satisfy the following four conditions:

  1. There is a linear relationship between each numerical predictor variable (runtime, thtr_rel_year, thtr_rel_month, dvd_rel_month, imdb_num_votes) and the response variable (imdb_rating).

  2. The residuals are nearly normally distributed.

  3. The residuals display constant variability.

  4. The residuals are independent.

First, we examine whether the variables included in the model are linearly related to the response variable. Figures 8, 9, 10, and 11 demonstrate this graphically and satisfy the conditions stated above.

  • Condition 1: Linear relationship between each numerical predictor (x) and the response (y)

Fig. 8 Linear relationship between numerical predictors and response variable

Fig. 9 Distribution of residuals with mean zero

Fig. 10 Statistical model predicted values vs. residuals

Fig. 11 Representation of residuals vs. all predictors

Figure 8 examines the distribution of the residuals to check whether the numerical variables included in the model are linearly related to the response variable, imdb_rating. A residual is the difference between an observed value and the value predicted by the model. Figure 8 thus validates condition 1.

  • Condition 2 : Nearly normal residuals with mean zero

In Fig. 9, the histogram and normal probability plot demonstrate that the residuals are nearly normally distributed, satisfying condition 2.

  • Condition 3 : Constant variability of residuals

Figure 10 reveals the constant variability of the residuals and allows the entire model, with all explanatory variables, to be considered at once, satisfying condition 3.

  • Condition 4: Independent residuals (checking for any suspected time-series structure)

Figure 11 confirms that the residuals are independent: the plot of residuals against all the explanatory variables stays near the mean with no fan-shaped pattern.
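The four conditions can be checked with base R diagnostics along these lines, assuming the fitted final_model from Section 4.2.2:

```r
res <- resid(final_model)

par(mfrow = c(2, 2))
plot(movies$runtime, res,
     xlab = "runtime", ylab = "residuals")        # condition 1: linearity
hist(res, main = "Residuals")                     # condition 2: near-normality
qqnorm(res); qqline(res)                          # condition 2: normal Q-Q plot
plot(fitted(final_model), res,
     xlab = "fitted values", ylab = "residuals")  # condition 3: constant spread
```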

5 Construction of the machine learning regression model

The statistical analysis above confirms that our dataset is concrete and robust enough for machine learning techniques. Although machine learning algorithms build on the principles of statistics, performing statistical tests and modeling before applying machine learning techniques to the data is much better practice.

In this study, we used supervised learning, since the output of the response variable is known. We used five machine learning methods to build candidate models for predicting the popularity of the movie and compare the performance of the different methods.

5.1 Generalized linear model (GLM)

GLM works by maximum likelihood estimation (MLE), a well-known statistical principle. Its primary objective is to minimize the difference between the actual and forecasted values of the response variable, which is Gaussian distributed; this difference is called a residual [36].

GLMs are an extension of traditional linear models fitted with the well-known MLE technique. These models are fast and perform parallel computation, even with a small number of predictors with non-zero coefficients [42].
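As an illustrative analogue (the study itself used RapidMiner's GLM operator), an elastic-net regularized Gaussian GLM can be fitted in R with the glmnet package, assuming train and test splits as in Section 6.1:

```r
library(glmnet)

x_train <- model.matrix(imdb_rating ~ . - 1, data = train)
fit <- cv.glmnet(x_train, train$imdb_rating,
                 family = "gaussian", alpha = 0.5)  # mix of L1 and L2 penalties

x_test <- model.matrix(imdb_rating ~ . - 1, data = test)
pred <- predict(fit, newx = x_test, s = "lambda.min")
```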

5.2 Deep learning (DL)

Classic multi-level NNETs are typically used to learn non-linear relations, whereas DL is trained with "stochastic gradient descent using back-propagation," built on a multi-layer "feed-forward artificial neural network." It uses many hidden layers comprised of nodes with incorporated activation functions. DL has advanced features, such as "L1 or L2 regularization, adaptive learning rate, momentum training, and drop out," that enable high predictive performance [42]. As per Li and colleagues, DL builds on a neural network that takes data as input and produces the outcome through many layers [26].

DL begins with widespread hidden layers of nodes to produce the result, while a traditional neural network considers only a single hidden layer [55]. According to Schmidhuber, an old-style NNET requires more work on feature selection and more domain knowledge of the data; on the contrary, DL does not require substantial facts about the features [46]. Xing and Du confirm that DL can automatically tune and select the model at an optimum level, and has the built-in ability to mine features without any human participation, which saves much time [55].

5.3 Decision tree (DT)

A DT is a tree-like structure with two kinds of nodes, internal and leaf. DTs are mostly used to classify data whose output label is unknown, by following the route from root to leaf. A DT is built from training data comprising records, each formed from a set of features and an output label, covering either discrete or continuous values [21]. The DT is a distinguishable and straightforward structure: each node represents the splitting rule for a feature used to classify the target value. A DT can be applied to datasets with both nominal and numerical features; the response variable must be nominal for classification and numerical for regression [42].

5.4 Random forest (RF)

RF, introduced by Breiman, links numerous trees over the input variables in an ensemble. When new instances are classified, they are passed down the trees, and each tree states a classification [8]. From the cumulative votes cast by the group of trees, the forest then elects which label to assign to the new instance [1].

[31] applied this technique to predict the persistence of students in science and engineering disciplines. RF grows many random trees on different subgroups of the data, and the resulting model is built on the polling of these trees.

Due to this modification, it is less prone to overtraining. The minimal leaf size is 2 for classification and 5 for regression [42].

RF is thus an algorithm that combines randomly built, autonomous DTs to make predictions [8, 24]. Generally, RF delivers meaningfully improved performance and has an excellent ability to deal with irrelevant inputs [34].

5.5 Gradient boosted trees (GBT)

GBT offers parallel computing proficiency as well as an active linear model solver. These capabilities give it excellent performance and accuracy, and it is related to the Gradient Boosted Machine (GBM), another boosting algorithm. Moreover, it can form decision trees that are distinct logical models [11].

GBT is an ensemble of either classification or regression tree models. These are forward learning methods that attain predictive results through increasingly better estimations. By applying weak classification algorithms to gradually changing data, many DTs are created, producing groups of weak prediction models. Boosting trees enhances accuracy and performance but reduces speed and "human interpretability"; this process generalizes tree boosting to curtail these issues [42].

GBT executes similarly to Adaptive Tree Boosting (ATB), another boosting algorithm: at each iteration, it uses the residuals of the last prediction function [56]. GBT differs from ATB in using different measures, e.g., binomial deviance, to identify the cost of errors [9, 18]. When multicollinearity exists among the features and the number of features is comparatively large relative to the number of data points, GBT is usually considered robust [30, 40].

6 Evaluation

We used RapidMiner 8.1 to implement and test the machine learning methods mentioned above. Figure 12 shows the flow chart specifying the movie prediction. Of the many data mining and machine learning tools available, RapidMiner is well suited for data mining tasks and contains a vast collection of machine learning algorithms, with lists of operators (such as Blending, Cleansing, Modeling, and Validation) available for mining data.

Fig. 12 Classifiers training and testing algorithm flowchart

This section defines the training and test data, as well as the performance measures used in experiments. The last subsection comprises results and analysis.

6.1 Training and test data

Data regarding movies and users were collected from the publicly accessible IMDb. The imdb_rating available in our data represents the liking users expressed in their reviews [28].

The training data were obtained by a repeated random sub-sampling validation method, which reiterates the validation with random partitions of training and test data. This method avoids the k-fold cross-validation issue in which, as k grows, the size of the test data shrinks and the performance variance of each fold increases [49].

When the dataset is small, the impact of such an issue becomes pronounced. Since the size of the dataset in this study is limited, repeated random sub-sampling has proved more appropriate than k-fold cross-validation [24]. We therefore split the data into training and test sets in an 80:20 ratio.
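A minimal sketch of this validation scheme, using a linear model as a stand-in learner:

```r
set.seed(42)
rmse_runs <- replicate(10, {
  idx   <- sample(nrow(movies), size = floor(0.8 * nrow(movies)))
  train <- movies[idx, ]   # 80% training partition
  test  <- movies[-idx, ]  # 20% test partition
  fit   <- lm(imdb_rating ~ ., data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$imdb_rating - pred)^2))  # RMSE of this repetition
})
mean(rmse_runs)  # performance averaged over the random repetitions
```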

6.2 Performance measures

In this paper, we adopted RMSE as the main performance metric, following [28]; it is the most common metric used to "measure accuracy for continuous variables" (http://yahwes.github.io/) and is used to present the achievement of the various methods in this study. Lower values of RMSE are better. It is calculated using Eq. (1):

$$ RMSE=\sqrt{\frac{1}{n}\ \sum \limits_{j=1}^n\ {\left({y}_j-\hat{y_j}\right)}^2} $$
(1)

where n is the number of movie records, yj is the real user rating for the movie in record j, and \( {\hat{y}}_j \) is the predicted user rating.

We also used other performance metrics, i.e., absolute error, relative error, squared error, and squared correlation, given by Eqs. (2)–(5).

Absolute error (AE) is the mean absolute difference between the actual and predicted values.

$$ AE=\frac{1}{n}\ \sum \limits_{j=1}^n\mid {y}_j-{\hat{y}}_j\mid $$
(2)

Relative error (% error) is the absolute error expressed as a percentage of the actual value.

$$ \% error=\frac{1}{n}\ \sum \limits_{j=1}^n\frac{\mid {y}_j-{\hat{y}}_j\mid }{\mid {y}_j\mid}\ast 100 $$
(3)

Squared error is the average squared difference between the estimated and actual values.

$$ Squared\ error=\frac{1}{n}\ \sum \limits_{j=1}^n{\left({y}_j-{\hat{y}}_j\right)}^2 $$
(4)

Squared correlation (r2) is the square of the correlation coefficient r, computed by Eq. (5). It is a useful value in linear regression: it measures how close the data are to the fitted regression line and tells us what percentage of the variability in the response variable the model explains.

$$ {r}^2=\frac{explained\ variability}{total\ variability}=1-\frac{Residual\ sum\ of\ squares}{Total\ sum\ of\ squares}=1-\frac{SS_{res}}{SS_{tot}} $$
(5)
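All five metrics reduce to one-liners in R, assuming numeric vectors y (actual ratings) and y_hat (predictions):

```r
rmse    <- sqrt(mean((y - y_hat)^2))                      # Eq. (1)
ae      <- mean(abs(y - y_hat))                           # Eq. (2)
rel_err <- mean(abs(y - y_hat) / abs(y)) * 100            # Eq. (3)
sq_err  <- mean((y - y_hat)^2)                            # Eq. (4)
r2      <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)  # Eq. (5)
```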

6.3 Results and analysis

We implemented five learning algorithms; the results for each applied method, together with runtimes, are shown in Fig. 14. Figure 13 shows the weights (ranks) of the attributes, indicating the universal significance of each attribute for the value of the target attribute, independent of the modeling algorithm.

Fig. 13 Representation of ranks of the attributes
Fig. 14 Five learning models' summary of RMSE and runtime in milliseconds
Fig. 15 High accuracy achieving GLM model
Fig. 16 Performance of five machine learning algorithms with other performance metrics
Fig. 17 Subgroups of random trees of RF
Fig. 18 Snapshot of subgroups of GBT
Fig. 19 DT's segregation of features
Fig. 20 Cumulative predictive frequency distribution of five machine learning methods

The RMSE is shown for every method. The figures depict the models built with the machine learning techniques, the performance achieved by each regression learner, and the accuracy obtained under the repeated random sub-sampling validation technique, which randomly reproduces divisions of training and test data. Additionally, the results for each implemented method are shown in Table 7.

Table 7 Result of machine learning methods concerning performance measures

The above results demonstrate the proportion of cases we can predict suitably. Since lower RMSE values are always better, the best accuracy was achieved with GLM: 0.479, RF: 0.50, and GBT: 0.495, respectively; the other learners also attained good results, i.e., DL: 0.511 and DT: 0.545.

The GLM model has the highest accuracy among the candidate models and is the best model in this study, owing to its lowest RMSE value; it works on the MLE principle. The algorithm fits generalized linear models to the data by maximizing the log-likelihood, and the elastic net penalty can be used for parameter regularization. The model fitting computation is parallel and extremely fast, and it scales extremely well for models with a limited number of predictors with non-zero coefficients. The model was trained using regularization, with the data split in an 80:20 ratio using shuffled sampling, which builds random subsets of the training set, and with example weights selected in the performance parameter. This parameter allows example weights to be used for statistical performance calculations where possible and has no effect if no attribute has a weight role. Figure 15 illustrates the summary of the learning model.

6.3.1 Interpretation

The prediction of the model is 6.577. The important factors for prediction show which predictor carries the most significance, and in this case the most prominent support comes from director_rank. The RMSE of all predictions made by this model is 0.479, and the relative error is about 5.28%. Table 8 lists the predictors in order of their importance for the prediction. Figure 16 displays the other performance metrics discussed in subsection 6.2: the absolute error, relative error, squared error, and squared correlation of the five machine learning algorithms.

Table 8 Important predictors for prediction

Comparing the other performance metrics shows that GLM, RF, and GBT remain the highest-accuracy models, with GLM maintaining its lead by attaining a squared correlation above 76%.

Furthermore, RF and GBT are ensembles of randomly built, autonomous DTs and of classification or regression tree models, respectively. Figures 17 and 18 exhibit snapshots of random trees on different subgroups of the data. Figure 19 demonstrates the DT structure, which is distinct and straightforward. Finally, Fig. 20 shows the overall prediction distributions of the five machine learning models used in this study; all of them are nearly normally distributed, as shown in Table 9. The three tree-based learners behave as follows (see the code sketch after this list):

  • Random forest (RF): RF works on a bagging technique: it is a combination of trees, each of which randomly selects predictors at each possible split. It creates a bootstrapped dataset of the same size as the original by drawing random samples from it, then grows DTs on the bootstrapped dataset using only a random subset of the features (columns) at each step. Since this study deals with regression, the leaf nodes hold the predicted values of imdb_rating.

Table 9 Distribution of predictions
  • Gradient boosted tree (GBT): GBT also works on ensembles of DTs, but typically uses a boosting algorithm, which converts weak predictions into strong ones. Moreover, it differs from RF: RF uses classification DTs, whereas GBT uses regression trees for prediction, since our predicted outcome is a real number, i.e., imdb_rating.

  • Decision tree (DT): DT considers all the features (columns) of the dataset and picks one feature as the root. How is the root attribute picked? It is selected according to the values given for that attribute, comparing and counting which split earns the higher votes.
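As a sketch (not the RapidMiner setup used in the study), the three tree-based learners have direct R analogues, assuming the train and test splits from Section 6.1:

```r
library(rpart)         # single decision tree
library(randomForest)  # bagging plus random feature subsets
library(gbm)           # gradient boosted regression trees

dt  <- rpart(imdb_rating ~ ., data = train, method = "anova")
rf  <- randomForest(imdb_rating ~ ., data = train, ntree = 500)
gbt <- gbm(imdb_rating ~ ., data = train, distribution = "gaussian",
           n.trees = 500, interaction.depth = 3)

pred_dt  <- predict(dt,  newdata = test)
pred_rf  <- predict(rf,  newdata = test)
pred_gbt <- predict(gbt, newdata = test, n.trees = 500)
```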

7 Conclusion and future works

This study demonstrates how to predict the popularity of a movie. We implemented a machine learning approach along with statistical modeling for our investigation; machine learning offers plenty of robust algorithms for classification and regression. The primary objective of this research was to improve upon and compare previous studies. After performing the regression, our models predicted the popularity of the movie with accuracy, in terms of squared correlation (SC), of GLM: 76.6%, DL: 74.6%, DT: 71.0%, RF: 74.2%, and GBT: 75.5%, respectively. The features that contributed the most significant support are director_rank, studio_rank, genre, runtime, mpaa_rating, imdb_num_votes, and dvd_rel_month. The essential support comes from director_rank, an important predictor as shown in Table 8 above, which confirms that the director is the most significant attribute for the popularity of a movie and must be taken into consideration.

Furthermore, it is hard to perform data mining on IMDb because of the many attributes relating to a movie across a variable scope. Our study has implications both statistical and practical. To our knowledge, our research is one of the few studies that have focused on the feature aspect: we chose features based on statistical techniques and criteria. Most forecasting studies using machine learning emphasize augmenting predictive power, i.e., they focus only on building a better-performing model regardless of which features the model considers for a better outcome, which raises questions about the black-box nature of machine learning techniques. By identifying what features to include based on statistical theory, we can answer such criticism.

The predictive model presented here may be used to predict the imdb_rating of a movie. It should be noted that the model is based on a small sample, and some studios and directors were not sufficiently represented in the dataset, which may decrease the usefulness of the model for those particular types of movies. Another shortcoming is the limited number of variables we were able to retain in our final model. A more extensive training set with additional features is a key avenue that may improve the overall performance of the model.

We foresee our future research on movie popularity moving in three main directions. First, we would like to experiment with approaches in which adequate optimization parameters and criteria can be considered to improve the accuracy of our model. Second, although the machine learning methods implemented in this study are entirely appropriate and comprehensive, many further techniques can be explored and applied to the prediction problem in the movie domain. Third, other features could be incorporated to construct a more accurate model. We believe these recommendations could improve the prediction accuracy of movie popularity.