Keywords

1 Introduction

In recent times, big data is becoming one of the newest research interests in the areas of computer science and other related areas. With the possibility of a radical change in companies and organizations that use the information for improving the customer experience and transform their business models. Big data has several features which are volume, velocity, variety, value, and veracity. Big data is facing difficulties in managing using conventional tools, techniques, and procedures. Big data analytics is used for handling bulk quantities of data. It is used to mine and extract patterns, information, and knowledge from the data in an effective way. Big data analytics become an important trend for organizations and enterprises that are interesting in providing innovative ideas for enhancing and increasing their business performance and decision-making. RS are a group of techniques that allow filtering through large samples and information space in order to give suggestion to users when needed. Currently, RS are becoming highly popular and utilized in different areas such as movies, research articles, search queries, news, books, social tags, and music. Furthermore, there are other essential RS basically applicable for specialist, collaborators, funny story, restaurant and hotels, dresses, monetary services, life insurance, passion associates which give online dating services and several other social media such as Twitter, LinkedIn, and Facebook.

RS use a number of different technologies to filter out best suit results and provide to users to satisfy their information need. RS are classified into three broad groups which are content-based systems, collaborative filtering systems, and hybrid recommender system [1]. Content-based systems which try to test the behavior of the item which is labeled as recommended one. It works by learning the behavior of the new users based on their information need presented in objects whereby the user has rated. It is a keyword-specific RS where the keywords are used to illustrate the items. Thus, in a content-based RS, models work in such a way that they recommend users’ comparable items that have been liked in the past or is browsing currently. For instance, if a MovieLen user has to browse several comedies movies, then, the RS will classify those movies into the database as getting the most ratings on the comedy varieties. Collaborative filtering system is based on similarity measures between user’s information need and the items. The items recommended to a new user are those which were liked by other similar users in previous browsing history. Collaborative filtering algorithm uses an average rating of objects, recognizes similarities between the users on the basis of their ratings, and generates new recommendations based on inter-user comparisons. However, it faces many challenges and limitation such as data sparsity whose role is to the evaluation of large item set. Another limitation is hard to make prediction based on nearest neighbor algorithm, third is scalability in which number of users and number of items both increases, and the last one is cold start where poor relationship among like-minded people. To solve encounters, above mentioned, we moved to other approaches of collaborative filtering, and we landed up on model-based collaborative filtering [2]. Hybrid RS performs their tasks by considering the combining behavior of content-based and collaborative filtering techniques in such a way that it suits a particular item. Hybrid recommended system is regarded as the most frequently used RS system considered by many companies due to its ability to eliminate any weakness that might have arose when one RS is employed and in addition, its strength is the composite of more than two RS.

The main focus of this work is collaborative filtering system. It is well known that collaborative filtering could be described as a procedure whereby automatic prediction (i.e., filtering) about the interests of a user is made by gathering taste or preferences information from many users. The unexpressed assumption of the collaborative filtering approach can be best explained, viz., supposing a person A has similar opinion with person B on a particular issue, the assumption is that person A will be more likely to have the same opinion as person B on a different issue X did the opinion on X of a person chosen randomly [3]. Take for an instance the movie “RS” depicted in Fig. 1 which started with a matrix whose entries are movies rated by users. Both user (shown in green) and a particular movie (shown in blue) are represented each by column and rows respectively. Owing to the fact that not all users have rated all movies, all the entries in the matrix are unknown, which necessitate the need for collaborative filtering. There are ratings for only a subset of the movies for each user. With collaborative filtering, the idea is to approximate the rating matrix by factorizing it as the product of two matrices. That is the one that describes properties of each user (shown in green), and the other describing properties of each movie.

Fig. 1
figure 1

Low rank factorization matrix [3]

The minimization of the error for the users/movies pairs was chosen as the basis for the selection of the two matrices. The alternating least squares algorithm (ALS) which achieves this by randomly filling the user’s matrix with values before optimizing the value of the movies was used for this purpose. The value of the user’s matrix is optimized with the movie’s matrix being kept constant (Fig. 1). Owing to a fixed set of user factors (i.e., values in the user’s matrix), known ratings are employed to find the best values by optimizing the movie factors, written on top of the figure. The best user factor with the fixed movie factors is sleeted. This paper, reports for the first time, a movie recommendation system based on collaborative filtering using apache spark. The performance analysis and evaluation of proposed approach are performed on a MovieLens dataset. From the results obtained, it is concluded that the selection of parameters of ALS algorithms can affect the performance of recommender engine to be used.

The remainder of this paper is organized as follows: related work is provided in Sect. 2. Section 3 introduces the proposed movie recommender system using collaborative filtering with ALS algorithm while the experimental study is introduced in Sect. 4. Finally, the paper conclusion is presented in Sect. 5.

2 Related Work

So far, several researchers introduced and presented research in the area of building recommendation systems. Wei et al. [4] proposed a hybrid recommender model to address the cold start problem, which explores the item content features, learned from a deep learning neural network and applies them to the timeSVD++ CF model. A hybrid recommendation model is proposed which combines a time-aware model timeSVD++ with a deep learning architecture SDAE to address the cold start problem of collaborative filtering recommendation models. Kupisz and Unold [5] developed and compared item-based collaborative filtering algorithm using two cluster computing frameworks normally Hadoop’s disk-based MapReduce paradigm and Spark’s in-memory based RDD paradigm. In order to enhance the reliability, scalability, and to improve processing ability of large-scale data, Zeng et al. [6] proposed PLGM. In their work, two matrix factorization algorithms were considered, which are ALS and SGD. The parallel matrix factorization based on SGD was implemented on spark and was compared with ALS in MLib for its performance. The advantage and disadvantage of each model based on test results were analyzed. A variety of profile aggregation approaches were studied and the model which gives the best result was adopted. Models such as PLGM and LGM were studied in terms of efficiency and accuracy. Dianping, Lakshmi et al. [7] used item-based collaborative filtering techniques. In this method, they first inspect the user item rating matrix and they categorize the relationships among different items, and they utilize these relationships so as to figure out the recommendations for the user. A new concept namely movie swarm mining was proposed by Halder et al. [8] using format frequent item mining and two pruning rules. It addresses the problem of item recommendation and thus gives an idea about the user interests and famous movies trend. This technique can be very helpful for movie producers to manage their new movies. In addition to this, a new algorithm was proposed to recommend movies to a new user. A scalable method for building recommender systems based on similarity join has been proposed by Dev et al. [9]. MapReduce framework was used to design the system in order to work with big data applications. The unnecessary computation overhead such as redundant comparisons in the similarity computing phase can significantly be reduced by the system using a method called extended prefix filtering (REF). Chen et al. [10] used co-clustering with augmented matrices (CCAM) to design several methods including a heuristic scoring, traditional classifier, and machine learning to build a recommendation system and integrate content-based collaborative filtering for a hybrid recommendation system. Similarly, a collaborative filtering algorithm based on the ALS, as a powerful matrix decomposition algorithm, has been proposed by Wilkinson and Schreiber [11]. They found out that it can be awesome to extend to the distributed computing and solve the data sparse problem.

3 Proposed Movie Recommender System

This section provides the idea of the proposed system. The proposed system is a movie recommender system based on ALS using Apache Spark. The novelty of this work is based on the selection of parameters of ALS algorithms that can affect the performance of building of a movie recommender system.

3.1 Proposed System Block Diagram

In this work, we apply user’s ratings from the datasets the popular website like IMDB, Rotten Tomatoes, MovieLen, and Time Movie Ratings. This dataset is available in many formats such as CSV file, text file, and databases. We can either stream the data live from the websites or download and store them on our local file system or HDFS. Spark streaming is used to stream real-time data from the various source like Twitter, the stock market, and geographical system and perform powerful analytics to businesses. It used for processing real-time streaming data. We use collaborative filtering (CF) to predict the ratings of users for particular movies based on their ratings for other movies. Then collaborate this with another user’s rating for that particular movie. We train the ALS algorithm using MovieLen data and get the results from the machine learning model. We use spark SQL’s data frame, dataset, and SQL service to store the data. The result of the machine learning model is stored in RDBMS so that the web application can display the recommendation to a particular use. The results of the movie recommendation system are stored in our local drive. We store the recommendation movies along with the ratings in a text file and CSV file formats. We prefer storing the result into an RDBMS system so as to access it directly from the web application and display recommendation and top movies as shown in Fig. 2.

Fig. 2
figure 2

Proposed movie recommendation system using CF with ALS

3.2 Proposed System Steps

This subsection provides the steps of applying the ALS algorithm on MovieLens datasets for train and test the selection of best parameter when building a movie recommendation system.

4 Experimental Study

This section presents the experimental setup and results in discussion and analysis.

4.1 Apache Spark

Apache Spark [12] is a rapid and general-purpose cluster computing system. It introduces high-level application programming interfaces (APIs) using programming languages such as Java, Python, Scala, and R, and has an engine that supports general execution graphs. It also supports a good set of higher level tools involving Spark SQL for structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time applications. It was built on top of Hadoop and MapReduce and extends the MapReduce Model to efficiently use more types of computations. Spark application runs as a separate set of process on the cluster. All of the distributed processes are coordinated by a SparkContext object in the drive program. SparkContext connects to one type of cluster manager (standalone/Yarn/Mesos) for resource allocation across clusters. Cluster manager provides executors, which are essentially JVM process to run the logic and store application data. Then the SparkContext object sends the application code (jar files/python scripts) to executors. Finally, the SparkContext executes tasks in each executor.

4.2 Data Preprocessing

The dataset which is used in this work is MovieLens dataset. This dataset contains 24 million ratings and 670,000 tag applications applied to 40,000 movies by 260,000 users. This dataset contains three files called ratings.csv, movies.csv and tags.csv. ratings.csv contains tree column (userId, movieId, rating). While movies.csv contains movieId, title, genres. The genres have the format: Genre1, Genre2, Genre3. The tags file (tags.csv) has the format: userId, movieId, tag, timestamp and finally, the links.csv file has the format: movieId, imdbId, tmdbId. We can split the data into three portions which are training, validation, and test data to parse their lines once they are loaded into RDDs. Parsing the movies and rating files yields two RDDs: For each row in the ratings dataset, we have created a vector of (userId, movieId, rating). During preprocessing, we have dropped the timestamp attribute because we do not need it for this recommender. Similarly, each row in the movies dataset, we have created a vector of (movieId, title). We have dropped the genres attribute because we do not use it for this recommender.

In order to determine the best ALS parameters for our experiments, we need to break up the ratings RDD dataset into three pieces as follows: a training set which we will use 60% of the data to train models, validation set, which used 20% of the data to choose the best model and test set, which used 20% of the data for our experiments to randomly split the dataset into the multiple groups.

4.3 Experimental Environment

The test has been done on a machine which contains the subsequent descriptions P. A machine with Ubuntu 14.04 LTS, 4 GB memory, and Intel® Core™ i5-2400 CPU @ 3.10 GHz × 4 processor as well as a hard disk of 500 GB. In this machine, Apache Spark with version 2.1.1 is installed and is used to develop the proposed system. The dataset which is used in research work is MovieLens dataset [13]. In the proposed model, root mean squared error (RMSE) is used as a performance measure. RMSE works by measuring the difference between error rate a user gives to the system and the predicted error by the model. Equation (1) depicts how RMSE works on movie recommender system.

$$ {\text{RMES}} = \sqrt {\mathop \sum \limits_{i = 0}^{n} \frac{{\left( {x_{ui} - y_{gi} } \right)^{2} }}{n}} $$
(1)

whereby xui is the rating that user u gives to an item i in the experimental data, ygi is a predicted rating that the movie that user u gives to an item and where n is the number of ratings in the test data.

4.4 Experimental Results Analysis and Discussion

Recommender system (RS) is becoming growingly popular. In this work, Apache Spark is used to demonstrate an efficient parallel implementation of a collaborative filtering method using ALS. ALS is used for dimensionality reduction purpose which helps in overcoming the limitations of collaborative filtering such as data sparsity and scalability. The challenges of data sparsity are appearing in numerous situations, specifically, another problem, when a new an item or user has just added to the system, it is difficult to find similar ones since there is no sufficient information, this problem is called cold start problem [14, 15]. When selecting the ALS algorithm as a part of building the proposed movie recommender system, there is basic parameter through them can determine the best rating of users for given movies. These parameters are Rank, Iterations, and Lambda.

The contribution of this paper is to study and determine the selection of parameters that affect the performance of ALS model in building a movie recommender system because from literature study, it is found that little research work focused on the study of the selection of ALS’s parameters that can affect its performance in building a movie recommender engine using Apache Spark. The parameters, lambda, and iterations are used in order to control and adjust the predicting capability of matrix factorization which is depending on ALS technique which in turn affect the evaluation of movie RS. The iterations and lambda parameters are used as follows: Lambda which specifies the regularization parameter in ALS and iterations in which the proposed model should run the specified number of iterations. The ALS algorithm achieves its optimal solution between 5 and 20 iterations.

The parameters lambda and iteration in ALS model are used with different thresholds to realize the effects of matrix factorization performance on the performance of recommendation results and thus take the most appropriate parameters for the following test setups. Tables 1, 2 and 3 show the performance of movie recommendation engine based on ALS under different values of lambda and iteration. Table 1 illustrates the execution of time with the changes of lambda with iterations parameters of ALS model, while Table 2 the rank of best-trained model with the changes of lambda with iterations parameters of ALS algorithm, finally Table 3 indicates the RMSE with the changes of lambda with iterations parameters of ALS model. The results presented in Table 1 indicate that when lambda is set to 0.6 and iteration set is 10, the time value is minimum which is 1.41323 s, and rank value is 8 as shown in Table 2. Moreover, the RMSE register for this rating is 1.07424 as indicated in Table 3. On the other hand, as it is indicated in Table 1 when lambda is set to 0.2 and iteration is 15, running time becomes 1.463743 and rank is 12 for this item as shown in Table 2. The RMSE value for this item is the minimum, which is 0.9167, as presented in Table 3.

Table 1 Time of matrix factorization using lambda and iteration parameters
Table 2 Rank of matrix factorization using lambda and iteration parameters
Table 3 RMSE of matrix factorization using lambda and iteration parameters

As mentioned above, the analysis for movie recommendation system is done using three quality metrics which are RMSE, time, and rank. Using these three metrics, two cases are achieved as shown in Table 4, case 1 with high time and low RMSE rate while the case 2 with low time and high RMSE rate. According to results in Table 4, the prediction for Top 25 movies is shown in Figs. 3 and 4.

Table 4 Two cases for selecting parameters for ALS
Fig. 3
figure 3

Prediction of top 25 movies for case 1

Fig. 4
figure 4

Prediction of top 25 movies for case 2

In general, the lowest value of the RMSE is considered the best case for prediction in building recommendation system. Therefore, we will adopt the second case because the value of the RMSE is smaller compared to the value in the first case as well as adopt the second case as the best case because there is no significant difference in the amount of time execution between the two cases. Now, we can get the top recommended movies by using the second case. Finally, we concluded that from these results the best case is the second case which has the best value for RMSE, which can be useful for building recommendation engines for predicting the top 25 ranked movies.

5 Conclusion and Future Work

Movie recommender system plays a significant role in identifying a set of movies for users based on user interest. Although many move recommendation systems are available for users, these systems have the limitation of not recommending the movie efficiently to the existing users. This paper presented a movie recommender system based on collaborative filtering using Apache Spark. From the results, the selection of parameters of ALS algorithms can affect the performance of building of a movie recommender engine. System evaluation is done using various metrics such as execution time, RMSE of rating prediction, and rank in which the best model was trained. Two best cases are chosen based on best parameters selection from experimental results which can lead to building god prediction rating for a movie recommender engine. From these cases, the lowest value of the RMSE is considered the best case for prediction in building movie recommendation system. Therefore, the second case is recommended to be used since the value of the RMSE is smaller compared to the value in the first case as well as adopt the second case as the best case, because there is no significant difference in the amount of time execution between the two cases. Finally, we concluded that from these results that the best case is the second case which has the best value for RMSE, which can be useful for building recommendation engines for predicting the top 25 ranked movies. In the future work, we plan to develop and improve a new loss function because of the shortcomings of the recommender system algorithm based on ALS model based on the parameter of the best case which has the best value for RMSE using Apache Spark.