1 Introduction

Recommendation systems are now widely deployed in industrial projects across many different domains [1]. The most common, and most demanding, requirement of these systems is to understand what users think and expect. For example, in a store with millions of products on display, the recommendation system relies on explicit user behaviors (e.g., reviews and ratings) and implicit ones (e.g., clickstreams) to provide customers with the most relevant products.

However, how can we be sure that the recommendations of the system yield the desired results? If the system does not understand what users think about each product, it will never be able to give them the best experience. We therefore consider it essential to understand the opinions and feedback of users, so that the system can make suggestions that are accurate and suited to their needs.

In particular, we focus on the movie domain, where there are many ways to get feedback from users who have a lot of experience in watching movies. Most movie recommendation systems (e.g., IMDB, MovieLens, and Netflix) collect information by asking users to rate movies and to write down their personal opinions. However, many users do not want to write a careful review because it takes time, and some users do not know how to express their opinion clearly and thoroughly.

We propose a new approach that collects cognitive feedback from users by capturing how they select similar movies. This lets us recognize and predict the patterns users follow when selecting similar movies. Once a large amount of such feedback is available, statistical analysis methods can help explain why users think two movies are similar. To answer that question, we aim to build a crowdsourcing platform [2] that collects a large dataset of this user feedback. We propose a crowdsourcing-based recommendation platform called OurMovieSimilarity (OMS). The movie data presented to users for selection are either collected automatically from the IMDB open database or added to OMS by users themselves through a personal suggestion function.

The challenge is that the system has to interact with users efficiently and be easy to use. We decided that the layout and workflow of OMS must be engaging and straightforward [3]. The data collection process is as follows:

  • First, a new user chooses a movie that they have watched and marked as a favorite.

  • Second, OMS suggests a list of five similar movies, calculated by the system using an α parameter that represents the user’s selection trend and is initially set to 1.

  • Finally, the user selects, from the five suggested movies, one that they think is similar to the movie chosen in the first step.

The movie selection runs as a loop in which the \(\omega\) parameters are updated dynamically to move closer to the user’s selection trend. Computing the suggestions with the \(\omega\) parameters reduces the time users need to choose a similar movie, so they do not have to refresh the suggestions many times to find a similar pair of movies \(m_{i }\) and \(m_{j }\). The collected data are applied to movie similarity recommendation, as described in Sect. 4.2.

This chapter is organized in five sections. Section 1 is the introduction, and Sect. 2 discusses related work on crowdsourcing platforms. In Sect. 3, we address a cognitive approach to recommendation. In Sect. 4, we describe our experience with the OurMovieSimilarity platform, including an overview, the functions of OMS, and the data analysis. Finally, in Sect. 5, we conclude and make some remarks.

2 Related Work on Crowdsourcing Platforms

The term crowdsourcing, a portmanteau of crowd and outsourcing [5, 6], first appeared in 2006 in the article “The Rise of Crowdsourcing”, which Jeff Howe published in Wired [4]. The article describes simply what crowdsourcing is and how it is made possible by technological advances. From an online application perspective, online crowdsourcing platforms are increasingly being used to capture ideas from the crowd. Global companies are adopting crowdsourcing ideas to connect with users and get feedback from them. The success of a crowdsourcing platform largely depends on its members and their motivation to participate, since motivation determines the quality and quantity of contributions [7].

Crowdsourcing has become a useful tool for understanding audience preferences and anticipating needs [8], which is very important for businesses that depend on innovating or enhancing their products, such as fashion brands, food manufacturers, or restaurants [2]. For example, Starbucks provides a simple form where people can suggest new ideas, enhance existing services, or request product deliveries. More complex crowdsourcing platforms, such as Amazon MTurk and CrowdFlower [9], offer jobs such as data categorization, metadata tagging, character recognition, voice-to-text transformation, data entry, email harvesting, sentiment analysis, ad placement on videos, or surveys.

ResearchGate [10] was founded in 2008 by Ijad Madisch, who aims to transform the way researchers do their research [11]. ResearchGate was first headquartered in Boston and is currently based in Berlin, Germany, with investment from several US venture capital companies. ResearchGate now has more than 7 million members, with an average of seven researchers signing up per minute. The system is a crowdsourcing platform where scientists and researchers ask and answer questions, find collaborators, and share papers. Although reading an article does not require registration, those who want to become members need an email address at an accredited institution or must be manually verified as a published researcher in order to register. Each member of the site has a user profile and can upload research results, including papers, data, chapters, negative results, patents, research proposals, methods, presentations, and software source code. In addition, users can follow the activities of other users and participate in discussions with them. Users can also block interactions with other users.

ResearchGate’s success has allowed researchers to disseminate their ideas and share their publications for free, so that they can build collaborations with researchers from around the world. Through this crowdsourcing platform, members can maintain their publications, ask and answer research-related questions, and follow other researchers to receive their publication updates. The system is capable of tracking the activities of researchers identified as related to each other (such as co-authors of an article, project, or chapter) and sends automatic notification emails about updates from associated members. The system also summarizes the joint activities of a research group into a single email to limit the number of duplicate emails sent to each researcher.

3 A Cognitive Approach to Recommendation

The first question we asked was, “Why do users think two movies are similar?” Recent advances in psychology and cognitive sciences support the notion that people use a dual-process model, whereby a sense of similarity is built on a combination of feature-based taxonomic relations and relationship-based thematic relations [12]. Taxonomic, or hierarchical, relationships are based on internal characteristics, such as the features of the items themselves, while a thematic relationship is external: a separate event or scene that connects the two items. For example, pen and pencil are taxonomically similar since they share many features: both are used for writing and both belong to the category of “stationery”. Pen and paper are thematically similar since they are often used during the same event.

In recent years, there has been a massive increase in electronic service, e-commerce, and e-business applications operating on the Internet [13]. Recommendation systems are increasingly being used by application vendors to make recommendations to their customers [14]. However, most traditional recommendation systems primarily focus on extracting and recommending general preferences based on historical user data [15]. Although the interests shared by users in general can be considered relevant, an individual user also has personal preferences, and can also draw on domain expert knowledge to some extent when making a decision. Moreover, while using traditional recommendation systems, users often cannot readily distinguish whether the items on the page are actual recommendations or merely content displayed indiscriminately to all users. Therefore, traditional recommendation systems do not give customers the feeling of being treated individually. Jeff Bezos, CEO of Amazon.com, remarked that if he had 3 million customers on the Web, he should have 3 million stores on the Web [15]. Personalized recommendation agents are emerging to overcome the impersonal nature of aggregate recommendations by using technology to treat each customer individually and assist them in making decisions [16].

4 Experiences on OurMovieSimilarity Platform

4.1 Overview

We created the OurMovieSimilarity (OMS) platform, a crowdsourcing system that collects similar-movie pairs from a large amount of cognitive data on how users select similar movies. OMS collects movie data from the data warehouse provided by IMDB and lets users choose similar movies, starting from movies they have seen and their favorite ones. Figure 1 shows the main features of the OMS system: data collection from the internet and processing of the data obtained from users’ perception of similar movies.

Fig. 1 The overview of the OMS crowdsourcing platform

The OMS system is built in the Java programming language, combined with a MySQL database. OMS contains two services: a web service and a background service. The system is designed according to the Model-View-Controller (MVC) pattern to ensure security, and it uses multiple processing flows to handle concurrent access and data-intensive tasks. We implement the web service side with the Apache Tomcat Servlet Container and Java Server Pages technology. To connect the two services, we use the Apache Thrift framework. Specifically, we have two methods of collecting movie data: the first collects data automatically from the IMDB database, and the second relies on users adding movies. Then, OMS collects data from users’ selections of similar movies. Finally, analyzing the collected data yields a new dataset of user perceptions.

The primary goal of OMS is to be simple and efficient. Because OMS is a web-based crowdsourcing platform, data must be loaded from the system and displayed to users with the lowest possible latency. Another major concern for a web-based application is making sure that users have enough instructions to operate the entire system. To address these issues, we apply the concept of progressive disclosure to OMS, whose main idea is to “show users what they need when they need, and where they want”. During this period, we focused on increasing the amount of user feedback as efficiently and accurately as possible. To keep the system running smoothly and improve the user experience, we make user interaction as simple as possible and system responses as fast as possible.

Notably, we apply one template to all pages to keep the user interface consistent, so that users easily recognize similar elements (e.g., buttons, functions) while they operate OMS. We apply the same principle to all actions on the system: across all the different steps, we guarantee the same interface behavior. All of these features are based on the three golden rules of user interface design [17].

4.2 How Can OMS Interact with Users Intelligently

After users select movies that they have seen and liked, OMS calculates and suggests a list of five movies similar to their selection. Let \(m_{1}\) be a movie that the user has seen and \(m_{2}\) be one of the remaining movies in our database. We calculate the score \(Sim\left( {m_{1} ,m_{2} } \right)\) by comparing each feature of the two movies. The score \(Sim\left( {m_{1} ,m_{2} } \right)\) is defined as follows:

$$Sim\left( {m_{1} ,m_{2} } \right) \equiv \left\langle {T,~G,D,A,P} \right\rangle$$
(1)

where T, G, D, A, and P represent the comparison of the title, genre, director, actors, and plot of movie \(m_{1}\) and movie \(m_{2}\). We repeat this calculation for all the remaining movies in OMS and obtain the set \(\left\{ {Sim\left( {m_{1} ,m_{2} } \right); Sim\left( {m_{1} ,m_{3} } \right); \ldots ;Sim\left( {m_{1} ,m_{n} } \right)} \right\}\). Then, we add an element that captures the trend of users when they select similar movies. Hence, the formula is rewritten as follows:

$$Sim\left( {m_{1} ,m_{2} } \right) = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \omega_{i } \times Sim_{i} \left( {m_{1} ,m_{2} } \right)}}{{\mathop \sum \nolimits_{i = 1}^{n} \omega_{i} }}$$
(2)

where \(\omega_{i}\) is the user’s trend of selecting similar movies based on feature \(i\), one of the \(n\) features (title, genre, director, actors, and plot), and \(Sim_{i}\) is the score between movie \(m_{1}\) and movie \(m_{2}\) based on that feature. To compute \(Sim_{i}\) by comparing the similarity of each feature, we use a string comparison algorithm based on the Jaccard similarity coefficient (originally coined by Paul Jaccard) [18]. We identify the union of the two sets (characters in at least one of the two sets) and their intersection (characters that are present in both sets). To generalize the calculation formula, we define the following:

$$Sim_{X} \left( {m_{1} ,m_{2} } \right) = \frac{{\left| {X_{{m_{1} }} \cap X_{{m_{2} }} } \right|}}{{\left| {X_{{m_{1} }} \cup X_{{m_{2} }} } \right|}}$$
(3)

where X is a feature of movie \(m_{1}\) and movie \(m_{2}\). In particular, the score for the title feature is as follows:

$$Sim_{T} \left( {m_{1} ,m_{2} } \right) = \frac{{\left| {T_{{m_{1} }} \cap T_{{m_{2} }} } \right|}}{{\left| {T_{{m_{1} }} \cup T_{{m_{2} }} } \right|}}$$
(4)

where \(T_{{m_{1} }}\) represents the title of movie \(m_{1}\), and \(T_{{m_{2} }}\) represents the title of movie \(m_{2}\). We perform this calculation for the features title, genre, director, and actor. The plot of a movie, however, is not a regular string to compare, so we pre-process it before scoring movie \(m_{1}\) against movie \(m_{2}\). We apply Term Frequency–Inverse Document Frequency (TF-IDF) to select high-weight terms from the plot of the movie. First, we count the number of times each word occurs in the plot, following the definition of Hans Peter Luhn (1957): “The weight of a term that occurs in a document is simply proportional to the term frequency” [19]. The number of times that a term occurs in the plot of a movie is denoted by the raw count. Second, some terms (the, a, an, and so on) are so common that term frequency alone tends to incorrectly emphasize plots that happen to use these terms more frequently, without giving enough weight to more meaningful terms. Hence, following the definition of Karen Sparck Jones, “The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs” [20], we use the inverse document frequency (IDF) to measure how much information a term provides.

Finally, by combining TF-IDF with a defined threshold \(\partial\) for filtering useless terms, we can find the high-weight terms in the plot of the movie. The formula is as follows:

$$P_{{m_{i} }} = \left\{ {t \mid f\left( {t,p_{{m_{i} }} } \right) \times \log \frac{N}{{\left| {\left\{ {p_{{m_{i} }} \in M :t \in p_{{m_{i} }} } \right\}} \right|}} \ge \partial } \right\}$$
(5)

where \(P_{{m_{i} }}\) is the set of all high-weight terms t in the plot of movie \(m_{i}\); t is a term that occurs in the plot of movie \(m_{i}\); \(f\left( {t,p_{{m_{i} }} } \right)\) is the raw count of term t in that plot; \(p_{{m_{i} }}\) is the set of all terms in the plot of movie \(m_{i}\); N is the total number of plots; M is the collection of all plots; and \(\partial\) is the defined threshold for filtering useless terms.
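The term filtering of Eq. (5) can be sketched in a few lines of Python. This is a minimal illustration and not the OMS implementation: the whitespace tokenizer, the natural logarithm, the function name `high_weight_terms`, and the toy plots are all assumptions made for the example.

```python
import math
from collections import Counter


def high_weight_terms(plots, threshold):
    """For each plot, keep the terms whose TF-IDF weight reaches the threshold (Eq. 5)."""
    # Assumption: plots are tokenized by lowercasing and splitting on whitespace.
    tokenized = [plot.lower().split() for plot in plots]
    n_plots = len(tokenized)
    # Document frequency: number of plots in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    selected = []
    for tokens in tokenized:
        raw_count = Counter(tokens)  # term frequency as a raw count
        selected.append({
            term
            for term, count in raw_count.items()
            if count * math.log(n_plots / doc_freq[term]) >= threshold
        })
    return selected


# Hypothetical toy plots, used only to exercise the function.
plots = [
    "a heist crew plans a heist in a dream",
    "a detective hunts a killer in the city",
    "the crew of a ship explores a distant planet",
]
terms = high_weight_terms(plots, threshold=1.0)
```

A term such as "a", which occurs in every plot, gets an IDF of log(1) = 0 and is always filtered out, whereas a term that is frequent in one plot and absent from the others is kept.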

By applying TF-IDF, we obtain for each film a set of high-weight plot terms, so that comparing two films by the plot feature is more accurate. This function runs entirely automatically: when a new movie is added to the OMS system, all data are recalculated on retrieval to maintain the highest accuracy. In addition, OMS dynamically shows the suggested movies to users based on the trend of their choices. For example, suppose a user marks movie \(m_{1}\) as “have seen”. The first time, we use our calculation to show the set \(\left\{ {m_{i} | i \in \left[ {2, \ldots ,6} \right]} \right\}\), and the user may refresh the suggestions a few times before selecting a similar movie. After a limited number of rounds, however, we can predict the trend of the user’s selections, as described in Fig. 2. When the user then marks a movie \(m_{1}^{'}\) as “have seen”, we show \(\left\{ {m_{i}^{'} | i \in \left[ {2, \ldots ,6} \right]} \right\}\) so that the user can select a movie that is more similar to \(m_{1}^{'}\). At this point, the displayed movies are calculated dynamically according to the user’s selection trend. Hence, the user can quickly select a similar movie after changing the suggested movies only once or twice.
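The scoring and suggestion step can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the OMS code: each movie is assumed to be a dictionary whose five features are already sets (characters for the title, names or terms for the others), and the names `jaccard`, `similarity`, `suggest`, and `movie`, as well as the toy data, are hypothetical. The source does not specify how the \(\omega\) weights are updated, so the sketch takes them as given.

```python
FEATURES = ("title", "genre", "director", "actors", "plot")


def jaccard(a, b):
    """Jaccard coefficient of two sets (Eq. 3); returns 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(m1, m2, weights):
    """Weighted average of the per-feature Jaccard scores (Eq. 2)."""
    total = sum(weights[f] for f in FEATURES)
    return sum(weights[f] * jaccard(m1[f], m2[f]) for f in FEATURES) / total


def suggest(seen, candidates, weights, k=5):
    """Return the k candidates with the highest similarity to the seen movie."""
    ranked = sorted(candidates, key=lambda m: similarity(seen, m, weights), reverse=True)
    return ranked[:k]


def movie(title, genre, director, actors, plot):
    """Build a hypothetical movie record whose features are all sets."""
    return {"title": set(title.lower()), "genre": set(genre), "director": set(director),
            "actors": set(actors), "plot": set(plot)}


# Toy data for illustration only.
seen = movie("Space Crew", ["sci-fi", "horror"], ["D1"], ["A1", "A2"], ["crew", "ship", "creature"])
candidates = [movie(f"Film {i}", ["sci-fi"], [f"D{i}"], ["A2"], ["ship"]) for i in range(1, 8)]
weights = {f: 1.0 for f in FEATURES}  # initial trend: every feature weighted equally
top_five = suggest(seen, candidates, weights)
```

With all weights equal to 1, Eq. (2) reduces to the plain average of the five feature scores; raising the weight of one feature moves candidates that match on that feature toward the top of the list.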

Fig. 2 The trend of the user’s selection

Considering the relevance between users in our system, we group related users by analyzing each user’s data. Suppose users \(U_{i}\) and \(U_{j}\) both prioritize features in the order <Title, Genre, Actor, Director, Plot> when choosing similar movies; the system then identifies these two users as related. If instead user \(U_{i}\) focuses on <Title, Genre, Director, Actor, Plot> and \(U_{j}\) focuses on <Title, Genre, Plot, Director, Actor>, the system considers that these two users are likely to be related. In this case, the system calculates and assesses the relevance of these two users for grouping. We therefore use the soft cosine measure to quantify the similarity between two users. The formula for the relatedness of two users is as follows:

$$Sim \left( {U_{1} ,U_{2} } \right) = \frac{{\mathop \sum \nolimits_{i,j}^{F} s_{ij} U_{{1_{i} }} U_{{2_{j} }} }}{{\sqrt {\mathop \sum \nolimits_{i,j}^{F} s_{ij} U_{{1_{i} }} U_{{1_{j} }} } \sqrt {\mathop \sum \nolimits_{i,j}^{F} s_{ij} U_{{2_{i} }} U_{{2_{j} }} } }}$$
(6)

where \(s_{ij}\) is the similarity between features \(i\) and \(j\), and F is the number of user features. Based on this measure, we group users and use the data of users in the same group to recommend their chosen movie pairs. For example, if user \(U_{i}\) marks the movies (\(m_{i}\), \(m_{j}\)) as similar and user \(U_{i}\) is related to user \(U_{j}\), the system asks user \(U_{j}\) whether or not they agree that the pair (\(m_{i}\), \(m_{j}\)) is similar. The user grouping is shown in Fig. 3.
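The soft cosine measure of Eq. (6) can be sketched as follows. This is a minimal Python illustration with assumed inputs: the source does not state how a priority order is turned into a numeric user vector, so the rank-based encoding (5 for the top-priority feature down to 1) and the identity feature-similarity matrix used here are hypothetical choices.

```python
import math


def soft_cosine(u1, u2, s):
    """Soft cosine similarity of two user vectors given a feature-similarity matrix s (Eq. 6)."""
    def bilinear(a, b):
        # Sum of s[i][j] * a[i] * b[j] over all feature pairs (i, j).
        return sum(s[i][j] * a[i] * b[j] for i in range(len(a)) for j in range(len(b)))

    denominator = math.sqrt(bilinear(u1, u1)) * math.sqrt(bilinear(u2, u2))
    return bilinear(u1, u2) / denominator if denominator else 0.0


# Assumed encoding over (Title, Genre, Director, Actor, Plot): higher value = higher priority.
u_i = [5, 4, 3, 2, 1]  # <Title, Genre, Director, Actor, Plot>
u_j = [5, 4, 2, 3, 1]  # <Title, Genre, Actor, Director, Plot>

# With the identity matrix, features are treated as unrelated and Eq. (6)
# reduces to the ordinary cosine similarity.
identity = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
score = soft_cosine(u_i, u_j, identity)
```

Setting off-diagonal entries of `s` above zero (for instance between Director and Actor) lets two users who swap those features in their priority order still count as close, which is the reason for preferring the soft variant over the plain cosine.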

Fig. 3 The relatedness of users based on their trend in selecting similar movies

4.3 Statistical Analysis of the Data (Cognitive Feedback)

There are many online sources of movie data. Among them, we identified IMDB as an extensive and highly scalable movie database. We collected over 20,000 popular movies from 1990 to 2019 in nine genres, with 3439 directors and 8057 actors, and we continue importing from the open IMDB movie database. The statistics of the movie data in OMS are described in Table 1.

Table 1 Statistics of movies data in OMS

Our system is now online and continues to collect data. At this time, we have over 50 users, whose activities have produced more than 1000 pairs of similar movies.

The amount of data collected from users is shown in Fig. 4. Each record has the format \(\left( {U_{i} , m_{j} ,m_{k} ,Sim\left( {m_{j} ,m_{k} } \right),\vartheta_{i} } \right)\), where \(U_{i}\) is the id of the user; \(m_{j}\) and \(m_{k}\) are a pair of similar movies; \(Sim\left( {m_{j} ,m_{k} } \right)\) is the similarity score we calculated; and \(\vartheta_{i}\) is the number of times the user changed the suggested movies while selecting the pair.
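The record format above maps naturally onto a small data structure. The following Python sketch is illustrative only: the field names, the field types, and the helper `mean_refreshes_per_user` are assumptions, and the sample records are invented to exercise the code rather than taken from the OMS dataset.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class FeedbackRecord:
    user_id: int        # U_i
    movie_j: str        # m_j (assumed to be a string identifier)
    movie_k: str        # m_k
    similarity: float   # Sim(m_j, m_k) calculated by the system
    refresh_count: int  # number of times the user changed the suggested movies


def mean_refreshes_per_user(records):
    """Average number of suggestion refreshes for each user id."""
    by_user = defaultdict(list)
    for record in records:
        by_user[record.user_id].append(record.refresh_count)
    return {user: mean(counts) for user, counts in by_user.items()}


# Invented sample records, for illustration only.
records = [
    FeedbackRecord(1, "m10", "m42", 0.61, 3),
    FeedbackRecord(1, "m10", "m77", 0.58, 1),
    FeedbackRecord(2, "m05", "m42", 0.40, 2),
]
averages = mean_refreshes_per_user(records)
```

Aggregating the refresh count per user in this way is one simple route to checking whether users need fewer refreshes over time.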

Fig. 4 Amount of feedback data collected from users per day

We have also performed user grouping, but with the number of users in this experiment, the grouping is too fragmented. We aim to reach a larger number of users so that clustering becomes more effective. Figure 5 shows a graph representing several typical user groups in our system.

Fig. 5 Clustering of users, where T, G, D, A, and P denote title, genre, director, actor, and plot, respectively

5 Concluding Remarks

Our study focuses on understanding user cognition and answering the question, “What makes users think two movies are similar?” To do that, in the first stage, we deployed OurMovieSimilarity as a crowdsourcing platform to collect cognitive data from a crowd of users.

In addition, OMS processes the data collected from each user to analyze and predict the trend of the user’s selections. The collected data show that the number of page refreshes needed to choose a pair of similar movies decreased for each user. This helps reduce time and boredom and increases the ability to collect user feedback. From this approach to interacting with users, we infer that people will be motivated to contribute feedback to help others, as well as to discover things for themselves.

OurMovieSimilarity is still deployed online and continues to collect data. We plan to reach at least 5,000 users and 20,000 responses to analyze user cognition in choosing similar movies more accurately. We release a dataset representing user perception of the similarity between two movies on the OMS website.