
1 Introduction

With the rapid development of digital and network technology, many large video websites have emerged at home and abroad. After more than ten years of exploration and practice, major video sites (domestic sites such as iQIYI, Tencent and Youku, and foreign sites such as YouTube and Netflix) have come to dominate the field, and watching online video has become part of people's daily life. How to recommend videos that users may be interested in, and to do so in a timely manner, is therefore an important problem that every major video website needs to address.

In recent years, videos on which users can post barrage comments have become mainstream [1], and researchers are paying more and more attention to barrage video. The barrage text sent by users reflects their feelings about the current segment of the video, and the number of barrages reflects how much that segment is favored by users. Therefore, this paper studies the number of barrages in a video to improve the accuracy of video recommendation.

At present, most domestic research is based on the analysis of barrage content. For example, the literature [2] classifies barrage words with a topic model and, combining the topic distribution of each word with an emotion dictionary, proposes an algorithm that measures the emotion vector of dynamic evaluation words. It uses a "global + local" context-related emotion similarity calculation to compute emotional similarity, and finally obtains the recommended video segments from the similarity scores. However, the method handles barrage content without emotional color, or with irony, rather poorly.

The literature [3] first processed the barrage text data and calculated its emotional values, then improved the traditional k-means algorithm by using a dynamic time warping algorithm to compute the distance between data points. The emotional values are then clustered so as to distinguish the similarities and differences in users' emotions while watching a video. However, that work only computes the positive and negative emotions of users who send a large amount of barrage data and discards users who send little, so it cannot classify users who like a video but rarely send barrages. In addition, emotion analysis of barrage text relies heavily on the barrage content itself. Therefore, this paper proposes a video recommendation model based on the number of barrages combined with a recursive convolutional neural network. The model uses the number of barrages to locate the segments of a video favored by users and adopts k-means clustering to extract key frames from these segments. The key frames are then processed into static images that serve as input to the recursive convolutional neural network, from which the important human behavior features are extracted. On this basis, videos with similar frames can be found and recommended to users.

The remainder of this paper is organized as follows. Section 2 reviews related work on video recommendation and on video recommendation based on convolutional neural networks. Section 3 defines the problem of barrage-based video recommendation. Section 4 describes the video recommendation workflow, which consists of data preprocessing and the RCNN model. Section 5 presents the data sources, the experimental setup and the experimental results. Finally, Sect. 6 concludes the paper and outlines future research directions for barrage-based video recommendation.

2 Related Works

This section introduces related research on video recommendation and on video recommendation methods based on convolutional neural networks.

2.1 Traditional Video Recommendation

Traditional video recommendation is generally divided into content-based recommendation methods and collaborative filtering based recommendation methods.

  1. (1)

    Content-based video recommendation methods: the literature [4] proposed a recommendation method based on label weight scores, in which each label is assigned a score that represents the weight of the item or the user on that label, in order to reduce the influence of objective factors on user ratings and improve the accuracy and authenticity of the scores. On the basis of context, Lin [5] proposed a recommendation model based on user decisions that selects the required features or feature combinations according to the factors affecting user decisions, and treats user interest preference as the direct influencing factor of the decision. Tzamousis [6] uses various machine learning algorithms to learn efficient combinations of recommendation algorithms and can select the best hybrid method for a given input; this approach can easily be extended to other recommendation methods.

  2. (2)

    Collaborative filtering based video recommendation methods: Nguyen [7] proposed a probabilistic model that combines explicit and implicit feedback. The method first uses the explicit feedback of users and items to build a matrix factorization model, and then uses an item embedding model to find item representations, so as to capture the relationships among items implied by the implicit feedback. To address the problem that a single user is only interested in some fields, Zhang [8] proposed a collaborative clustering recommendation method that groups users and items according to their interests or characteristics and then makes recommendations within each group. Zhang [9] improved the original frequency-based and ranking-based information kernel extraction methods and proposed a method that considers the similarity between users and makes full use of the rating information of users and goods when building the neighbor list, thereby optimizing the search for the most similar neighbors.

2.2 Video Recommendation Based on Convolutional Neural Network

Li [11] proposed a method that computes video relevance directly from the content rather than deriving it from the decomposition of the user behavior matrix. The method uses a deep convolutional neural network to process video information (such as pixels, audio, subtitles and metadata) in order to build a video link table and reduce the amount of behavior data required for new videos. In addition, in order to improve the applicability of the convolutional neural network, the literature [11] uses conditional convolution to extract user behavior features: the feature vector of the item is introduced as the convolution kernel and no extra training parameters are set. Moreover, only one convolution layer is needed to obtain higher-order combinations between the N attributes of a user and the N attributes of an item. On this basis, the method integrates user and item attribute information into neural collaborative filtering.

Cai [12] proposed an improved recommendation method that combines matrix factorization with a cross-channel convolutional neural network and adds user and item influence factors to traditional matrix factorization. The information matrix composed of word vectors is then fed into the convolutional neural network, and the feature values of the review information are finally combined with the regularization term of the improved matrix factorization model. The literature [13] applies a dynamic convolutional probabilistic matrix factorization model to group recommendation, integrating both the text representation learned by a convolutional neural network and a state space model into the latent factor model.

3 Problem Formulation

This section introduces the background of this paper, including the characteristics and mechanism of barrage video, and gives the relevant definitions. When a user sees a scene in a video, he or she may write some text and send it to the video at that moment. Other users may send their own texts when they watch the same scene later. Thus, when the number of barrages within a certain period of time exceeds a certain value, it can be concluded that the users who sent them are particularly interested in this scene. By analyzing these segments of interest, we can find similar videos to recommend to users. The above problem is summarized as Definition 1.

Definition 1

(The video recommendation model based on barrage). Let the video set be \(V=\{v_{1},v_{2},\cdots ,v_{n}\}\), where |V| denotes the number of videos in V and \(T_{v_{i}}\) is the duration of video \(v_{i}\); let the user set be \(C=\{c_{1},c_{2},\cdots ,c_{m}\}\); and for each video \(v_{i}\) let \(d_{i,t_{j}}\) be the number of barrages at time \(t_{j}\) and \(g_{ij}\) the number of views. The model first takes the time points \(t_{j}\) at which the barrage count \(d_{i,t_{j}}>\lambda \) in video \(v_{i}\) (\(\lambda \) is obtained by testing). According to these \(t_{j}\), it then extracts the video fragments \(f_{i}=\{q_{i,1},q_{i,2},\cdots ,q_{i,|v_{i}|}\}\), where \(q_{i,k}\) denotes the k-th fragment of the i-th video. Next, k-means clustering is used to extract the key frames from the fragment set \(f_{i}\). The extracted data set S is processed into the structured data set D and fed into the recursive convolutional neural network model in order to improve the accuracy \(\varepsilon \) of the recommendation system. The mathematical model of the problem is shown in Eq. (1).

(1)

4 Video Recommendation Structure

The video recommendation workflow consists of two sub-modules: (1) data preprocessing and (2) the RCNN model, as shown in Fig. 1.

Fig. 1. The process of video recommendation

4.1 Barrage Data Preprocessing

The data preprocessing stage mainly includes three steps: counting the number of barrages, cutting video fragments, and extracting key frames. A barrage is presented as dynamic text, either scrolling or static, so the number of barrages at a given playback time is the number of barrages on the same screen. On Bilibili both the scrolling time and the stationary time of each barrage are 7 s, and tests show that the number of barrages on the same screen rarely exceeds 50 [14]. Literature [15] drew a "time-barrage" polyline graph with coordinates (\(t_{j}\), \(d_{i,t_{j}}\)) at an interval of 5 s. This paper first counts the number of barrages \(d_{i,t_{j}}\) of each video, finds the time points \(t_{j}\) at which the barrage count \(d_{i,t_{j}}>\lambda \) in video \(v_{i}\), and then cuts the video segments \(f_{i}\) according to these time points, where \(\lambda \) is simply computed by Eq. (2).

$$\begin{aligned} \lambda =\frac{1}{|T_{v_{i}}|}\sum _{t_{j}=1}^{T_{v_{i}}}d_{i,t_{j}} \end{aligned}$$
(2)
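The thresholding step above can be sketched in a few lines of Python. The 5-s counting interval follows [15], while the segment half-width and the example timestamps are purely illustrative assumptions:

```python
from collections import Counter

def select_segments(barrage_times, video_length, interval=5, half_width=10):
    """Find time points whose barrage count exceeds lambda (Eq. 2) and
    return candidate segments around them (illustrative sketch)."""
    # Count barrages d_{i,t_j} per `interval`-second bin.
    counts = Counter(int(t) // interval for t in barrage_times)
    d = [counts.get(b, 0) for b in range(video_length // interval + 1)]

    # Eq. (2): lambda is the mean barrage count over the whole video.
    lam = sum(d) / len(d)

    # Keep the time points t_j with d_{i,t_j} > lambda and cut a short
    # segment around each of them.
    segments = [(max(0, b * interval - half_width), b * interval + half_width)
                for b, c in enumerate(d) if c > lam]
    return lam, segments

# Example: barrage timestamps (in seconds) of one episode.
lam, segs = select_segments([3, 5, 5, 6, 120, 121, 121, 122, 123], video_length=300)
```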

Considering the interference at the beginning and end of a video, it is necessary to remove the barrages from some periods, for example when the opening and ending music plays. A key frame extraction algorithm based on k-means clustering is then used to extract the main frames from the video fragments and select the frames that best represent the characteristics of the video.
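One possible implementation of this key-frame step is given below, using OpenCV and scikit-learn; the frame sampling rate, the colour-histogram features and the number of clusters are assumptions made for illustration rather than settings taken from the paper:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_keyframes(video_path, n_keyframes=5, step=10):
    """Cluster sampled frames by colour histogram and keep the frame
    closest to each cluster centre as a key frame (illustrative sketch)."""
    cap = cv2.VideoCapture(video_path)
    frames, feats = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                      # sample every `step`-th frame
            hist = cv2.calcHist([frame], [0, 1, 2], None,
                                [8, 8, 8], [0, 256] * 3).flatten()
            feats.append(hist / (hist.sum() + 1e-9))
            frames.append(frame)
        idx += 1
    cap.release()

    feats = np.array(feats)
    km = KMeans(n_clusters=min(n_keyframes, len(feats)), n_init=10).fit(feats)
    keyframes = []
    for c in range(km.n_clusters):               # frame nearest to each centre
        members = np.where(km.labels_ == c)[0]
        centre = km.cluster_centers_[c]
        best = members[np.argmin(np.linalg.norm(feats[members] - centre, axis=1))]
        keyframes.append(frames[best])
    return keyframes
```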

4.2 RCNN Model

The RCNN model is composed of a convolutional neural network (CNN) and a recursive neural network (an Elman recursive neural network). The network structure is shown in Fig. 2.

Fig. 2. RCNN model

The CNN has three convolution layers and three pooling layers. After the fully connected layer, the network does not go directly to a softmax (classification) layer; instead, its output is passed to the added recursive neural network layer. The ReLU function is used as the nonlinear mapping function in the network.
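A minimal Keras sketch of such an architecture is given below. TimeDistributed applies the per-frame CNN to every key frame so that the resulting sequence can feed an Elman-style SimpleRNN layer; the filter counts, frame size and output dimension are illustrative assumptions, not the paper's exact settings:

```python
from tensorflow.keras import layers, models

def build_rcnn(n_frames=5, frame_shape=(64, 64, 3), n_classes=30):
    """Sketch of the RCNN: a 3-conv/3-pool CNN per key frame, a fully
    connected layer, then an Elman-style recurrent layer (SimpleRNN)."""
    frame_cnn = models.Sequential([
        layers.Input(shape=frame_shape),
        layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),    # fully connected layer
    ])
    model = models.Sequential([
        layers.Input(shape=(n_frames, *frame_shape)),
        layers.TimeDistributed(frame_cnn),       # CNN applied to every key frame
        layers.SimpleRNN(64, activation="relu"), # Elman-style recurrent layer
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="categorical_crossentropy")
    return model
```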

The construction steps of the video recommendation model based on the recursive convolutional neural network proposed in this paper are as follows:

  • Step 1: use the convolutional neural network in the RCNN model to extract the necessary spatial features from the video frames that have been selected to represent the video, convert each image into low-dimensional feature information, and output this information. The general calculation form of a convolution layer is shown in Eq. (3).

    $$\begin{aligned} S_{j}^{l}=f\left( \underset{i\in M_{j}}{\sum }Kernel_{ij}^{l}\times S_{i}^{l-1}+b_{j}^{l}\right) \end{aligned}$$
    (3)

    Where \(S_{j}^{l}\) is the j-th feature map obtained by convolution at layer l, \(f\left( \circ \right) \) is the activation function, \(S_{i}^{l-1}\) is the i-th output feature map of the previous layer, \(Kernel_{ij}^{l}\) is the convolution kernel matrix, \(b_{j}^{l}\) is the bias of the feature map after convolution, and \(M_{j}\) is the set of input feature maps, i.e. which image features are selected as the input of this convolution layer.

  • Step 2: input the low-dimensional feature information into the RNN part of the RCNN model. The relations between the input layer, hidden layer, connection layer and output layer of the RNN are given by Eqs. (4), (5) and (6).

    $$\begin{aligned} v_{i}\left( k\right) =\sum _{j=1}^{n}w_{ij}^{E}\left( k-1\right) s_{j}^{E}\left( k\right) +w_{i}^{u}\left( k\right) u\left( k\right) \end{aligned}$$
    (4)
    $$\begin{aligned} s_{i}\left( k\right) =f\left( v_{i}\right) \end{aligned}$$
    (5)
    $$\begin{aligned} s_{j}^{E}\left( k\right) =s_{j}\left( k-1\right) \end{aligned}$$
    (6)

    The output of the network \(D\left( k\right) \) is calculated by Eq. (7).

    $$\begin{aligned} D\left( k\right) =\sum _{i=1}^{n}w_{i}^{D}\left( k-1\right) s_{i}\left( k\right) \end{aligned}$$
    (7)

    Where \(v_{i}\left( k\right) \) represents the total input of the i-th hidden layer unit and \(f\left( \circ \right) \) represents the activation function. \(w_{ij}^{E}\), \(w_{i}^{u}\) and \(w_{i}^{D}\) represent the weights from the connection layer to the hidden layer, from the input layer to the hidden layer, and from the hidden layer to the output layer, respectively (a NumPy sketch of this forward pass is given after Algorithm 1).

  • Step 3: according to the parameter settings, compute the partial derivatives of the loss function \(\sigma _{j}\) with respect to the weights and biases of each layer using the error back-propagation algorithm, then update the parameters and at the same time adjust the feedback weights of the Elman layer. When the model reaches the maximum number \(\mu \) of iterations or the loss function falls within a reasonable range, stop the training; otherwise return to Step 2.

  • Step 4: after training ends, obtain the feature vectors of the specific videos that a user is interested in. According to these feature vectors, sort the candidate videos by similarity in descending order and recommend the top items to the user. In summary, the barrage-based recursive convolutional neural network recommendation algorithm is described as Algorithm 1.

Algorithm 1. The barrage-based recursive convolutional neural network recommendation algorithm
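For reference, the forward pass of Eqs. (4)–(7) can be written directly in NumPy. The layer sizes, the sigmoid activation and the fixed (rather than iteration-dependent) weights are simplifying assumptions for illustration:

```python
import numpy as np

def elman_forward(u_seq, n_hidden=8, seed=0):
    """Forward pass of Eqs. (4)-(7): the hidden state s(k) is fed back
    through the connection layer s^E(k) = s(k-1); weights are random
    and held fixed here purely for illustration."""
    rng = np.random.default_rng(seed)
    n_in = u_seq.shape[1]
    W_E = rng.normal(size=(n_hidden, n_hidden))  # connection -> hidden, w^E
    W_u = rng.normal(size=(n_hidden, n_in))      # input      -> hidden, w^u
    W_D = rng.normal(size=(1, n_hidden))         # hidden     -> output, w^D
    f = lambda x: 1.0 / (1.0 + np.exp(-x))       # activation f(.)

    s_prev = np.zeros(n_hidden)                  # s^E(1) = s(0) = 0
    outputs = []
    for u_k in u_seq:                            # k = 1, 2, ...
        v_k = W_E @ s_prev + W_u @ u_k           # Eq. (4)
        s_k = f(v_k)                             # Eq. (5)
        outputs.append(W_D @ s_k)                # Eq. (7)
        s_prev = s_k                             # Eq. (6): s^E(k+1) = s(k)
    return np.array(outputs)

# Example: a sequence of 5 four-dimensional CNN feature vectors.
y = elman_forward(np.random.rand(5, 4))
```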

5 Experimental Results

This section briefly introduces the data sources and the experimental environment. The videos used in this experiment were downloaded from the domestic barrage website "bilibili" with the "jiji" Down software. Multiple types of video were chosen from the classic video set: the modern comedy Home With Kids, the ancient costume comedy Bronze Teeth Ji Xiaolan, the humanities war drama Bright Sword, the historical costume drama Youth Bao Zheng, the action comedy World for the Monkey King, and the martial arts romance The Heaven Sword and Dragon Saber by Jin Yong. We choose the first 5 episodes of each drama, so the total number of episodes is 30. Since bilibili.com maintains a barrage pool and limits the amount of barrage data, videos longer than 20 min have about 3,000 barrages per episode and videos longer than 40 min have 6,000–8,000 barrages per episode. The raw barrage list is shown in Fig. 3, where each entry has the form <p = "the time at which the barrage appears in the video (seconds), the barrage mode (1–3 rolling barrage, 4 bottom barrage, 5 top barrage, 6 reverse barrage, 7 positioned barrage, 8 advanced barrage), font size (px), font color (HTML color, decimal), the generating time of the barrage (Unix timestamp), the barrage pool, the sender id of the barrage, the barrage id in the barrage database"> followed by the barrage content.

Fig. 3. Barrage-list

A Python program crawls the sending time, content and sender id of each barrage. The barrage data is then processed and visualized, as shown in Fig. 4.

Fig. 4. The list of processed barrage
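A minimal parsing sketch for barrage files with the attribute layout described above (Fig. 3) is shown below; it assumes the commonly used bilibili XML export format, in which each barrage is a <d p="..."> element whose text is the barrage content:

```python
import xml.etree.ElementTree as ET

def parse_barrages(xml_path):
    """Extract (appear-time, mode, sender id, content) records from a
    barrage XML file whose p-attributes follow the layout in Fig. 3."""
    records = []
    for d in ET.parse(xml_path).getroot().iter("d"):
        p = d.get("p").split(",")
        records.append({
            "time": float(p[0]),      # second at which the barrage appears
            "mode": int(p[1]),        # 1-3 rolling, 4 bottom, 5 top, ...
            "sender": p[6],           # sender id of the barrage
            "content": d.text or "",  # barrage text
        })
    return sorted(records, key=lambda r: r["time"])
```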

In this paper, the barrages at the beginning and end of each video are removed from the barrage counts, reducing the noise, as shown in Fig. 5.

Fig. 5. The changing number of barrages

Besides the data preprocessing described above, we set up the training model and parameters as follows:

  1. (1)

    The operating system is Windows 7, with an Intel Core i5 processor and 16 GB RAM.

  2. (2)

    The programming environment is Python 3.0 with the Keras library. The stochastic gradient descent algorithm is used to train the weights of the RCNN.

  3. (3)

    In order to evaluate the overall performance of the model, the data set was randomly divided into a training set (80%) and a test set (20%), and the Mean Absolute Error (MAE) [16] is adopted as the evaluation metric, as shown in Eq. (8).

    $$\begin{aligned} M_{MAE}=\frac{\sum _{i=1}^{N}|p_{i}-q_{i}|}{N} \end{aligned}$$
    (8)

Where N is the number of recommended videos, \(p_{i}\) is the predicted result and \(q_{i}\) is the actual result. MAE reflects the difference between the predicted and actual values of the algorithm: the smaller the MAE, the more accurate the recommendation algorithm. In order to verify the effectiveness of the model, the proposed algorithm is compared with the collaborative topic regression algorithm (CTR) [17], the collaborative deep learning algorithm (CDL) [18] and the convolutional matrix factorization algorithm (ConvMF) [19] on the same data set.
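A short sketch of the evaluation step under these settings (an 80%/20% random split and the MAE of Eq. (8)) might look as follows; the function and variable names are illustrative:

```python
import numpy as np

def mae(predicted, actual):
    """Mean Absolute Error of Eq. (8)."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    return np.abs(predicted - actual).mean()

def split_80_20(samples, seed=0):
    """Randomly split the data set into 80% training and 20% test."""
    idx = np.random.default_rng(seed).permutation(len(samples))
    cut = int(0.8 * len(samples))
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]

# Example: MAE between predicted and actual recommendation results.
print(mae([0.9, 0.4, 0.7], [1.0, 0.5, 0.5]))  # -> 0.1333...
```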

Fig. 6. The experiment results of model comparison

This paper uses the data set crawled from the bilibili website. In order to reduce the herd effect in video barrages, the data is preprocessed as described above; the experimental results obtained on the test set are shown in Fig. 6.

From the figure, we can see that compared with CTR, CDL and ConvMF, the performance of the proposed method improves by 0.22, 0.18 and 0.31, respectively. The comparison results on this data set show that the proposed method achieves higher accuracy.

6 Conclusion

In order to improve the accuracy of the recommendation model, this paper proposes a barrage-based video recommendation method that adopts a convolutional recursive neural network. Barrage video makes it possible to quickly capture users' preference features from the barrages, and the convolutional recursive neural network improves the overall recommendation performance of the prediction model. Experimental results show that the proposed barrage-based convolutional recursive neural network recommendation method can increase the choices offered to users. However, the model's recognition ability is weak for videos without motion features. Therefore, future work on video key frames will add content-related descriptive statements and improve the recursive neural network [20,21,22] to further increase the accuracy of the prediction model.