
1 Introduction

Over the past decade, streaming media services have experienced unprecedented growth. Recommending specialized types of content to customers has become an indispensable capability for streaming sites, which is why automatic labeling has attracted increasing attention in recent years. In particular, Movie Genre Classification (MGC), an important branch of automatic labeling with a wide range of applications (e.g., organizing user videos from social media sites, correcting mislabeled videos, and recommending specific types of films to users), has received significant effort from existing work. Specifically, the task aims to classify movies into different genres and faces new challenges, owing to the emergence of ever more movie-related multi-modal information and the diverse demands of consumers.

Despite the significant contributions to multi-modal MGC made by existing work [1, 6, 14, 25], these methods usually fuse the features of different modalities via concatenation [4, 27] or weighted sum [1, 8, 25], failing to effectively capture the semantic information contained in the multi-modal data. Additionally, existing studies ignore the movies’ metadata (e.g., directors and actors), which is of critical importance to a high-performance MGC method. For example, a movie and its sequels usually share the same directors or main actors and are more likely to have the same genres than two unrelated movies. This information can be effectively exploited by constructing a movie graph and extracting structural features from it. In a nutshell, there remains great scope for improving the performance of existing MGC approaches due to the following problems: Problem 1) the multi-modal fusion strategies of existing studies cannot effectively exploit the semantic information of multi-modal data; Problem 2) the movies’ metadata, which contains abundant structural information, has been ignored by existing work.

Having observed the limitations of the above-mentioned studies, we propose a novel model named MFMGC that is composed of two modules: MDF (Multi-modal Data Fusion) and MGRL (Movie Graph Representation Learning). In detail, the module MDF is designed to address Problem 1). Different from most existing studies that rely on late fusion strategies, MDF utilizes the attention mechanism to fuse multi-modal data during the feature extraction process. To be specific, there are two main attention layers in MDF, which are used to explore the semantic features contained in movie-related multi-modal data. Inspired by VLBert [22], which takes the embeddings of both the words in a sentence and regions-of-interest (RoI) from images as inputs and utilizes the Transformer encoder to model dependencies among all the input elements, the first attention layer feeds the text and video frames into the Transformer encoder for text-video feature extraction. Then, the second, modal-attention layer is designed to fuse the features of different modalities. In addition, the module MGRL is developed to tackle Problem 2). A movie graph is constructed based on the overlap of directors, screenwriters, and actors. Each movie is represented as a node in the graph and has multi-modal representations, which are obtained by fusing the movie-related multi-modal attributes with the module MDF. Next, a Graph Convolutional Network (GCN)-based architecture is applied to capture the structural information between movie nodes. Ultimately, a classification layer is employed to predict the genres of movies.

To fully evaluate the effectiveness of our proposed model MFMGC, extensive experiments on real-world datasets are essential. However, most of the datasets used in previous studies are either not publicly available or incomplete [5, 18]. Consequently, apart from the open dataset Moviescope [8], we construct a new multi-modal movie dataset called MovieBricks from Douban, the most active online movie database and review platform in China. Specifically, MovieBricks contains 4063 European and American movies released from 2000 to 2019.

The contributions of this paper are summarized as follows:

  • We propose a novel model MFMGC to further improve the performance of existing work on the MGC task, by fully exploring the semantic features contained in movie-related multi-modal data and the structural information between movies.

  • Two modules are developed in MFMGC, i.e., MDF and MGRL. The module MDF is designed to tackle Problem 1) by capturing the interactive information between different modalities with novel fusion layers. The module MGRL is developed to address Problem 2) by extracting structural information from the movie graph, which is constructed based on the overlap of movies’ directors, screenwriters, and actors.

  • We conduct extensive experiments on two real-world datasets, i.e., Moviescope and MovieBricks. Particularly, MovieBricks is the first multi-modal movie dataset in China, comprising over 4000 movies with four different modalities, including synopsis, poster, trailer, and metadata. The results demonstrate the superior performance of the proposed model MFMGC compared with the state-of-the-art methods.

The rest of the paper is organized as follows. The related work is presented in Sect. 2 and the task MGC is formulated in Sect. 3. The proposed model MFMGC is introduced in Sect. 4. We report the experimental results in Sect. 5, which is followed by the conclusion in Sect. 6.

2 Related Work

2.1 Research on Movies

Due to their rich storytelling and high-quality footage, movies have become a valuable resource for researchers. Current studies on movies can be categorized into three directions: analyzing the content of movies, examining the impact of movies, and studying the characteristics of movies. Research on movie content mainly uses movie trailers as video data, e.g., scene boundary detection [9, 20], which aims to divide a video into easily interpretable parts that communicate a storyline effectively, and action recognition [21, 26], which utilizes the video scripts that exist for thousands of movies to automatically extract and track faces together with corresponding motion features. Studies on movie influence include movie box office prediction [16, 28] and movie review analysis [13, 23]. Box office prediction before a movie’s theatrical release can decrease its financial risk, and movie review analysis is a Natural Language Processing task that extracts emotional or semantic information from a movie’s reviews. In addition, studies on movie characteristics include understanding the relationships of movie characters [3, 15], which aims to weigh the importance of characters in defining a story, and movie genre classification [1, 4, 7, 8, 18, 25].

2.2 Movie Genre Classification

To better contextualize our study, we review existing work focusing on multi-modal data, with particular emphasis on fusion strategies, and introduce it in chronological order.

Wehrmann et al. [25] propose a deep neural architecture called CTT-MMC for multi-label movie-trailer genre classification. The authors utilize both video and audio data, and the fusion strategy involves a Maxout layer before the class prediction, which can be interpreted as a late fusion strategy. John et al. [1] propose a model for multi-modal learning based on gated neural networks for MGC. They utilize plot and poster data for the classification task. The gated mechanism is used to obtain the weights of different modalities, which are then combined by a weighted sum for the final classification. The model is also utilized in other work such as [7]. Cascante et al. [8] compare the effectiveness of visual, audio, text, and metadata-based features in predicting movie genres. They utilize trainable parameters to sum the features of the different modalities. Behrouzi et al. [4] design a new structure based on the Gated Recurrent Unit (GRU) to extract spatial-temporal features from movie-related data. The authors concatenate the video and audio features to predict the final genres of movies. Mangolin et al. [18] extract features by computing different kinds of descriptors, and then combine classifiers through the calculation of a predicted score for each class; they propose three fusion rules, i.e., Sum, Prod, and Max.

In summary, current research on MGC with multi-modal data mainly utilizes late fusion strategies, such as concatenation and weighted sum, which fail to capture the interaction between different modalities, and it ignores the structural information contained in metadata. To address these limitations, we propose the novel model MFMGC in this study.

3 Problem Formulation

Given a set of movies \(\{M_1, \cdots , M_N\}\), each movie \(M_i\) is associated with multi-modal attributes and metadata, i.e., \(M_i = \{M^{t}_i, M^{p}_i, M^{v}_i, M^{a}_i, M^{m}_i\}\). In detail, \(M^{t}_i\) denotes the textual data consisting of the movie’s title and synopsis, \(M^{p}_i\) represents the movie’s poster, \(M^{v}_i\) denotes the visual data, i.e., a sequence of frame-level patches from the trailer, \(M^{a}_i\) represents the audio fragments extracted from the trailer, and \(M^{m}_i\) is the metadata of the movie. Moreover, \(C = \{c_1, \cdots , c_L\}\) is the genre set, where L is the number of movie genres.

Intuitively, movies with a significant overlap of directors, screenwriters, and actors may belong to the same genre, and a corresponding example is presented in Fig. 1. To capture such information effectively, we construct a multi-modal movie graph. Specifically, the graph is denoted as \(G=\{V, E\}\), where V is the set of movie nodes, i.e., \(V=\{M_1, \cdots , M_N\}\), and E is the set of connections between each pair of movie nodes. Additionally, we design an adjacency matrix A for the edge set E, where \(A_{ij}\) represents whether there is an edge between \(M_i\) and \(M_j\). Given a threshold \(\mathcal {T}\), if the overlap of directors, screenwriters, and actors between \(M_i\) and \(M_j\) exceeds \(\mathcal {T}\), \(A_{ij}\) is set to 1. Otherwise, \(A_{ij}\) is set to 0.
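For concreteness, the sketch below shows one way this adjacency matrix could be built; the overlap measure (the count of shared directors, screenwriters, and actors) and all variable names are our illustrative assumptions, since the formulation above only requires the overlap between two movies to exceed \(\mathcal {T}\).

```python
import numpy as np

def build_movie_graph(credits, threshold):
    """Build the adjacency matrix A of the movie graph.

    credits: list of sets, where credits[i] contains the directors,
             screenwriters, and actors of movie M_i.
    threshold: the overlap threshold T described above.
    """
    n = len(credits)
    A = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(i + 1, n):
            # Overlap measured as the number of shared credits (an assumption;
            # the text only requires the overlap to exceed the threshold).
            overlap = len(credits[i] & credits[j])
            if overlap > threshold:
                A[i, j] = A[j, i] = 1.0
    return A

# Example: three movies, the first two share a director and an actor.
credits = [{"dir_A", "actor_B", "actor_C"}, {"dir_A", "actor_B"}, {"dir_D"}]
A = build_movie_graph(credits, threshold=1)
```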

Fig. 1. A multi-modal movie graph, where each node has four multi-modal attributes, i.e., text, poster, video, and audio. Movie nodes are connected according to the overlap of their directors, screenwriters, and actors.

Definition 1

Movie Genre Classification (MGC). Given a movie \(M_i\) from the dataset \(\{M_1, \cdots , M_N\}\) and a genre set C, the task of MGC aims to learn a function \(\varPhi \) to predict the genres of movie \(M_i\) based on \(M^{t}_i\), \(M^{p}_i\), \(M^{v}_i\), \(M^{a}_i\), and \(M^{m}_i\). This process is formulated as follows:

$$\begin{aligned} P_i = \varPhi (M^{t}_i, M^{p}_i, M^{v}_i, M^{a}_i, M^{m}_i), \end{aligned}$$
(1)

where \(P_i = \{c_{x}, \cdots , c_{y}\}\) is the set of genres assigned to the movie \(M_i\) and each genre in \(\{c_{x}, \cdots , c_{y}\}\) is from C.

Note that MGC is a multi-label classification task [27] and each movie may belong to multiple genres at the same time. For instance, the movie “X-Men: The Last Stand” has multiple genres, i.e., Action, Horror, and Sci-Fi.

4 Proposed Model

4.1 Overview

To effectively utilize the multi-modal data of movies for MGC, we propose a novel model named MFMGC. As shown in Fig. 2, the model contains two modules, i.e., Multi-modal Data Fusion (MDF) and Movie Graph Representation Learning (MGRL), whose details are as follows.

Fig. 2. Overview of the proposed model MFMGC

To feed the movie data into the module MDF, we first segment the audio and sample frames from the video, and then use different pre-trained models to embed the text, posters, video frames, and audio segments. These embeddings are fed into MDF, which consists of two stages. In the first stage, feature pre-extraction is performed. For the embeddings of text and video frames, we take them as input and utilize the Transformer encoder as the backbone to fuse the text and video modalities, inspired by VLBert [22], which feeds both the words in a sentence and regions-of-interest (RoI) from an image into the Transformer encoder. For the posters and audio data, we separately design multi-layer perceptron (MLP) layers to perform the feature pre-extraction. In the second stage, we adopt a modal-attention layer to fuse the extracted features, ensuring that the multi-modal data can be effectively integrated into a comprehensive representation of each movie. In MGRL, we deploy a GCN-based architecture to fine-tune the movie representations obtained from MDF and extract structural information between movie nodes.

4.2 Multi-modal Data Embedding

This section details the embedding process for the synopsis, poster, trailer, and audio data. We introduce how to transform these data into a suitable format before they are fed into the module MDF for feature pre-extraction and fusion.

Text Embedding. We utilize a Transformer Encoder structure to extract text features, where the text data is embedded by the Bert Embedding [12] module. Specifically, the textual data \(M^{t}_i\) of movie \(M_i\) is a token sequence, denoted as \(\{w_1, w_2, \cdots , w_l\}\), where l is the number of tokens in \(M_i^{t}\). The pre-trained Bert Embedding module is used to obtain the embedding of the token sequence, and the process is formally defined as:

$$\begin{aligned} E^{t}_i = BertEmbed(M^{t}_i), \end{aligned}$$
(2)

where \(BertEmbed(\cdot )\) is the Bert Embedding module and \(E^{t}_i \in \mathbb {R}^{l\times h^t}\) is the embedding of \(M^{t}_i\).
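As a rough sketch of Eq. (2), the snippet below applies the embedding layer of a pre-trained BERT model from the Hugging Face transformers library to the tokenized synopsis; the checkpoint name is an assumption and the fixed length l = 256 is taken from Sect. 5.3, so the authors' exact implementation may differ.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def bert_embed(text, max_len=256):
    # Tokenize the title + synopsis into a fixed-length sequence.
    tokens = tokenizer(text, padding="max_length", truncation=True,
                       max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        # Only the embedding layer is used here (token + position + segment
        # embeddings), mirroring the BertEmbed(.) module of Eq. (2).
        E_t = bert.embeddings(input_ids=tokens["input_ids"])
    return E_t  # shape: (1, l, h^t) with h^t = 768 for the base model

E_t = bert_embed("A retired assassin is pulled back for one last job.")
```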

Video Embedding. To obtain valuable information from the video data of a movie, we first extract frames at a rate of one frame per second. The extracted frames are then processed by the Swin Transformer to obtain high-level features. Specifically, the visual data \(M^{v}_i\) of movie \(M_i\) consists of p video frames, which are embedded as follows:

$$\begin{aligned} E^{v}_i=SwinSmall(M^{v}_i), \end{aligned}$$
(3)

where \(SwinSmall(\cdot )\) is the small variant of the Swin Transformer [17], \(E^{v}_i \in \mathbb {R}^{p \times h^{v}}\) is the embedding of the video frames of the i-th movie, and each frame is embedded into a vector of dimension \(h^v\).
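Eq. (3) could be realized with the Swin-Small model shipped in torchvision, as in the hedged sketch below; treating each sampled frame as an independent image and replacing the classification head with an identity mapping are our assumptions about how \(SwinSmall(\cdot )\) is applied.

```python
import torch
from torchvision.models import swin_s, Swin_S_Weights

weights = Swin_S_Weights.DEFAULT
backbone = swin_s(weights=weights).eval()
backbone.head = torch.nn.Identity()   # keep the pooled features, drop the classifier
preprocess = weights.transforms()     # resize, crop, and normalize to ImageNet stats

def embed_frames(frames):
    """frames: tensor of shape (p, 3, H, W) with values in [0, 1]."""
    with torch.no_grad():
        x = preprocess(frames)        # (p, 3, 224, 224)
        E_v = backbone(x)             # (p, h^v) with h^v = 768 for Swin-Small
    return E_v

E_v = embed_frames(torch.rand(32, 3, 360, 640))  # p = 32 frames from the trailer
```

The same backbone can embed a single poster, which yields a vector of dimension \(h^v\) as in Eq. (4).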

Poster Embedding. In addition to video data, posters are also important visual data for movies, containing rich information about the movie’s genre to attract audiences with specific preferences. We feed the poster into the Swin Transformer to obtain its embedding and the process can be formally defined as:

$$\begin{aligned} E^{p}_i=SwinSmall(M^{p}_i), \end{aligned}$$
(4)

where \(M^{p}_i\) is the poster data, and \(E^{p}_i \in \mathbb {R} ^ {h^v}\) is the poster embedding of the i-th movie.

Audio Embedding. Apart from the above-mentioned information, we also extract features from audio, since different genres of movies usually have different types of soundtracks. For instance, while both Comedy and Action movies may have visually bright scenes, the background music of Comedy movies tends to have a cheerful rather than intense rhythm. To capture latent features from the audio, we learn the corresponding embeddings with Wav2Vec2 [2]. The audio data is denoted as \(M^{a}_i = \{o_1, o_2, \cdots , o_u\}\), where \(o_j\) is the j-th fragment of the given audio; each fragment is a 3-second audio signal sampled at 16000 Hz. Note that we adopt a mean pooling operation to obtain the audio embedding from the embeddings of the fragments, and the process is as follows:

$$\begin{aligned} E^{a}_i=MP(Wav2Vec2(M^{a}_i)), \end{aligned}$$
(5)

where \(Wav2Vec2(\cdot )\) is a Wav2Vec2 layer, \(MP(\cdot )\) is the mean pooling operation, and \(E^a_i \in \mathbb {R}^{h^{a}}\) is the audio embedding of the i-th movie.
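The sketch below approximates Eq. (5) with the Hugging Face Wav2Vec2 base checkpoint; the checkpoint name and the two-level mean pooling (over time frames within each fragment, then over the u fragments) reflect our reading of the text rather than the authors' exact code.

```python
import torch
from transformers import Wav2Vec2Model

wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def embed_audio(fragments):
    """fragments: tensor of shape (u, 48000), i.e., u fragments of 3 s at 16 kHz."""
    with torch.no_grad():
        hidden = wav2vec(input_values=fragments).last_hidden_state  # (u, T, 768)
    # Mean-pool over time frames within each fragment, then over fragments,
    # giving a single audio embedding E^a_i with h^a = 768.
    return hidden.mean(dim=1).mean(dim=0)

E_a = embed_audio(0.1 * torch.randn(16, 48000))  # u = 16 fragments
```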

Ultimately, the embedding of the i-th movie’s multi-modal data can be represented as \(\mathcal {E}_i=\{E^{t}_i, E^{v}_i, E^{p}_i, E^{a}_i\}\), which is then fed into the module MDF.

4.3 Multi-modal Data Fusion - MDF

The attention mechanism in Transformer has been proven powerful and flexible to differentially weigh the significance of each part of the input data. In MDF, we utilize this mechanism to fuse multi-modal embeddings, which involve two stages. In the first stage, the Transformer Encoder and MLP are used to extract latent features from different input embeddings. In the second stage, we adopt a modal-attention layer to fuse the features extracted at the first stage.

Feature Extraction of MDF. The Transformer Encoder is particularly effective in extracting sequential features, making it suitable for processing text and video frames. Specifically, in MDF, we first concatenate the embeddings of text and video frames as \(\mathcal {E}^{tv}_i = E^{t}_i \Vert E^{v}_i\), where \(\Vert \) denotes the concatenation operation, and \(\mathcal {E}^{tv}_i \in \mathbb {R}^{(l + p)\times h^t}\). Then, the concatenated embedding \(\mathcal {E}^{tv}_i\) is fed into the fusion module, which consists of a Transformer encoder [24] and a Mean pooling layer. The calculation process is formulated as follows:

$$\begin{aligned} O^{tv}_i = MP(TransEncoder(\mathcal {E}^{tv}_i)), \end{aligned}$$
(6)

where \(TransEncoder(\cdot )\) denotes the Transformer Encoder.

For the poster and audio embeddings, we employ two multi-layer perceptron (MLP) layers to extract their features, respectively. Each MLP consists of two fully connected layers with a ReLU activation function in between. The process can be formulated as follows:

$$\begin{aligned} O^{p}_i = ReLU(E^{p}_iW^{p}_1 + b^{p}_1)W^{p}_2 + b^{p}_2, \end{aligned}$$
(7)
$$\begin{aligned} O^{a}_i = ReLU(E^{a}_iW^{a}_1 + b^{a}_1)W^{a}_2 + b^{a}_2, \end{aligned}$$
(8)

where \(E^{p/a}_i\) denotes the embedding of posters or audio, \(W^{p/a}_1\) and \(W^{p/a}_2\) are the weight matrices of the two fully connected layers, \(b^{p/a}_1\) and \(b^{p/a}_2\) are biases, and \(\text {ReLU}(\cdot )\) is the Rectified Linear Unit activation function. After the features are extracted, the set of representations \(\mathcal {O}_i=\{O^{tv}_i, O^{p}_i, O^{a}_i\}\) is obtained.
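A compact PyTorch sketch of this first stage is given below; the number of encoder layers, the attention-head count, and the projection of frame embeddings to the text dimension before concatenation (Eqs. (6)-(8) do not state how the dimensions are matched) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeaturePreExtraction(nn.Module):
    """Stage 1 of MDF: text-video fusion plus poster/audio MLPs (a sketch)."""

    def __init__(self, h_t=768, h_v=768, h_a=768, h=256, n_layers=2, n_heads=8):
        super().__init__()
        # Project frame embeddings to the text dimension so the two sequences
        # can be concatenated (an assumption; the paper concatenates directly).
        self.frame_proj = nn.Linear(h_v, h_t)
        enc_layer = nn.TransformerEncoderLayer(d_model=h_t, nhead=n_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.poster_mlp = nn.Sequential(nn.Linear(h_v, h), nn.ReLU(), nn.Linear(h, h))
        self.audio_mlp = nn.Sequential(nn.Linear(h_a, h), nn.ReLU(), nn.Linear(h, h))

    def forward(self, E_t, E_v, E_p, E_a):
        # Eq. (6): concatenate along the sequence axis, encode, then mean-pool.
        seq = torch.cat([E_t, self.frame_proj(E_v)], dim=1)   # (B, l+p, h^t)
        O_tv = self.encoder(seq).mean(dim=1)                  # (B, h^t)
        O_p = self.poster_mlp(E_p)                            # Eq. (7)
        O_a = self.audio_mlp(E_a)                             # Eq. (8)
        return O_tv, O_p, O_a
```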

Modal-Attention Layer of MDF. Following the feature extraction, we apply modal attention to fuse the features of different modalities. Specifically, the text-video feature \(O^{tv}_i\) is first transformed, through a Linear layer, into a new vector \(\tilde{O}^{tv}_i\) that has the same dimension as the poster and audio features. The multi-modal input features are then stacked to obtain \(\hat{\mathcal {O}}_i \in \mathbb {R}^{m \times h}\), where m is the number of features in \(\mathcal {O}_i\). Next, the query matrix \(Q_i = \hat{\mathcal {O}}_i W_q\) is obtained through the projection matrix \(W_q\), while the key matrix \(K_i\) and value matrix \(V_i\) are obtained using \(W_k\) and \(W_v\), respectively. The scaled dot-product function is used as the attention function, and the inter-modal attention matrix \(P_i\) is obtained as follows,

$$\begin{aligned} P_i = softmax(\frac{Q_i K_{i}^{T}}{\sqrt{h}}), \end{aligned}$$
(9)

where \(P_i \in \mathbb {R}^{m \times m}\) and each element \(P_{i,xy}\) of the matrix represents the inter-modal attention between the x-th and y-th modality of the i-th movie \(M_i\). Then, the multi-modal representation of \(M_i\), which is denoted as \(F_i\), is obtained through attention aggregation and the map function \(\mathcal {V}\). Additionally, a residual connection is added to avoid the problem of vanishing gradients during training, and the process can be represented as follows:

$$\begin{aligned} F_i = \mathcal {V}(P_i V_i + \hat{\mathcal {O}}_i), \end{aligned}$$
(10)

where \(\mathcal {V(\cdot )}\) denotes the vectorization by row-wise concatenation, and \(F_i \in \mathbb {R}^{1 \times mh}\). Finally, we obtain \(\mathcal {F}=\{F_1,F_2,\cdots ,F_N\}\), which contains the multi-modal representations of all movies in the given dataset.
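The modal-attention layer of Eqs. (9)-(10) can be sketched as follows; the single-head formulation and the layer sizes are assumptions, while the residual connection and row-wise flattening follow the description above.

```python
import torch
import torch.nn as nn

class ModalAttention(nn.Module):
    """Stage 2 of MDF: fuse the m = 3 modality features (a sketch)."""

    def __init__(self, h_tv=768, h=256):
        super().__init__()
        self.tv_proj = nn.Linear(h_tv, h)      # bring O^tv to the common dimension h
        self.W_q = nn.Linear(h, h, bias=False)
        self.W_k = nn.Linear(h, h, bias=False)
        self.W_v = nn.Linear(h, h, bias=False)
        self.h = h

    def forward(self, O_tv, O_p, O_a):
        O_hat = torch.stack([self.tv_proj(O_tv), O_p, O_a], dim=1)  # (B, m, h)
        Q, K, V = self.W_q(O_hat), self.W_k(O_hat), self.W_v(O_hat)
        # Eq. (9): scaled dot-product attention between modalities.
        P = torch.softmax(Q @ K.transpose(-2, -1) / self.h ** 0.5, dim=-1)  # (B, m, m)
        # Eq. (10): attention aggregation, residual connection, row-wise flattening.
        F = (P @ V + O_hat).flatten(start_dim=1)                            # (B, m*h)
        return F
```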

4.4 Movie Graph Representation Learning - MGRL

To fully explore the structural and semantic information of movies in a unified manner, we construct a multi-modal movie graph based on movies’ directors, screenwriters, and actors. Here, the movie nodes have fused representations that are obtained in MDF based on movie-related multi-modal attributes, i.e., synopsis, poster, and trailer. To effectively extract structural information from the graph, we adopt a two-layer GCN to fine-tune the movie representations and the process is as follows:

$$\begin{aligned} \mathcal {H} = GCN(\mathcal {F}, A) = ReLU(\tilde{A} ReLU(\tilde{A} \mathcal {F} W^0) W^1), \end{aligned}$$
(11)

where \(\mathcal {H}=\{H_1, H_2, \cdots , H_N\}\) denotes the new set of movie representations and \(H_i(1\le i\le N)\) is the fine-tuned embedding of movie \(M_i\). A is the adjacency matrix of the movie graph and \(\tilde{A} =\tilde{D}^{-\frac{1}{2}} (A + I_N) \tilde{D}^{-\frac{1}{2}}\), where \(I_N\) is the identity matrix of size \(N\times N\) and N denotes the number of movies in the graph. \(\tilde{D}\) is the diagonal degree matrix of \(A + I_N\), which is defined as \(\tilde{D}_{ii} = \sum _{j=1}^{N}(A + I_N)_{ij}\). \(W^0\) and \(W^1\) are learnable parameters.
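A minimal implementation of Eq. (11) is sketched below; the hidden width and the use of plain dense matrix products (rather than sparse operations) are assumptions made for readability.

```python
import torch
import torch.nn as nn

class MovieGCN(nn.Module):
    """Two-layer GCN of Eq. (11) (a sketch; the hidden width is an assumption)."""

    def __init__(self, in_dim, hidden_dim=256, out_dim=256):
        super().__init__()
        self.W0 = nn.Linear(in_dim, hidden_dim, bias=False)
        self.W1 = nn.Linear(hidden_dim, out_dim, bias=False)

    @staticmethod
    def normalize(A):
        # \tilde{A} = D^{-1/2} (A + I_N) D^{-1/2}, with D the degree matrix of A + I_N.
        A_hat = A + torch.eye(A.size(0), device=A.device)
        d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
        return d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)

    def forward(self, F, A):
        A_norm = self.normalize(A)
        H1 = torch.relu(A_norm @ self.W0(F))    # ReLU(\tilde{A} F W^0)
        H = torch.relu(A_norm @ self.W1(H1))    # ReLU(\tilde{A} H1 W^1)
        return H

# Example: N = 5 movies with fused representations of dimension 768.
H = MovieGCN(768)(torch.randn(5, 768), torch.eye(5))
```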

4.5 Classification Layer

Ultimately, to tackle the task of MGC, we use a linear projection followed by a sigmoid function to predict the movie’s genres. This can be formally defined as:

$$\begin{aligned} \mathcal {S}^1/\mathcal {S}^2=Sigmoid(Linear(\mathcal {F}/\mathcal {H})), \end{aligned}$$
(12)

where \(Sigmoid(\cdot )\) is the activation function used to squash the output vector values to the range [0, 1], which can be interpreted as a vector of genre probabilities. Note that, as no prior work has constructed the above-mentioned movie graph, to allow a fairer comparison, the input of the classification layer can be either \(\mathcal {F}\) or \(\mathcal {H}\), and the output is accordingly either \(\mathcal {S}^1\) or \(\mathcal {S}^2\). Taking \(\mathcal {S}^1=\{S_1,S_2,\cdots ,S_N\}\) as an example, \(S_i \in \mathcal {S}^1\) is the genre probability vector of the i-th movie \(M_i\), which is denoted as \(S_i=\{s_{i1},\cdots ,s_{iL}\}\). Here, \(s_{ij} \in S_i\) represents the probability that \(M_i\) belongs to the j-th genre, and L is the number of genres.

4.6 Training

The model is optimized with the Binary Cross-Entropy Loss (BCELoss). The label sets of movies are first encoded as binary vectors, denoted as \(\mathcal {C} = \{C_1, C_2, \cdots , C_N\}\). For the i-th movie, the genre vector is \(C_i=\{c_{i1}, c_{i2}, \cdots , c_{iL}\}\), where \(c_{ij} \in \{0,1\}\) and \(c_{ij}=1\) indicates that the i-th movie belongs to the j-th genre. The loss function is formulated as follows:

$$\begin{aligned} \begin{aligned} \mathcal {L} &=BCELoss(\mathcal {C}, \mathcal {S}) \\ &= - \frac{1}{L} \sum _{i=1}^{N} \sum _{j=1}^{L} (c_{ij} \log (s_{ij})+(1-c_{ij}) \log (1-s_{ij})), \end{aligned} \end{aligned}$$
(13)

where N is the number of movies.

In addition, when adding the module MGRL, we adopt a joint loss function to guide the optimization of both MDF and MGRL:

$$\begin{aligned} \mathcal {L} = BCELoss(\mathcal {C}, \mathcal {S}^1) + BCELoss(\mathcal {C}, \mathcal {S}^2). \end{aligned}$$
(14)
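The joint objective of Eq. (14) maps directly onto PyTorch's BCELoss, as in the sketch below; note that nn.BCELoss averages over all elements by default, which differs from the 1/L normalization written in Eq. (13) only by a constant factor.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # mean over all elements; Eq. (13) writes a 1/L factor instead

def joint_loss(S1, S2, C):
    """S1, S2: (N, L) genre probabilities from F and H; C: (N, L) binary labels."""
    return bce(S1, C) + bce(S2, C)   # Eq. (14)

# Example: N = 4 movies, L = 10 genres.
S1, S2 = torch.rand(4, 10), torch.rand(4, 10)
C = torch.randint(0, 2, (4, 10)).float()
print(joint_loss(S1, S2, C))
```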

5 Experiments

5.1 Dataset

Most of the datasets used in current research are either not publicly available or have expired access paths, particularly those that contain multiple data sources such as synopses, posters, trailers, and metadata. We start by downloading the dataset Moviescope, which contains movies’ synopses, posters, and the URLs of trailers on YouTube, and then develop a Python crawler to obtain the trailers. To enable a more comprehensive evaluation of our model, we create a new dataset from Douban, the most active online movie database and review platform in China. The details of the two datasets are as follows.

The dataset Moviescope, all data sources of which are available, contains 4076 movies with 13 different genres. The dataset MovieBricks has 4063 movies with 10 different genres, namely Action, Thriller, Adventure, Story, Science-Fiction, Love, Fantasy, Comedy, Terror, and Crime. Both datasets are divided into training, validation, and testing sets in a 7:1:2 ratio. Note that a movie may belong to multiple genres at the same time.

5.2 Comparison Method

To validate the effectiveness of MFMGC, we compare its performance with those of several state-of-the-art approaches that are introduced as follows.

  • GMU [1]. This work develops a model for multi-modal learning based on gated neural networks, which is evaluated on a multi-label scenario for MGC using synopses and posters.

  • Fast-MA [8]. This work designs a temporal feature aggregator to embed video and text and compares the effectiveness of visual- and text-based methods for MGC; it is denoted as Fast Modal Attention (Fast-MA).

  • DL-PO [19]. This work proposes a simple deep-learning model to predict the genres of a movie with overview and poster. We refer to it as Deep Learning for Posters and Overviews (DL-PO).

  • CMM [18]. This is a comprehensive study that exploits a diverse set of multimedia information sources to perform MGC. We refer to it as Comprehensive Multi-modal Model (CMM).

  • MGC-RNN [4]. This work proposes a new structure based on GRU to derive spatial-temporal features of movie frames and then concatenates them with the audio features to predict the final genres of the movie. We refer to it as MGC-RNN.

5.3 Evaluation Metrics and Parameter Settings

Evaluation Metrics. AUC-ROC [11] is a well-known metric that measures the area under the receiver operating characteristic (ROC) curve. This curve plots the true positive rate against the false positive rate for each possible threshold of the classifier’s output. However, relying on a single metric cannot provide a comprehensive evaluation of a multi-label classifier. Therefore, we also calculate the F1 score, which has been widely used to evaluate multi-label classifiers [1, 4]. To globally evaluate the performance of different methods, we compute the micro and macro averages of the F1 and AUC metrics. The micro-average calculates the mean of the scores without considering genres, while the macro-average computes the score of each genre independently and takes their unweighted mean.
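These averages can be computed with scikit-learn as sketched below; the 0.5 threshold used to binarize the predicted probabilities for the F1 score is our assumption, as the paper does not state it.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def evaluate(scores, labels, threshold=0.5):
    """scores: (N, L) predicted probabilities; labels: (N, L) binary ground truth."""
    preds = (scores >= threshold).astype(int)
    return {
        "mi-f1": f1_score(labels, preds, average="micro"),
        "ma-f1": f1_score(labels, preds, average="macro"),
        "mi-auc": roc_auc_score(labels, scores, average="micro"),
        "ma-auc": roc_auc_score(labels, scores, average="macro"),
    }

# Example with random predictions over 100 movies and 10 genres.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(100, 10))
scores = rng.random((100, 10))
print(evaluate(scores, labels))
```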

Table 1. Experimental results. The used information contains Text (T), Poster (P), Audio (A), Video (V), and Movie Graph (G). Furthermore, “ma” and “mi” are used to represent the macro and micro averages.

Parameter Settings. As mentioned in Sect. 4.2, for the textual modality \(M^t\), a fixed sequence length \(l=256\) is used. For the video modality \(M^v\), we draw \(p=32\) frames from the trailer; for the audio modality \(M^a\), the number of audio segments is \(u=16\); and the hidden dimension h in our model is set to 256. To reduce the impact of random noise, all experiments are conducted using 5-fold cross-validation, and the reported results are the average of 5 runs using different data partitions. The pre-trained model RoBERTa [10] is utilized to initialize the Transformer Encoder module. To preserve the knowledge contained in the pre-trained parameters, we split the learnable parameters into two parts: the learning rate for the pre-trained parameters is set to 0.00005, while the learning rate for the randomly initialized parameters is 0.0005; they are denoted as “pre-lr” and “rand-lr”, respectively.
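Such a split is commonly realized with optimizer parameter groups, as in the sketch below; the choice of Adam and the way the pre-trained parameters are identified (here, by module) are assumptions, since the paper does not name its optimizer.

```python
import torch

# `encoder` holds the RoBERTa-initialized Transformer encoder, while `rest`
# collects the randomly initialized layers (MLPs, modal attention, GCN, classifier).
def build_optimizer(encoder, rest, pre_lr=5e-5, rand_lr=5e-4):
    return torch.optim.Adam([
        {"params": encoder.parameters(), "lr": pre_lr},   # "pre-lr"
        {"params": rest.parameters(), "lr": rand_lr},     # "rand-lr"
    ])

# Example with placeholder modules.
encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True), 2)
rest = torch.nn.Linear(768, 10)
optimizer = build_optimizer(encoder, rest)
```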

5.4 Experiment Results

Experiments are conducted on a machine with 2 NVIDIA V100 GPUs. The results are presented in Table 1. As all baselines are designed without considering the movie graph, to provide a fair comparison, we present the results of our model’s simplified version MFMGC-P, which only utilizes partial input data, i.e., the synopsis, poster, and trailer of movies. Moreover, MFMGC denotes the model that considers the movie’s metadata and multi-modal data along with the movie graph. Note that, as the movies from Moviescope contain little metadata, the movie graph cannot be constructed, so MFMGC has no result on this dataset. In addition, we only present the results of MFMGC on MovieBricks when utilizing text data and the movie graph, due to space limitations. Further results of the model, from P+G, V+G, and A+G to TPVA+G, are available on GitHub.

Main Results. As observed from Table 1, our proposed model MFMGC-P consistently achieves better performance than the baselines. Specifically, its “mi-f1” score outperforms GMU by 11.6% on MovieBricks. Even when only using poster or video data, MFMGC-P still performs better than the other methods, indicating its powerful feature extraction capability. When all multi-modal attributes are taken into account, MFMGC-P achieves even higher improvements, which demonstrates the effectiveness of the carefully designed attention-based fusion strategy. In detail, the reasons for the above observations are as follows: 1) MFMGC-P utilizes advanced pre-trained models to embed the data, whose parameters hold abundant knowledge, especially for text and images, leading to better representations than traditional models such as Word2vec and VGG. 2) We fuse the multi-modal data via the attention mechanism, which differentially weighs the significance of each part of the input data, allowing MFMGC-P to learn a comprehensive representation.

Modality Analysis. To fully investigate the impact of different modalities on the performance of the proposed model, we compare the results of MFMGC-P when using one, two, three, and all four modalities. As seen from Table 1, the results of the second part (i.e., TP, TV, TA, PV, PA, and VA) of MFMGC-P outperform those of the single-modality experiments (i.e., T, P, V, and A). Unsurprisingly, when all four modalities (i.e., TPVA) are taken into account, MFMGC-P achieves the best performance. These observations demonstrate the effectiveness of the developed module in extracting multi-modal features. Additionally, the higher performance of MFMGC (i.e., T+G) compared with MFMGC-P (i.e., T) demonstrates the significance of constructing the movie graph, which can be used to extract structural information between movies.

5.5 Parameter Analysis

To investigate the effect of different learning rates and to compare the experimental results with the above-mentioned ones more intuitively, the performance of MFMGC-P with varying “pre-lr” and “rand-lr” is reported in Fig. 3(a) and Fig. 3(b). As observed, the evaluation metrics present an overall downward trend as the learning rates increase, where the “ma-f1” score even drops by 21.4%. The reason for the decrease is that setting a lower “pre-lr” avoids forgetting the knowledge contained in the pre-trained parameters. Additionally, a too-large “rand-lr” leads to faster convergence but makes it difficult for the model to achieve the best result, as the global optimum may be missed during the iterations.

Furthermore, we analyze the effect of the hidden dimension h, and the results are reported in Fig. 3(c). While varying h from 64 to 512, we first observe an increase in the evaluation metrics, followed by a decreasing tendency, with the best performance achieved at \(h=256\). When the dimension is too small, the model cannot learn enough information, leading to under-fitting. Conversely, when h is too large, it may introduce unexpected noise, resulting in poor performance.

Fig. 3. Parameter analysis for “pre-lr”, “rand-lr”, and the hidden dimension h.

6 Conclusion

We propose MFMGC, a novel model for the task of MGC that utilizes a movie’s synopsis, poster, trailer, and metadata, and comprises two modules: MDF and MGRL. MDF leverages the attention mechanism to effectively capture the interactive information between modalities. In MGRL, we construct a graph to capture the structural relationships between movies based on directors, screenwriters, and actors, where each node in the graph is a movie that has multi-modal attributes and is first represented by MDF; a Graph Convolutional Network (GCN)-based architecture is then developed to extract structural information between movie nodes. In addition, we present a new multi-modal movie dataset, MovieBricks. The experimental results on Moviescope and MovieBricks demonstrate the superior performance of MFMGC.