1 Introduction

Fig. 1. The data, annotation, benchmark and their relations in MovieNet, which together build a holistic dataset for comprehensive movie understanding.

Fig. 2. MovieNet is a holistic dataset for movie understanding, which contains massive data from different modalities and high-quality annotations in different aspects. Here we show some data (in blue) and annotations (in green) of Titanic in MovieNet. (Color figure online)

“You jump, I jump, right?” When Rose gives up the lifeboat and exclaims this to Jack, we are all deeply touched by the beautiful, moving love story told by the movie Titanic. As the saying goes, “Movies dazzle us, entertain us, educate us, and delight us”. Movies, in which characters face various situations and perform various behaviors in various scenarios, are a reflection of our real world. They teach us a lot, such as stories that took place in the past, the culture and customs of a country or a place, and how people react and interact in different situations. Therefore, to understand movies is to understand our world.

This holds not only for humans, but also for artificial intelligence systems. We believe that movie understanding is a good arena for high-level machine intelligence, considering its high complexity and close relation to the real world. Moreover, compared to web images [15] and short videos [7], the hundreds of thousands of movies produced throughout history, with their rich content and multi-modal information, are better nutrition for data-hungry deep models.

Motivated by the insight above, we build a holistic dataset for movie understanding named MovieNet in this paper. As shown in Fig. 1, MovieNet comprises three important aspects, namely data, annotation, and benchmark.

First of all, MovieNet contains a large volume of data in multiple modalities, including movies, trailers, photos, subtitles, scripts, and meta information such as genres, cast, director, and rating. In total, MovieNet contains 3K hour-long videos, 3.9M photos, 10M sentences of text, and 7M items of meta information.

From the annotation aspect, MovieNet contains massive labels to support different research topics in movie understanding. Based on the belief that middle-level entities, e.g. characters and places, are important for high-level story understanding, various kinds of annotations on semantic elements are provided in MovieNet, including character bounding boxes and identities, scene boundaries, action/place tags, and natural-language descriptions aligned to movie segments. In addition, since a movie is a work of filming art, cinematic styles, e.g. view scale, camera motion, and lighting, are also beneficial for comprehensive video analysis. Thus we also annotate the view scale and camera motion for more than 46K shots. Specifically, the annotations in MovieNet include: (1) 1.1M characters with bounding boxes and identities; (2) 40K scene boundaries; (3) 65K tags of action and place; (4) 12K description sentences aligned to movie segments; (5) 92K tags of cinematic style.

Based on the data and annotations in MovieNet, we explore several research topics that cover different aspects of movie understanding, i.e. genre analysis, cinematic style prediction, character analysis, scene understanding, and movie segment retrieval. For each topic, we set up one or several challenging benchmarks and run extensive experiments to present the performance of different methods. By further analyzing the experimental results, we also show the gap between current approaches and comprehensive movie understanding, as well as the advantages of holistic annotations for thorough video analytics.

To the best of our knowledge, MovieNet is the first holistic dataset for movie understanding that contains a large amount of data from different modalities and high-quality annotations in different aspects. We hope that it will promote research on video editing, human-centric situation understanding, story-based video analytics, and beyond.

2 Related Datasets

Existing Works. Most datasets for movie understanding focus on a specific element of movies, e.g. genre [49, 66], character [1, 3, 19, 26, 29, 39, 51], action [5, 6, 18, 32, 37], scene [11, 14, 25, 40, 41, 43] and description [47]. Moreover, their scale is quite small and their annotations are limited. For example, [3, 19, 51] take several episodes from TV series for character identification, [32] uses clips from twelve movies for action recognition, and [40] exploits scene segmentation with only three movies. Although these datasets cover some important aspects of movie understanding, their scale is insufficient for the data-hungry learning paradigm. Furthermore, deep comprehension should go from middle-level elements to the high-level story, while each existing dataset can only support a single task, which hinders comprehensive movie understanding.

MovieQA. MovieQA [54] consists of 15K questions designed for 408 movies. As sources of information, it contains video clips, plots, subtitles, scripts, and DVS (Descriptive Video Service). Evaluating story understanding by QA is a good idea, but there are two problems. (1) Middle-level annotations, e.g. character identities, are missing, so it is hard to develop an effective approach towards high-level understanding. (2) The questions in MovieQA come from the wiki plot, so it is more of a textual QA problem than story-based video understanding. Strong evidence for this is that approaches based on the textual plot achieve much higher accuracy than those based on “video+subtitle”.

LSMDC. LSMDC [45] consists of 200 movies with audio description (AD), which provides linguistic descriptions of movies for visually impaired people. AD is quite different from the natural descriptions of most audiences, which limits the usage of models trained on such datasets, and it is also hard to obtain a large number of ADs. Different from previous work [45, 54], we provide multiple sources of textual information and different annotations of middle-level entities in MovieNet, leading to a better source for story-based video understanding.

AVA. Recently, the AVA dataset [24] was proposed: an action recognition dataset with 430 15-minute movie clips annotated with 80 spatio-temporal atomic visual actions. It aims at facilitating the task of recognizing atomic visual actions. However, regarding the goal of story understanding, the AVA dataset is not applicable since (1) the dataset is dominated by labels like stand and sit, making it extremely unbalanced, and (2) actions like stand, talk, and watch carry little information from the perspective of story analytics. Hence, we propose to annotate semantic-level actions for both action recognition and story understanding tasks.

MovieGraphs. MovieGraphs [55] is the most closely related work; it provides graph-based annotations of social situations depicted in clips of 51 movies. The annotations consist of characters, interactions, attributes, etc. Although sharing the same idea of multi-level annotations, MovieNet differs from MovieGraphs in three aspects: (1) MovieNet contains not only movie clips and annotations, but also photos, subtitles, scripts, trailers, etc., which provide richer data for various research topics. (2) MovieNet supports and explores different aspects of movie understanding, while MovieGraphs focuses on situation recognition only. (3) The scale of MovieNet is much larger than that of MovieGraphs.

Table 1. Comparison between MovieNet and related datasets in terms of data.
Table 2. Comparison between MovieNet and related datasets in terms of annotation.

3 Visit MovieNet: Data and Annotation

MovieNet contains various kinds of data from multiple modalities and high-quality annotations on different aspects of movie understanding. Figure 2 shows the data and annotations of the movie Titanic in MovieNet. Comparisons between MovieNet and other datasets for movie understanding are shown in Tables 1 and 2. All of these demonstrate the tremendous advantage of MovieNet in quality, scale, and richness.

3.1 Data in MovieNet

Movie. We carefully selected and purchased copies of 1,100 movies, with the criteria that each movie is (1) colored and (2) longer than 1 h, and that the collection (3) covers a wide range of genres, years, and countries.

Metadata. We get the meta information of the movies from IMDb and TMDb, including title, release date, country, genres, rating, runtime, director, cast, storyline, etc. Here we briefly introduce some of the key elements; please refer to the supplementary material for details: (1) Genre is one of the most important attributes of a movie. There are in total 805K genre tags from 28 unique genres in MovieNet. (2) For cast, we collect their names, IMDb IDs, and the names of the characters they play. (3) We also provide the IMDb ID, TMDb ID, and Douban ID of each movie, with which researchers can conveniently obtain additional meta information from these websites. The total number of meta information items in MovieNet is 375K. Please note that each kind of data by itself, even without the movie, can support some research topics [31], so we collect as much of each kind of data as we can. This is why the number here is larger than 1,100; the same holds for the other kinds of data introduced below.
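To make the structure of this meta information concrete, below is a minimal illustrative sketch of what a single per-movie record could look like. The field names and values are hypothetical examples chosen for this description, not MovieNet's actual schema.

```python
# Hypothetical illustration of a per-movie metadata record (not the actual MovieNet schema).
titanic_meta = {
    "imdb_id": "tt0120338",          # IMDb ID, also usable to fetch extra info from IMDb
    "tmdb_id": 597,                  # TMDb ID
    "douban_id": 1292722,            # Douban ID
    "title": "Titanic",
    "release_date": "1997-12-19",
    "country": "USA",
    "genres": ["Drama", "Romance"],
    "rating": 7.9,
    "runtime_min": 194,
    "director": ["James Cameron"],
    "cast": [                        # name, IMDb ID and character name, top cast in credits order
        {"name": "Leonardo DiCaprio", "imdb_id": "nm0000138", "character": "Jack Dawson"},
        {"name": "Kate Winslet", "imdb_id": "nm0000701", "character": "Rose DeWitt Bukater"},
    ],
    "storyline": "A seventeen-year-old aristocrat falls in love with a poor artist ...",
}

# Example query over such records: filter a collection by genre.
def movies_with_genre(records, genre):
    return [r for r in records if genre in r.get("genres", [])]
```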

Subtitle. The subtitles are obtained in two ways. Some are extracted from the subtitle stream embedded in the movies. For movies without an original English subtitle, we crawl subtitles from YIFY. All subtitles are manually checked to ensure that they are aligned to the movies.

Trailer. We download the trailers from YouTube according to their links on IMDb and TMDb. We found that this scheme is better than previous work [10], which used titles to search for trailers on YouTube, since the trailer links on IMDb and TMDb have been manually checked by the organizers and audiences. In total, we collect 60K trailers belonging to 33K unique movies.

Script. A script, which narrates the movement, actions, expressions, and dialogs of the characters, is a valuable textual source for research on movie–language association. We collect around 2K scripts from IMSDb and Daily Script. The scripts are aligned to the movies by matching their dialog with the subtitles.
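As noted above, scripts are aligned by matching dialog lines with the time-stamped subtitles. The sketch below illustrates one simple way such an alignment step could work, assuming a greedy, monotonic fuzzy-matching strategy; the exact procedure used for MovieNet may differ.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Fuzzy similarity between a script dialogue line and a subtitle line."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def align_script_to_subtitles(script_dialogs, subtitles, threshold=0.8):
    """Greedy monotonic alignment: for each script dialogue line, find the next
    subtitle (with a known timestamp) that matches it closely enough.

    script_dialogs: list of dialogue strings in script order.
    subtitles: list of (start_sec, end_sec, text) tuples in time order.
    Returns a list of (script_index, subtitle_index) pairs.
    """
    alignment, j = [], 0
    for i, line in enumerate(script_dialogs):
        for k in range(j, len(subtitles)):
            if similarity(line, subtitles[k][2]) >= threshold:
                alignment.append((i, k))
                j = k + 1  # keep the alignment monotonic in time
                break
    return alignment
```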

Synopsis. A synopsis is a description of the story of a movie written by audiences. We collect 11K high-quality synopses from IMDb, all of which contain more than 50 sentences. The synopses are also manually aligned to the movies, as will be introduced in Sect. 3.2.

Photo. We collect 3.9M photos of the movies from IMDb and TMDb, including posters, still frames, publicity photos, production art, product photos, behind-the-scenes photos, and event photos.

3.2 Annotation in MovieNet

To provide a high-quality dataset supporting different research topics in movie understanding, we made a great effort to clean the data and manually annotate various labels on different aspects, including character, scene, event, and cinematic style. Due to space limits, here we only describe the content and the amount of the annotations; please refer to the supplementary material for details.

Cinematic Styles. Cinematic style, such as view scale, camera movement, lighting, and color, is an important aspect of comprehensive movie understanding since it influences how the story is told in a movie. In MovieNet, we choose two kinds of cinematic tags for study, namely view scale and camera movement. Specifically, the view scale includes five categories, i.e. long shot, full shot, medium shot, close-up shot, and extreme close-up shot, while the camera movement is divided into four classes, i.e. static shot, pans-and-tilts shot, zoom in, and zoom out. The original definitions of these categories come from [22] and we simplify them for research convenience. In total, we annotate 47K shots from movies and trailers, each with one tag of view scale and one tag of camera movement.

Character Bounding Box and Identity. Persons play an important role in human-centric videos like movies. Thus, detecting and identifying characters is foundational work towards movie understanding. The annotation process for character bounding boxes and identities contains 4 steps: (1) 758K key frames are selected from different movies for bounding box annotation. (2) A detector is trained with the annotations from step 1. (3) We use the trained detector to detect more characters in the movies and manually clean the detected bounding boxes. (4) We then manually annotate the identities of all the characters. To keep the cost affordable, we only keep the top 10 cast in credits order according to IMDb, which covers the main characters of most movies. Characters not belonging to the credited cast are labeled as “others”. In total, we obtain 1.1M instances of 3,087 unique credited cast members and 364K “others”.

Scene Boundary. In terms of temporal structure, a movie has two hierarchical levels: shot and scene. A shot is the minimal visual unit of a movie, while a scene is a sequence of continuous shots that are semantically related. Capturing this hierarchical structure is important for movie understanding. Shot boundary detection has been well solved by [48], while scene boundary detection, also called scene segmentation, remains an open question. In MovieNet, we manually annotate scene boundaries to support research on scene segmentation, resulting in 42K scenes.

Action/Place Tags. To understand the events that happen within a scene, action and place tags are required. Hence, we first split each movie into clips according to the scene boundaries and then manually annotate place and action tags for each segment. For place annotation, each clip is annotated with multiple place tags, e.g. {deck, cabin}. For action annotation, we first detect sub-clips that contain characters and actions, then assign multiple action tags to each sub-clip. We have made the following efforts to keep the tags diverse and informative: (1) we encourage the annotators to create new tags; (2) tags that convey little information for story understanding, e.g. stand and talk, are excluded. Finally, we merge the tags and keep 80 action and 90 place tags with a minimum frequency of 25 as the final annotations. In total, there are 42K segments with 19.6K place tags and 45K action tags.

Description Alignment. Since an event is more complex than a character or a scene, a proper way to represent it is to describe it with natural language. Previous works have aligned scripts [37], Descriptive Video Service (DVS) [45], books [67], or wiki plots [52,53,54] to movies. However, books cannot be well aligned since most movies differ substantially from the books they are based on; DVS transcripts are quite hard to obtain, limiting the scale of datasets based on them [45]; and a wiki plot is usually a short summary that cannot cover all the important events of a movie. Considering these issues, we choose synopses as the story descriptions in MovieNet. The associations between movie segments and synopsis paragraphs are manually annotated by three different annotators in a coarse-to-fine procedure. Finally, we obtain 4,208 highly consistent paragraph–segment pairs.

Table 3. (a) Comparison between MovieNet and other benchmarks for genre analysis. (b) Results of some baselines for genre classification in MovieNet
Fig. 3. (a) Framework of genre analysis in movies. (b) Some samples of genre-guided trailer generation for the movie Titanic.

4 Play with MovieNet: Benchmark and Analysis

With a large amount of data and holistic annotations, MovieNet can support various research topics. In this section, we analyze movies from five aspects, namely genre, cinematic style, character, scene, and story. For each topic, we set up one or several benchmarks based on MovieNet. Baselines built with currently popular techniques and analyses of the experimental results are also provided to show the potential impact of MovieNet on various tasks. These tasks cover different perspectives of comprehensive movie understanding, but due to space limits we can only touch the tip of the iceberg here. More detailed analyses are provided in the supplementary material, and more interesting topics to be explored are introduced in Sect. 5.

4.1 Genre Analysis

Genre is a key attribute of any media with artistic elements. Classifying the genres of movies has been widely studied by previous works [10, 49, 66], but these works have two drawbacks: (1) the scale of existing datasets is quite small, and (2) they all focus on image or trailer classification while ignoring a more important problem, i.e. how to analyze the genres of a long video.

MovieNet provides a large-scale benchmark for genre analysis, which contains 1.1K movies, 68K trailers, and 1.6M photos. The comparison between different datasets is shown in Table 3a, from which we can see that MovieNet is much larger than previous datasets.

Based on MovieNet, we first provide baselines for both image-based and video-based genre classification; the results are shown in Table 3b. Comparing the results of genre classification on small datasets [10, 49] with ours on MovieNet, we find that the performance drops a lot as the scale of the dataset becomes larger. The newly proposed MovieNet brings two challenges to previous methods. (1) Genre classification in MovieNet becomes a long-tail recognition problem where the label distribution is extremely unbalanced; for example, the number of “Drama” tags is 40 times larger than that of “Sport” in MovieNet. (2) Genre is a high-level semantic tag depending on the actions, clothing, and facial expressions of the characters, and even the background music. Current methods are good at visual representation, but when facing a problem that requires higher-level semantics, they all fail. We hope MovieNet will promote research on these challenging topics.

Another new issue is how to analyze the genres of an entire movie. Since a movie is extremely long and not all segments are related to its genres, this problem is much more challenging. Following the idea of learning from trailers and applying to movies [30], we adopt the visual model trained on trailers as a shot-level feature extractor. The features are then fed to a temporal model to capture the temporal structure of the movie. The overall framework is shown in Fig. 3a. With this approach, we can obtain the genre response curve of a movie, i.e. predict which parts of the movie are most relevant to a specific genre. Moreover, the prediction can also be used for genre-guided trailer generation, as shown in Fig. 3b. From the analysis above, we can see that MovieNet will promote the development of this challenging and valuable research topic.
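A minimal sketch of this trailer-to-movie transfer idea is given below: shot-level features (assumed to come from a backbone trained on trailers with genre labels) are fed to a temporal model that outputs per-shot genre responses, i.e. the genre response curve. The layer sizes, the choice of a GRU, and the max-pooling scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GenreResponseModel(nn.Module):
    """Sketch: shot-level features -> temporal model -> per-shot genre responses."""
    def __init__(self, feat_dim=2048, hidden_dim=512, num_genres=28):
        super().__init__()
        self.temporal = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_genres)

    def forward(self, shot_feats):                # (B, T, feat_dim)
        h, _ = self.temporal(shot_feats)          # (B, T, 2*hidden_dim)
        response = torch.sigmoid(self.classifier(h))   # per-shot, per-genre response curve
        movie_pred = response.max(dim=1).values        # movie-level prediction by pooling over time
        return response, movie_pred

# Shots with a high response for a chosen genre could then be stitched into a
# genre-guided trailer, in the spirit of Fig. 3b.
model = GenreResponseModel()
feats = torch.randn(1, 300, 2048)                 # e.g. 300 shot features from one movie
response, movie_pred = model(feats)
```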

Table 4. (a) Comparison between MovieNet and other benchmarks for cinematic style prediction. (b) Results of some baselines for cinematic style prediction in MovieNet
Table 5. Datasets for person analysis.
Fig. 4. Persons in different data sources.

Table 6. Results of (a) Character detection and (b) Character identification

4.2 Cinematic Style Analysis

As we mentioned before, cinematic style concerns how the story is presented to the audience from the perspective of filming art. For example, a zoom-in shot is usually used to draw the audience's attention to a specific object. Cinematic style is crucial for both video understanding and editing, yet few works focus on this topic and there is no large-scale dataset for it either.

Based on the cinematic style tags annotated in MovieNet, we set up a benchmark for cinematic style prediction. Specifically, we aim to recognize the view scale and camera motion of each shot. Compared to existing datasets, MovieNet is the first to cover both view scale and camera motion, and it is also much larger, as shown in Table 4a. Several models for video clip classification, such as TSN [57] and I3D [9], are applied to this problem; the results are shown in Table 4b. Since the view scale depends on the portion of the frame occupied by the subject, detecting the subject is important for cinematic style prediction. Here we adopt an approach from saliency detection [16] to obtain subject maps for each shot, with which better performance is achieved, as shown in Table 4b. Although utilizing the subject points out a direction for this task, there is still a long way to go. We hope that MovieNet can promote the development of this important but neglected topic in video understanding.
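The idea of using a subject map to help shot-level prediction can be sketched as follows. The saliency network is assumed to be given (e.g. following [16]), and the feature dimensions and the fusion-by-masking scheme are illustrative assumptions rather than the exact model evaluated in Table 4b.

```python
import torch
import torch.nn as nn

class SubjectGuidedShotClassifier(nn.Module):
    """Sketch: weight per-frame features by a subject (saliency) map, then predict
    view scale and camera motion for the shot."""
    def __init__(self, feat_dim=2048, num_scales=5, num_motions=4):
        super().__init__()
        self.scale_head = nn.Linear(feat_dim, num_scales)    # long / full / medium / close-up / extreme close-up
        self.motion_head = nn.Linear(feat_dim, num_motions)  # static / pans-and-tilts / zoom in / zoom out

    def forward(self, frame_feats, subject_maps):
        # frame_feats:  (B, T, C, H, W) spatial features of sampled frames
        # subject_maps: (B, T, 1, H, W) saliency maps normalized to [0, 1]
        weighted = frame_feats * subject_maps                 # emphasize the subject region
        pooled = weighted.mean(dim=(1, 3, 4))                 # pool over time and space -> (B, C)
        return self.scale_head(pooled), self.motion_head(pooled)
```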

4.3 Character Recognition

Existing works [36, 55, 58] have shown that movies are human-centric videos in which characters play an important role. Therefore, detecting and identifying characters is crucial for movie understanding. Although person/character recognition is not a new task, previous works either focus on other data sources [33, 35, 64] or on small-scale benchmarks [3, 26, 51], so their results are not convincing for character recognition in movies.

Table 7. Dataset for scene analysis.
Table 8. Datasets for story understanding in movies in terms of (1) number of sentences per movie; (2) duration (second) per segment.
Table 9. Results of scene segmentation
Table 10. Results of scene tagging

We propose two benchmarks for character analysis in movies, namely character detection and character identification, and provide more than 1.1M instances from 3,087 identities to support them. As shown in Table 5, MovieNet contains many more instances and identities than popular datasets for person analysis. The following paragraphs analyze character detection and identification respectively.

Character Detection. Images from different data sources have a large domain gap, as shown in Fig. 4. Therefore, a character detector trained on a general object detection dataset, e.g. COCO [35], or a pedestrian dataset, e.g. CalTech [17], is not good enough for detecting characters in movies, as supported by the results in Table 6a. To obtain a better character detector, we train several popular models [8, 34, 44] on MovieNet using the toolboxes from [12, 13]. With the diverse character instances in MovieNet, a Cascade R-CNN trained on MovieNet achieves extremely high performance, i.e. 95.17% mAP. That is to say, character detection can be well solved by a large-scale movie dataset with current state-of-the-art detection models. This powerful detector will then benefit research on character analysis in movies.
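As an illustration of the training setup (fine-tuning a person detector on annotated movie frames), here is a small sketch using torchvision's Faster R-CNN as a stand-in. The paper's reported numbers are obtained with Cascade R-CNN trained via the toolboxes in [12, 13], so this is only a simplified example of the same idea.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_character_detector(num_classes=2):   # background + character
    """Start from a COCO-pretrained detector and replace the box head for characters."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

model = build_character_detector()
model.train()
images = [torch.rand(3, 480, 854)]                          # one movie key frame (toy example)
targets = [{"boxes": torch.tensor([[100., 50., 300., 420.]]),
            "labels": torch.tensor([1])}]                   # one annotated character box
loss_dict = model(images, targets)                           # dict of detection losses
loss = sum(loss_dict.values())
loss.backward()
```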

Character Identification. Identifying the characters in movies is a more challenging problem, as can be observed from the diverse samples in Fig. 4. We conduct different experiments based on MovieNet; the results are shown in Table 6b. From these results, we can see that: (1) models trained on ReID datasets perform poorly for character recognition due to the domain gap; (2) aggregating different visual cues of an instance is important for character recognition in movies; (3) the current state of the art achieves 75.95% mAP, which shows that this is a challenging problem that needs to be further explored.
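One simple way to realize point (2), aggregating multiple visual cues per instance, is sketched below: body and (optional) face embeddings are normalized and fused before nearest-neighbor matching against a per-cast gallery. The weighting and matching scheme are illustrative assumptions, not the exact method behind the 75.95% mAP result.

```python
import torch
import torch.nn.functional as F

def aggregate_instance_embedding(body_feat, face_feat=None, w_body=0.5, w_face=0.5):
    """Combine visual cues of one character instance into a single embedding.
    The face cue may be missing (e.g. back views); the weights are illustrative."""
    body_feat = F.normalize(body_feat, dim=-1)
    if face_feat is None:
        return body_feat
    face_feat = F.normalize(face_feat, dim=-1)
    return F.normalize(w_body * body_feat + w_face * face_feat, dim=-1)

def identify(instance_emb, cast_gallery):
    """cast_gallery: dict mapping cast id -> averaged embedding of that cast member."""
    ids = list(cast_gallery.keys())
    gallery = torch.stack([cast_gallery[i] for i in ids])    # (N, D)
    sims = gallery @ instance_emb                             # cosine similarities
    return ids[int(sims.argmax())], sims.max().item()
```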

4.4 Scene Analysis

As mentioned before, the scene is the basic semantic unit of a movie, so analyzing the scenes in movies is important. The key problems in scene understanding are where the scene boundaries are and what the content of each scene is. As shown in Table 7, MovieNet, which contains more than 43K scene boundaries and 65K action/place tags, is the only dataset that can support both scene segmentation and scene tagging. Moreover, the scale of MovieNet is also larger than that of all previous works.

Fig. 5. Example of a synopsis paragraph and a movie segment in MovieNet-MSR. It demonstrates the spatio-temporal structures of stories in movies and synopses. We can also see that character, action, and place are the key elements for story understanding.

Scene Segmentation. We first test some baselines [2, 46] for scene segmentation. In addition, we propose a sequential model, named Multi-Semantic LSTM (MS-LSTM), based on Bi-LSTMs [23, 42], to study the gain brought by multi-modality and multiple semantic elements, including audio, character, action, and scene. From the results shown in Table 9, we can see that (1) benefiting from the large scale and high diversity, models trained on MovieNet achieve better performance, and (2) multi-modality and multiple semantic elements are important for scene segmentation, substantially raising the performance.
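For concreteness, a minimal sketch in the spirit of MS-LSTM is given below: per-shot features from several modalities and semantic elements are concatenated and passed through a Bi-LSTM that predicts, for each shot, whether a scene boundary follows it. The dimensions and the fusion-by-concatenation scheme are illustrative assumptions; the actual MS-LSTM configuration is described in the supplementary material.

```python
import torch
import torch.nn as nn

class SceneBoundaryBiLSTM(nn.Module):
    """Sketch: fuse multi-semantic shot features, run a Bi-LSTM over the shot
    sequence, and predict a boundary logit per shot."""
    def __init__(self, dims=(2048, 512, 512, 512), hidden=512):
        # dims: feature sizes for e.g. (appearance, audio, character, action)
        super().__init__()
        self.lstm = nn.LSTM(sum(dims), hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, *shot_feats):                 # each tensor: (B, T, dim_i)
        x = torch.cat(shot_feats, dim=-1)           # late fusion by concatenation
        h, _ = self.lstm(x)
        return self.head(h).squeeze(-1)             # (B, T) boundary logits per shot
```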

Action/Place Tagging. To further understand the stories within a movie, it is essential to analyze the key elements of storytelling, i.e. place and action. We introduce two benchmarks here. First, for action analysis, the task is multi-label action recognition, which aims to recognize all the human actions or interactions in a given video clip. We implement three standard action recognition models, i.e. TSN [57], I3D [9], and a SlowFast Network [20] modified from [63], in our experiments; the results are shown in Table 10. For place analysis, we propose another benchmark for multi-label place classification, adopting I3D [9] and TSN [57] as baseline models; the results are also shown in Table 10. From the results, we can see that action and place tagging is an extremely challenging problem due to the high diversity of instances.
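Multi-label tagging of this kind is typically trained with an independent sigmoid per tag and binary cross-entropy on top of clip-level features from a backbone such as TSN or I3D. The toy sketch below illustrates that setup; the 80-tag action space matches Sect. 3.2, but the linear head and the 0.5 threshold are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Minimal multi-label tagging head on top of clip-level backbone features.
num_actions = 80
head = nn.Linear(2048, num_actions)
criterion = nn.BCEWithLogitsLoss()       # per-tag binary cross-entropy

clip_feat = torch.randn(4, 2048)         # batch of 4 clip-level features
target = torch.zeros(4, num_actions)
target[0, [3, 17]] = 1.0                 # e.g. the first clip carries two action tags

loss = criterion(head(clip_feat), target)
probs = torch.sigmoid(head(clip_feat))   # independent per-tag probabilities
predicted_tags = (probs > 0.5).nonzero() # (clip index, tag index) pairs above threshold
```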

4.5 Story Understanding

Web videos have been broadly adopted in previous works [7, 60] as the source for video understanding. Compared to web videos, the most distinguishing feature of movies is the story: movies are created to tell stories, and the most explicit way to demonstrate a story is to describe it in natural language, e.g. with a synopsis. Inspired by these observations, we choose the task of movie segment retrieval with natural language to analyze the stories in movies. Based on the aligned synopses in MovieNet, we set up a benchmark for movie segment retrieval: given a synopsis paragraph, we aim to find the most relevant movie segment that covers the story described in the paragraph. This is a very challenging task due to the rich content in movies and the high-level semantic descriptions in synopses. Table 8 compares our benchmark with other related datasets; our descriptions are more complex than those of MovieQA [54], while our segments are longer and contain more information than those of MovieGraphs [55].

Generally speaking, a story can be summarized as “somebody does something at some time in some place”. As shown in Fig. 5, a story represented either in language or in video can be composed as a sequence of {character, action, place} graphs. That is to say, to understand a story is to (1) recognize the key elements of storytelling, namely character, action, place, etc., and (2) analyze the spatio-temporal structures of both the movie and the synopsis. Hence, our method first leverages middle-level entities (e.g. character, scene) as well as multi-modality (e.g. subtitle) to assist retrieval, and then explores the spatio-temporal structure of both movies and synopses by formulating middle-level entities into graph structures, as illustrated below. Please refer to the supplementary material for details.
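The {character, action, place} graph view of a story can be made concrete with a small illustrative data structure, shown below. The field layout and the example event are hypothetical and only meant to convey the formulation, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# One graph per event; a story is an ordered sequence of such graphs, built either
# from a synopsis paragraph or from a movie segment.
@dataclass
class EventGraph:
    characters: List[str]                                           # nodes
    actions: List[Tuple[str, str, Optional[str]]] = field(default_factory=list)  # (subject, action, object)
    place: str = "unknown"

@dataclass
class Story:
    events: List[EventGraph]                                        # temporal order is preserved

# Hypothetical example event from Titanic.
jack_saves_rose = EventGraph(
    characters=["Jack", "Rose"],
    actions=[("Jack", "save", "Rose")],
    place="deck",
)
story = Story(events=[jack_saves_rose])
```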

Using Middle-Level Entities and Multi-modality. We adopt VSE [21] as our baseline model, where the vision and language features are embedded into a joint space. Specifically, the feature of a paragraph is obtained by averaging the Word2Vec [38] features of its sentences, while the visual feature is obtained by averaging the appearance features extracted with ResNet [27] on each shot. We add a subtitle feature to enhance the visual feature, and then aggregate different semantic elements, including character, action, and cinematic style, in our framework. We are able to obtain action and character features thanks to the models trained on the other MovieNet benchmarks, e.g. action recognition and character detection. Furthermore, we observe that the relevant elements vary under different cinematic styles; for example, we should focus more on the actions in a full shot but more on the characters and dialog in a close-up shot. Motivated by this observation, we propose a cinematic-style-guided attention module that predicts the weights over the elements (e.g. action, character) within a shot, which are used to enhance the visual features; a sketch is given below. The experimental results are shown in Table 11. They show that considering the different elements of movies improves the performance a lot, and that a holistic dataset with holistic annotations supporting middle-level entity analysis is important for movie understanding.
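The cinematic-style-guided attention idea mentioned above could look roughly like the sketch below: the shot's predicted style distribution is mapped to one weight per semantic element, and the weighted element features are fused into the shot representation. The number of style classes, the element set, and fusion by weighted sum are illustrative assumptions rather than the exact module used for Table 11.

```python
import torch
import torch.nn as nn

class StyleGuidedAttention(nn.Module):
    """Sketch: map a shot's cinematic style distribution to attention weights over
    its element features (e.g. appearance, subtitle, action, character), then fuse."""
    def __init__(self, num_styles=9, num_elements=4, feat_dim=512):
        super().__init__()
        # num_styles=9 assumes 5 view-scale + 4 camera-motion classes (an assumption).
        self.to_weights = nn.Sequential(nn.Linear(num_styles, num_elements), nn.Softmax(dim=-1))

    def forward(self, style_probs, element_feats):
        # style_probs:   (B, num_styles)            predicted style distribution of the shot
        # element_feats: (B, num_elements, feat_dim) features of each semantic element
        w = self.to_weights(style_probs).unsqueeze(-1)   # (B, num_elements, 1)
        return (w * element_feats).sum(dim=1)            # (B, feat_dim) fused shot feature
```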

Exploring Spatial-Temporal Graph Structures in Movies and Synopses. Simply adding different middle-level entities already improves the result. Moreover, as shown in Fig. 5, stories in movies and synopses exhibit two important structures: (1) temporally, a story can be composed as a sequence of events following a certain order; (2) spatially, the relations among middle-level elements, e.g. character co-existence and their interactions, can be formulated as graphs. We implement the method in [59] to formulate these structures as two graph matching problems. The results are shown in Table 11. Leveraging the graph formulation of the internal structures of stories in movies and synopses, the retrieval performance can be further boosted, which in turn shows that the challenging MovieNet provides a better source for story-based movie understanding.

Table 11. Results of movie segment retrieval. Here, G stands for global appearance feature, S for subtitle feature, A for action, P for character and C for cinematic style.

5 Discussion and Future Work

In this paper, we introduce MovieNet, a holistic dataset containing annotations of different aspects to support comprehensive movie understanding. We introduce several challenging benchmarks on different aspects of movie understanding, i.e. discovering filming art, recognizing middle-level entities, and understanding high-level semantics like stories. Furthermore, the results of movie segment retrieval demonstrate that integrating filming art and middle-level entities according to the internal structure of movies is helpful for story understanding. This, in turn, shows the effectiveness of holistic annotations.

In the future, our work will proceed in two directions. (1) Extending the annotation: we will further extend the dataset to include more movies and annotations. (2) Exploring more approaches and topics: to tackle the challenging tasks proposed above, we will explore more effective approaches. Besides, more meaningful and practical topics can be addressed with MovieNet, such as movie deoldify, trailer generation, etc.