1 Introduction

Over recent years, video has come to constitute a significant portion of the data that populates the web. This is largely because the production and distribution of video has shifted from a complex and costly endeavor to something accessible to anybody with a smartphone or similar device and an Internet connection. This growth of content has enabled new possibilities in the various research areas able to make use of it. Despite the access to such large amounts of data, there remains a need for standardized datasets for computer vision and multimedia tasks, and multiple such datasets have been proposed over the years. A prominent example of a video dataset is the IACC [5], which has been used for several years in international evaluation campaigns such as TRECVid [2]. Other examples in the video context include the YFCC100M [8], which, despite being sourced from the photo-sharing platform Flickr, contains a considerable amount of video material; the Movie Memorability Database [4], which comprises memorable sequences from 100 Hollywood-quality movies; and the YouTube-8M dataset [1], which, in contrast, does not contain the original videos themselves despite being sourced from YouTube. The content of all of these collections does, however, differ substantially from the type of web video commonly found ‘in the wild’ [7].

In this paper, we present the Vimeo Creative Commons Collection, or V3C for short. It is composed of 28’450 videos collected from the video sharing platform Vimeo. Apart from the videos themselves, the collection includes metadata and shot-segmentation data for each video, together with the resulting keyframes in original as well as reduced resolution. The objective of V3C is to eventually complement or even replace existing collections in real-world video retrieval evaluation campaigns and thus to tailor the latter more closely to the type of video found on the Internet.

The remainder of this paper is structured as follows: Sect. 2 gives an overview of how the collection was assembled, Sect. 3 introduces the collection itself, its structure and some of its properties, and Sect. 4 concludes.

2 Collection Process

The requirements for usable video sources from which to compile a collection were as follows:

  • The platform must be freely accessible.

  • It must host a large amount of diverse and contemporary video content.

  • At least a portion of the content must be published under a Creative Commons license so that it can be redistributed in such a collection.

Two candidates for such a source are Vimeo and YouTube. Vimeo was chosen over YouTube: while YouTube offers its users the option of publishing videos under a Creative Commons attribution license, which would allow the reuse and redistribution of the video material, YouTube’s Terms of Service [9] explicitly forbid downloading any video from the platform for any reason other than playback in the context of a video stream.

We utilized the Vimeo categorization system for video collection. Videos are placed into 16 broad categories, which are further divided into subcategories. Videos in each category were examined to determine whether they satisfied the ‘real world’ requirements for the collection. Four top-level categories were included in their entirety, three were excluded entirely, and for the remaining nine categories only some subcategories were included. The four fully included categories are ‘Personal’, ‘Documentary’, ‘Sports’ and ‘Travel’.

An overview of the excluded categories is given in Fig. 1. Categories with very low visual diversity (such as ‘Talks’) or that did not represent real-world scenarios were removed. Categories (or subcategories) containing a large amount of animation/graphics, or non-standard content with little or no describable activity, were likewise excluded. Videos from the selected categories were then filtered by duration and license, as sketched below.
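
A minimal sketch of this filtering step is given below; the record layout (a ‘duration’ field in seconds and a short ‘license’ code) is a hypothetical stand-in for the actual Vimeo metadata, and the set of license codes is an assumption:

    # Hypothetical candidate records; field names do not reflect the real
    # Vimeo API schema. Duration bounds follow the collection description
    # (3 to 60 minutes per video).
    CC_LICENSES = {"by", "by-sa", "by-nd", "by-nc", "by-nc-sa", "by-nc-nd", "cc0"}

    def is_candidate(video: dict) -> bool:
        """Keep videos between 3 and 60 minutes that carry a CC license."""
        minutes = video["duration"] / 60.0
        return 3.0 <= minutes <= 60.0 and video.get("license") in CC_LICENSES

    listing = [
        {"id": 1, "duration": 240, "license": "by"},   # 4 min, CC-BY: kept
        {"id": 2, "duration": 90, "license": "by"},    # too short: dropped
        {"id": 3, "duration": 600, "license": None},   # not CC: dropped
    ]
    print([v["id"] for v in listing if is_candidate(v)])  # -> [1]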

The resulting list of candidate videos was downloaded from Vimeo using an open-source video download utility. The download was performed sequentially in order not to cause unnecessary load on the side of the platform. All downloaded videos were subsequently checked to ensure that they could be properly decoded by a commonly used video decoding utility.
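
The following sketch illustrates such a sequential download-and-verify loop. The concrete tools are not named in this text, so youtube-dl and FFmpeg are assumed here as stand-ins for the download and decoding utilities, respectively; the URLs are hypothetical:

    import glob
    import subprocess

    def download(url: str) -> None:
        """Fetch a single video; calls are made sequentially on purpose."""
        subprocess.run(["youtube-dl", "-o", "%(id)s.%(ext)s", url], check=True)

    def decodes_cleanly(path: str) -> bool:
        """True if ffmpeg can fully decode the file without errors."""
        result = subprocess.run(
            ["ffmpeg", "-v", "error", "-i", path, "-f", "null", "-"],
            capture_output=True, text=True)
        return result.returncode == 0 and not result.stderr

    for url in ["https://vimeo.com/123", "https://vimeo.com/456"]:  # hypothetical
        download(url)
    for path in glob.glob("*.mp4"):
        if not decodes_cleanly(path):
            print("decode check failed:", path)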

The videos were segmented and analyzed using the open-source content-based video retrieval engine Cineast [6]. Videos whose distribution of segment lengths differed sufficiently from the collection mean were flagged for manual inspection, since this indicates either very low visual diversity (mostly static frames) or very high visual diversity (very noisy videos). During this step, the videos were also checked to ensure that the collection contains no exact duplicates.
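
The exact flagging criterion is not specified here; the sketch below assumes one plausible reading, flagging a video when the z-score of its mean segment length with respect to the collection-wide distribution exceeds a threshold:

    import statistics

    def flag_for_inspection(segment_lengths: dict, z_threshold: float = 3.0) -> list:
        """Flag videos whose mean segment length is far from the collection mean.

        segment_lengths maps a video id to its list of segment durations;
        the z-score statistic and the threshold of 3 are assumptions.
        """
        means = {vid: statistics.fmean(ls) for vid, ls in segment_lengths.items()}
        mu = statistics.fmean(means.values())
        sigma = statistics.stdev(means.values())
        return [vid for vid, m in means.items() if abs(m - mu) > z_threshold * sigma]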

From the remaining videos, three subsets of increasing size were randomly selected. Sequential numerical ids were assigned to the selected videos such that the first id in the second part is one larger than the last id in the first part, and so on, in order to facilitate situations in which multiple parts are used in conjunction.
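
This id scheme can be illustrated with a short sketch; the starting id of 1 and the video handles are assumptions:

    def assign_ids(partitions, start=1):
        """Assign ids so each partition continues where the previous one ended."""
        ids, next_id = {}, start
        for part in partitions:
            for video in part:
                ids[video] = next_id
                next_id += 1
        return ids

    parts = [["a", "b"], ["c"], ["d", "e"]]  # hypothetical video handles
    print(assign_ids(parts))  # {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}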

Fig. 1. Removed categories and subcategories are emphasized.

3 The Vimeo Creative Commons Collection

The following provides an overview of the structure as well as various technical and semantic properties of the Vimeo Creative Commons Collection.

3.1 Collection Structure

The collection consists of 28’450 videos with a duration between 3 and 60 min each and a total combined duration of slightly above 3’800 h, divided into three partitions. Table 1 provides an overview of the three partitions. Similar to the IACC, the V3C also includes a master shot reference which segments every video into sequential, non-overlapping parts based on its visual content. For every one of these parts, a full-resolution representative keyframe as well as a thumbnail image of reduced resolution is provided. Additionally, there are metadata files containing both technical and semantic information for every video, also obtained from Vimeo.

Table 1. Overview of the partitions of the V3C

Every video in the collection has been assigned a sequential numerical id, and these ids are used for all aspects of the collection. Figure 2 illustrates the directory structure used to organize the different aspects of the collection; this structure is identical for all three partitions. The info directory contains one JSON file per video which holds the metadata obtained from Vimeo, comprising both semantic information, such as video title, description and associated tags, and technical information, including video duration, resolution, license and upload date. The msb directory contains, for each video, a file in tab-separated format which lists the temporal start and end positions of every automatically detected segment in the video. The keyframes and thumbnails directories each contain one subdirectory per video, holding one representative frame per video segment in PNG format; the keyframes are kept in the original video resolution, while the thumbnails are downscaled to a width of 200 pixels. Finally, the videos directory contains a subdirectory per video, each of which contains the video itself as well as the video description and a file with technical information describing the download process.
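
The following sketch reads one video’s metadata and shot segments from this layout. The per-video file names and the exact column order of the tab-separated segment files are assumptions and should be checked against the shipped data:

    import csv
    import json
    from pathlib import Path

    def load_video(root: Path, video_id: str):
        """Return (metadata, segment list, keyframe directory) for one video."""
        with open(root / "info" / f"{video_id}.json", encoding="utf-8") as f:
            meta = json.load(f)  # title, description, tags, duration, ...
        segments = []
        with open(root / "msb" / f"{video_id}.tsv", newline="", encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                segments.append((row[0], row[1]))  # temporal start and end
        return meta, segments, root / "keyframes" / video_id  # one PNG per segment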

Table 2. Overview of the detected languages in the video title and description of the V3C in percent

3.2 Statistical Properties

The following presents an overview of the distribution of selected properties throughout the collection.

The age distribution of the videos in the entire collection, as determined by the upload date of each individual video, is illustrated in Fig. 3. It is shown in comparison to the distribution originally presented in [7] for a large sample of Vimeo in general. The trace representing the V3C is less clean than the one for the Vimeo dataset due to the large difference in the number of data points. It can nevertheless be seen that both traces have a similar overall shape, at least for the parts of the plot where data is available for both. Unlike the Vimeo dataset from [7], the collection of which was completed in mid-2016, the V3C includes videos uploaded as late as early 2018, which explains the difference in shape towards the right side of the plot.

Fig. 2. Directory structure of the V3C

Fig. 3. Daily relative video uploads from the V3C and the Vimeo dataset

The distributions of video duration and resolution are shown in Figs. 4 and 5, respectively, again in comparison to the larger Vimeo distributions. It can be seen that, wherever there were no additional restrictions, the properties of the V3C follow those of the overall Vimeo dataset rather closely. At least in terms of these three properties (upload date, duration and resolution), the V3C can therefore be considered reasonably representative of the type of web video generally found on Vimeo.

Fig. 4. Scatter plot showing the duration of videos from the V3C and the Vimeo dataset

Fig. 5. Distribution of video resolutions in the V3C

An overview of the languages detected based on the titles and descriptions of the videos, using the same method as employed in [7], can be seen in Table 2. It shows the top-10 languages for the V3C and the dataset from [7]. The column labeled ‘?’ represents the instances where language detection did not yield any result. It can be seen that, for the videos whose titles and descriptions were distinct enough for language detection, the distribution within the V3C is similar to that of the Vimeo dataset. No language analysis based on the audio data of the videos has been performed yet.
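
As an illustration, the sketch below runs language detection over concatenated titles and descriptions. The detector actually used in [7] is not specified here, so the langdetect package serves as a stand-in; videos for which detection fails are counted under ‘?’, and the example inputs are hypothetical:

    from collections import Counter

    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # make the detector deterministic

    def video_language(title: str, description: str) -> str:
        try:
            return detect(f"{title} {description}".strip())
        except LangDetectException:
            return "?"  # text not distinct enough for detection

    videos = [("A walk through Basel", "Short travel documentary."), ("", "")]
    counts = Counter(video_language(t, d) for t, d in videos)
    print({lang: 100.0 * n / len(videos) for lang, n in counts.items()})  # percent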

Table 3 shows the categories and the number of videos per collection part that have been assigned to a particular category on Vimeo. Since every video can be assigned to multiple categories, the numbers shown in the table do not sum to the total number of videos. Although the category structure implies a hierarchy, a video can be assigned to both a category and one of its subcategories, but does not have to be. The large number of categories in use shown in the table indicates the wide range of content found in the collection.

3.3 Possible Uses

Due to the large diversity of its video content, the collection can be useful for video-related applications in multiple areas. The large number of different video resolutions, and to a lesser extent frame rates, makes the dataset interesting for video transport and storage applications such as the development of novel encoding schemes, streaming mechanisms or error-correction techniques.

Its large variety in visual content also makes the dataset interesting for various machine learning and computer vision applications.

Finally, the collection has applications in the areas of video analysis, retrieval and exploration. We can imagine four application areas in the video retrieval space in particular. First, video tagging or high-level feature detection: given a video segment or shot, a system should output all relevant tags and visual concepts appearing in it. Such a task is fundamental to any video search engine that matches user queries against a video dataset to retrieve the most relevant results. Second, ad-hoc video search: a system takes a user’s text query, formulated as a natural language sentence, and returns the set of videos that best satisfies the information need expressed in the query. This task is likewise necessary for any search system dealing with real users, since the system has to understand the user’s query and intention before retrieving matching results. Third, known-item search: trying to find a video or video segment which one believes to have seen but whose name one does not recall. Queries are created based on some knowledge of the collection, such that with high probability only one video or video segment satisfies the search. Fourth, video captioning or description, which has gained a lot of attention in recent years: here, a system must describe a video segment in textual form covering all important facets such as ‘who’, ‘what’, ‘where’ and ‘when’, essentially producing a textual summary of the video. Since the V3C includes a master shot reference splitting each video into smaller shots, the captioning task can be run on these short shots, as the current state of the art cannot yet produce a coherent, human-readable textual description of an entire longer video. A sketch of cutting a video into such shots is given below.
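
The following sketch cuts a video into its master-shot segments, e.g. as preprocessing for per-shot captioning. FFmpeg is assumed to be available, the segment boundaries are assumed to be given in seconds (to be verified against the shipped segmentation files), and the file names are hypothetical:

    import subprocess

    def extract_shot(video: str, start: float, end: float, out: str) -> None:
        """Cut one shot out of a video without re-encoding."""
        subprocess.run([
            "ffmpeg", "-y",
            "-ss", str(start),       # seek to the shot start
            "-i", video,
            "-t", str(end - start),  # keep only the shot duration
            "-c", "copy",            # stream copy; cuts snap to keyframes
            out,
        ], check=True)

    shots = [(0.0, 4.2), (4.2, 9.7)]  # hypothetical boundaries for one video
    for i, (start, end) in enumerate(shots):
        extract_shot("00001.mp4", start, end, f"00001_shot{i:04d}.mp4")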

Table 3. Category assignment per video and collection part

3.4 Availability

We plan to launch and make available this collection at the 2019 TRECVid video retrieval benchmark, in which different research groups participate in one or more tracks. In addition, the collection will be used at the Video Browser Showdown (VBS) [3], which collaborates with TRECVid in organizing the Ad-hoc Video Search task. The collection will be available for download to the benchmark participants as well as to the public. After the annual benchmark cycle has concluded, we will also provide the ground-truth judgments and queries/topics for the tasks that used the V3C collection, so that research groups can reuse the dataset in local experiments and reproduce results.

4 Conclusions

In this paper, we introduced the Vimeo Creative Commons Collection (V3C). It comprises roughly 3’800 h of Creative Commons video obtained from the web video platform Vimeo, augmented with technical and semantic metadata as well as shot boundary information and accompanying keyframes. The V3C is subdivided into three partitions of increasing length, from roughly 1’000 h up to 1’500 h, so that the collection can be used for at least three consecutive years in a video search benchmark of increasing complexity. Information on where to download the V3C collection and/or its partitions will be made available together with the publication of the video search benchmark challenges.