1 Introduction

In this paper we present the current iteration of vitrivr [6], an open-source content-based multimedia retrieval stack. The vitrivr stack is the continuation of the IMOTION system [3, 5, 7, 8], which participated in previous iterations of the Video Browser Showdown [1]. Although this iteration offers some new functionality, the primary focus of this year's participation lies in simplifying and generalizing the retrieval stack in order to make it easier for both experts and laymen to adapt, deploy, and use. The vitrivr stack is available in its entirety from https://vitrivr.org.

The remainder of this paper is structured as follows: Sect. 2 provides a brief overview of the overall system architecture and Sect. 3 summarizes all query types supported by vitrivr. Section 4 provides details on the functionalities introduced in the current version. In Sect. 5, we briefly outline our reasoning behind the open sourcing of vitrivr and Sect. 6 concludes.

2 Architectural Overview

The vitrivr stack – like its predecessor IMOTION – consists of three primary system components: the storage layer \(\textsf {ADAM}_{{pro}}\) [2], the retrieval engine Cineast [4], and a browser-based user interface. In addition, a web server is used to serve static content such as videos and thumbnail images. Further details on the architecture of the entire stack can be found in [6].

3 Interaction and Query Types

The vitrivr stack offers various ways in which queries can be specified. Most of them fall into one of two categories: visual and textual. The visual query modes include Query-by-Sketch and Query-by-Example as well as Relevance Feedback, which are based on visual input such as user-generated sketches of a scene or one or more previously retrieved scenes. These queries are evaluated against data extracted directly from the video frames. The textual queries are based on information which can be extracted from the video content and represented as text, such as spoken language, text on screen, or the provided textual video metadata. For this, we use the ASR data provided with the video data set as well as several object detectors to produce labels for the shots. OCR is applied in order to make text which appears on screen searchable as well.

4 New Functionality

While the IMOTION system that participated in previous instances of VBS was always a specialized piece of software, purpose-built for the competition, the functionality we added to vitrivr in preparation for this iteration of VBS is also useful for other use cases of vitrivr.

4.1 New User Interface

The most salient difference from the IMOTION system of the previous year is the new user interface. While still browser-based, the latest iteration of the UI is built on the Angular framework. Its modular structure makes it easy to customize the entire UI or parts thereof to shift its focus from general-purpose multimedia retrieval to, in this case, competitive video retrieval.

As in the past, the UI supports result streaming, so that partial results can be presented to the user while the query is still being processed by the backend. However, it no longer achieves this via AJAX requests but uses a WebSocket connection to Cineast instead. A REST API is also available. The stack still includes a web server which provides static content such as shot thumbnails and the videos themselves, but this server is no longer required to act as a proxy between the browser and Cineast. The screenshot in Fig. 1 depicts the current version of the UI.

Fig. 1. Screenshot of the vitrivr UI for large-scale competitive video search
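
The result streaming described in Sect. 4.1 can be illustrated with a minimal sketch of how a client might fold partial result batches into a ranked list as they arrive. The message layout (`type`, `results`, `shot`, `score`) and the choice to keep the highest score per shot are assumptions made for this illustration only; they do not reflect the actual Cineast message format or its score fusion. The network transport is omitted, and the messages are passed in as plain JSON strings.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def merge_streamed_results(
    messages: Iterable[str],
) -> Iterator[List[Tuple[str, float]]]:
    """Yield an updated ranking after every partial result batch.

    Hypothetical message format (an assumption, not the Cineast protocol):
      {"type": "partial", "results": [{"shot": "<id>", "score": <float>}]}
      {"type": "end"}
    """
    best: Dict[str, float] = {}
    for raw in messages:
        message = json.loads(raw)
        if message["type"] == "end":
            break  # the backend signals that no further batches follow
        for item in message["results"]:
            shot_id = item["shot"]
            # Keep the highest score seen so far for each shot (assumed fusion).
            best[shot_id] = max(best.get(shot_id, 0.0), item["score"])
        # Emit the current ranking so the UI can refresh immediately.
        yield sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

A caller would iterate over the generator and redraw the result list on every yielded ranking, which is what allows partial results to appear before the query has finished.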

4.2 Approximate Retrieval

The underlying storage engine \(\textsf {ADAM}_{{pro}}\) [2] supports multiple index structures for efficient vector space retrieval. Many of these index structures achieve their high efficiency by approximating results rather than producing the true nearest neighbors of a query vector. In previous system iterations, we only made use of exact query results, which led to longer query times. In the current iteration of the system, the choice as to whether exact or approximate queries should be used can be made at query time. Hence, the user can sacrifice some accuracy to gain major speed-ups.
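
The trade-off between exact and approximate retrieval can be sketched with a toy example. The class below is purely illustrative and is not the API or any index structure of \(\textsf {ADAM}_{{pro}}\): it assumes a simple pivot-based partitioning, in which an approximate query scans only the bucket of the pivot closest to the query, while an exact query scans all vectors. The names `ToyVectorStore`, `knn`, and `approximate` are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyVectorStore:
    """Illustrative store with a per-query exact/approximate switch."""

    def __init__(self, vectors: List[Vector], num_pivots: int = 4, seed: int = 0):
        self.vectors = vectors
        rng = random.Random(seed)
        # Choose a few stored vectors as pivots and assign every vector
        # to the bucket of its nearest pivot.
        self.pivots: List[int] = rng.sample(range(len(vectors)), num_pivots)
        self.buckets: Dict[int, List[int]] = {p: [] for p in self.pivots}
        for i, v in enumerate(vectors):
            self.buckets[self._nearest_pivot(v)].append(i)

    def _nearest_pivot(self, v: Vector) -> int:
        return min(self.pivots, key=lambda p: euclidean(v, self.vectors[p]))

    def knn(self, query: Vector, k: int, approximate: bool = False) -> List[Tuple[int, float]]:
        """Return up to k (index, distance) pairs, closest first."""
        if approximate:
            # Scan one bucket only: faster, but true neighbors may be missed.
            candidates = self.buckets[self._nearest_pivot(query)]
        else:
            # Scan everything: slower, but guaranteed to be exact.
            candidates = range(len(self.vectors))
        scored = [(i, euclidean(query, self.vectors[i])) for i in candidates]
        scored.sort(key=lambda pair: pair[1])
        return scored[:k]
```

Calling `store.knn(q, 10)` and `store.knn(q, 10, approximate=True)` on the same store shows the effect: the approximate call inspects only a fraction of the vectors, and its result may omit some of the true nearest neighbors.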

5 Open Source

The entire vitrivr stack [6] is published under the MIT license. The source code of all its components is available from their individual GitHub repositories, and additional documentation can be found on https://vitrivr.org. Being a general-purpose multimedia retrieval stack, vitrivr has many applications outside of competitive video retrieval, as it also supports other media types such as images, audio, and 3D models. With this flexible open-source stack, we hope to offer the community a basis for future research in many areas and domains of multimedia retrieval.

6 Conclusions

With the competitive video retrieval version of vitrivr, we plan to continue the successful participation we had with the IMOTION system in the past. It is our hope that by publishing the entire retrieval stack as open-source software, we lower the entry hurdle for future participants and accelerate the testing of new ideas in the context of large-scale video retrieval.