
1 Introduction

With the ever-growing volume of multimedia data, means for efficient and effective search in such multimedia collections are a necessity. At the annual Video Browser Showdown (VBS) [15] – a competition-style evaluation campaign in the domain of interactive video search – interactive multimedia retrieval systems compete against each other in a pseudo-realistic setting. In particular, Known-Item Search (KIS) and Ad-hoc Video Search (AVS) are the task categories of VBS [8]. The former consists of two sub-categories, each targeting a single video shot of about 20 s, described by textual and visual hints, respectively. The latter features broader query terms without known ground truth, and human judges decide whether a submitted shot meets the task criteria. VBS operates on the Vimeo Creative Commons Collection (V3C) [14], in particular on V3C’s first and second shards [1, 13], amounting to approximately 2300 h of video content. In addition, a new, more homogeneous “marine video” dataset [17] of more than 11 h complements the previous video data for more challenging tasks.

Inspired by vitrivr [3, 6], a long-running participant at VBS, we present in this paper 4MR (Multi-Modal MultiMedia Retrieval), a novel open-source multi-modal multimedia retrieval system with a focus on retrieval blocks. In particular, the 4MR system empowers users to express multi-modal queries and to freely combine them with Boolean logic. Like the top-ranked teams from previous iterations [7], which have implemented CLIP [10], we also rely on CLIP as the primary feature extraction method. Furthermore, we support deep learning-based OCR and ASR methods and extract concept labels with ResNet50 [5].

The remainder of this paper is structured as follows: Sect. 2 introduces the concepts of retrieval blocks and how to formulate queries, Sect. 3 gives insights on the implementation, and Sect. 4 concludes.

2 Retrieval Blocks

In order to efficiently and effectively search in video collections, three search paradigms have proven to be successful [7]: (i) (extended) Boolean search, (ii) vector-based text search, and (iii) kNN search. We employ all three paradigms and build multi-modal queries on top of them: Boolean search is used to efficiently query provided metadata such as collection affiliation; we plan to use this, for instance, to restrict a query to V3C shards one and two or to the Marine Video Kit. kNN search is used to query deep learning-based feature data such as CLIP representations, and vector-based text search is applied, for example, to OCR data.
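To illustrate the Boolean metadata paradigm, the following Kotlin sketch restricts a query to selected collections with a plain SQL filter against Postgres. The table and column names (segments, segment_id, collection) as well as the connection details are illustrative assumptions and not part of 4MR’s actual schema.

```kotlin
import java.sql.DriverManager

// Hypothetical schema: a `segments` table with a `segment_id` and a `collection`
// column (e.g., "V3C1", "V3C2", "MVK"); all names are illustrative.
fun segmentsInCollections(collections: List<String>): List<String> {
    require(collections.isNotEmpty()) { "At least one collection must be given." }
    val url = "jdbc:postgresql://localhost:5432/fourmr" // placeholder connection details
    val ids = mutableListOf<String>()
    DriverManager.getConnection(url, "user", "password").use { conn ->
        val placeholders = collections.joinToString(",") { "?" }
        val sql = "SELECT segment_id FROM segments WHERE collection IN ($placeholders)"
        conn.prepareStatement(sql).use { stmt ->
            collections.forEachIndexed { i, c -> stmt.setString(i + 1, c) }
            val rs = stmt.executeQuery()
            while (rs.next()) ids.add(rs.getString("segment_id"))
        }
    }
    return ids
}
```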

At its core, our multi-modal query model consists of so-called Query Statements. A Query Statement is the smallest unit in our model and can be one of the following: (i) textual search, using either Boolean or kNN search, (ii) metadata search, using Boolean search, or (iii) visual search, using kNN search.
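One possible way to represent these Query Statements in code is a small sealed hierarchy, as sketched below in Kotlin. The type and property names are our own illustrative choices and do not reflect 4MR’s actual API.

```kotlin
// Minimal sketch of the three Query Statement types.
sealed interface QueryStatement

// (i) Textual search: either a Boolean/text match on stored text (e.g., OCR or ASR
// transcripts) or a kNN search on a text embedding (e.g., a CLIP text vector).
data class TextStatement(val text: String, val useKnn: Boolean) : QueryStatement

// (ii) Metadata search, evaluated as a Boolean predicate (e.g., collection = "MVK").
data class MetadataStatement(val field: String, val value: String) : QueryStatement

// (iii) Visual search by example, evaluated as a kNN search on a feature vector.
data class VisualStatement(val queryVector: FloatArray) : QueryStatement
```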

Two Query Statements can be linked with a so-called Statement Linking Operator, either the logical AND or OR. One or two Query Statements form a Query Block, with or without a Statement Linking Operator. Query Blocks can subsequently be nested arbitrarily, and their relationship to each other is likewise defined by Statement Linking Operators. Ultimately, one or more Query Blocks form a so-called Retrieval Statement, and Retrieval Statements are put into relation with so-called Retrieval Linking Operators. One such Retrieval Linking Operator defines an order, so that the prerequisites of one Retrieval Statement must be met before the next one is applied, effectively creating stages. Another Retrieval Linking Operator could express a temporal relation between Retrieval Statements.
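Building on the Query Statement sketch above, the structure of Query Blocks and Retrieval Statements could be captured as follows. Again, all names are illustrative, and the set of Retrieval Linking Operators shown here merely mirrors the two examples given in the text.

```kotlin
// Sketch of the block structure; builds on the QueryStatement types above.
enum class StatementLinkingOperator { AND, OR }

// A Query Block is either a single Query Statement or two nested blocks joined
// by a Statement Linking Operator.
sealed interface QueryBlock
data class Leaf(val statement: QueryStatement) : QueryBlock
data class Composite(
    val left: QueryBlock,
    val operator: StatementLinkingOperator,
    val right: QueryBlock
) : QueryBlock

// Retrieval Linking Operators relate whole Retrieval Statements to each other,
// e.g., staged evaluation (the next stage only considers results that satisfy
// the previous one) or a temporal relation between the matched shots.
enum class RetrievalLinkingOperator { STAGED, TEMPORAL }

data class RetrievalStatement(val root: QueryBlock)
data class RetrievalQuery(
    val statements: List<RetrievalStatement>,
    val links: List<RetrievalLinkingOperator> // one operator between each pair of adjacent statements
)
```

A query such as “shots from the Marine Video Kit that show a boat and either contain the word ‘harbour’ on screen or mention it in speech” would, for instance, nest two textual statements in one block (linked by OR) and combine that block with a metadata statement (linked by AND).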

This nesting of Query Blocks and their defined relation to one another is sufficient to formulate arbitrarily complex queries as needed for interactive search. While the model would also allow for interactive search in text-based and inherently multimedia collections, this functionality cannot be used at VBS, as the competition is restricted to video search. Nevertheless, we intend to use Query Blocks to express complex queries.

3 Implementation

The 4MR system follows a three-tier architecture [16], as shown in Fig. 1. The storage layer consists of a Postgres database for textual information (textual and Boolean search), a CottontailDB database [2] for vectors (kNN search), and the file system for the actual media files. Multiple storage systems are used to exploit their respective strengths: Postgres for efficient Boolean and text search and CottontailDB for kNN search. In what follows, we describe the retrieval engine (Sect. 3.1), explain which features we search for (Sect. 3.2), and finally introduce the user interface (Sect. 3.3).

Fig. 1. Architecture diagram of the 4MR system.
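To make the division of labour between the two stores concrete, the following sketch routes the Query Statement types from Sect. 2 to the backend that can evaluate them. The store interfaces, table names, and the text encoder are placeholders and do not correspond to the actual 4MR, Postgres, or CottontailDB APIs.

```kotlin
// Placeholder backend interfaces; parameter binding and error handling are omitted.
interface BooleanStore { fun evaluate(sql: String): List<String> }          // backed by Postgres
interface VectorStore { fun knn(vector: FloatArray, k: Int): List<String> } // backed by CottontailDB
interface TextEncoder { fun encode(text: String): FloatArray }              // e.g., a CLIP ViT-B/32 text encoder

class Dispatcher(
    private val postgres: BooleanStore,
    private val cottontail: VectorStore,
    private val clip: TextEncoder
) {
    // Route each statement to the store that supports its search paradigm.
    fun execute(statement: QueryStatement): List<String> = when (statement) {
        is MetadataStatement ->
            postgres.evaluate("SELECT segment_id FROM segments WHERE ${statement.field} = '${statement.value}'")
        is TextStatement ->
            if (statement.useKnn) cottontail.knn(clip.encode(statement.text), k = 1000)
            else postgres.evaluate("SELECT segment_id FROM text_features WHERE content ILIKE '%${statement.text}%'")
        is VisualStatement ->
            cottontail.knn(statement.queryVector, k = 1000)
    }
}
```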

3.1 Retrieval Engine

Written in Kotlin, the retrieval engine communicates with the storage tier via a JDBC adapter for Postgres and a gRPC client for CottontailDB. The entire parsing of queries, as well as the relaying of the appropriate parts to the underlying storage systems according to the query type, is handled in the retrieval engine. Ultimately, the results of the storage components are fused into a single ranked result list and sent to the user interface. All functionality of the retrieval engine is exposed to the front end via a REST API.
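The fusion step could, for instance, look as follows: each backend returns a scored list per statement, the scores are normalised per list, and segments are ranked by their accumulated score. This is a minimal sketch under our own assumptions; it does not prescribe 4MR’s actual scoring or weighting scheme.

```kotlin
// Minimal late-fusion sketch: merge per-statement result lists into one ranking.
data class ScoredSegment(val segmentId: String, val score: Double)

fun fuse(resultLists: List<List<ScoredSegment>>): List<ScoredSegment> {
    val combined = mutableMapOf<String, Double>()
    for (list in resultLists) {
        if (list.isEmpty()) continue
        // Normalise each list's scores to [0, 1] so that results from different
        // backends (e.g., text scores vs. vector distances) become comparable.
        val min = list.minOf { it.score }
        val max = list.maxOf { it.score }
        val range = (max - min).takeIf { it > 0.0 } ?: 1.0
        for (r in list) {
            val norm = (r.score - min) / range
            combined[r.segmentId] = (combined[r.segmentId] ?: 0.0) + norm
        }
    }
    return combined.map { (id, score) -> ScoredSegment(id, score) }
        .sortedByDescending { it.score }
}
```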

3.2 Video Analysis

Due to the success of deep learning-based video analysis methods for feature extraction, our system employs four such models. In particular, we use Contrastive Language-Image Pre-training, better known as CLIP, which relates image content to natural-language descriptions of it [9,10,11]. It builds on a large body of work on zero-shot transfer, natural language supervision, and multi-modal learning. We use CLIP, or more precisely the “ViT-B/32” model from OpenAI, to provide visual concept search and similarity search. For visual concept search, the text input is encoded with the model; the resulting vector is then used for a kNN search, which returns the best-fitting images, i.e., those with the smallest distances. For similarity search, the kNN search is performed directly on the 512-dimensional feature vectors. Besides CLIP, a video’s textual and audio content is often of interest for queries. We support optical character recognition (OCR) and automatic speech recognition (ASR) queries in these two domains, respectively. The features used here are the same as presented by our inspiration, vitrivr [12], and rely entirely on text search.
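The following sketch outlines the visual concept search path: a query text is mapped to a 512-dimensional CLIP embedding and compared against pre-extracted frame embeddings. The encoder is a placeholder (in practice, the ViT-B/32 model runs in a separate extraction component), and the brute-force kNN shown here stands in for the search that 4MR delegates to CottontailDB.

```kotlin
import kotlin.math.sqrt

// Placeholder for the CLIP ViT-B/32 text encoder; returns a 512-dimensional embedding.
fun encodeQueryText(query: String): FloatArray = FloatArray(512)

fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
    var dot = 0.0; var normA = 0.0; var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB) + 1e-10)
}

// Brute-force kNN over frame embeddings keyed by segment id; in 4MR this step is
// delegated to CottontailDB rather than computed in the engine itself.
fun conceptSearch(query: String, frameEmbeddings: Map<String, FloatArray>, k: Int): List<String> {
    val q = encodeQueryText(query)
    return frameEmbeddings.entries
        .sortedByDescending { cosineSimilarity(q, it.value) }
        .take(k)
        .map { it.key }
}
```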

Last but not least, we support the notion of concepts recognised within videos, so-called tags. Using a residual network with 50 layers, ResNet50 [4], we obtain tags together with the confidence with which the network assigned them, both of which can be used in query formulation.
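As a simple illustration of how such tags might be used in a query, the sketch below filters segments by a tag label and a minimum confidence; the data layout and the default threshold are our own assumptions.

```kotlin
// Illustrative tag representation: each segment carries labels with a confidence score.
data class Tag(val segmentId: String, val label: String, val confidence: Double)

// Return segment ids whose tag matches the requested label with sufficient confidence,
// ranked by confidence; the 0.5 default threshold is an arbitrary example value.
fun filterByTag(tags: List<Tag>, label: String, minConfidence: Double = 0.5): List<String> =
    tags.filter { it.label.equals(label, ignoreCase = true) && it.confidence >= minConfidence }
        .sortedByDescending { it.confidence }
        .map { it.segmentId }
```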

3.3 User Interface

Heavily inspired by vitrivr’s frontend vitrivr-ng, our Angular-based frontend is likewise divided into a query sidebar and a central result presentation area (see Fig. 2).

Fig. 2. A screenshot of the 4MR system in action. The left side features the query formulation area, and the center is used for result presentation.

4 Conclusion

We introduce 4MR, a new, vitrivr-inspired system with a focus on query formulation, to participate in the Video Browser Showdown 2023. The contribution is twofold: on the one hand, we describe a query formulation concept which empowers users to combine Query Blocks and to freely define their relationships. On the other hand, we provide an implementation of this concept in order to be able to evaluate our system in the competitive setting of VBS. Like the top-ranked teams in the last instances of VBS, our query formulation methodology builds on deep learning-based features, particularly CLIP.