1 Introduction

VERGE is an interactive video retrieval system that provides various search capabilities over a collection of images, along with a Web Application for creating and running queries on the collection, viewing the top results and submitting the appropriate ones. Over the past few years, VERGE has participated many times in the Video Browser Showdown (VBS) competition [9], each time adapting further to the competition’s “Ad-Hoc Video Search” (AVS) and “Known Item Search - Visual/Textual” (KIS-V, KIS-T) tasks. This year, the various search modules have been further improved and the Web Application has been updated in order to be even more user-friendly and faster to use.

The paper is structured as follows: Sect. 2 describes the overall framework of the system, Sect. 3 continues with a detailed description of the various retrieval modalities, Sect. 4 presents the user interface (UI) and its features, and Sect. 5 concludes with a brief description of future work.

2 The VERGE Framework

As shown in Fig. 1, the VERGE framework is composed of three layers. The first layer contains all the retrieval modalities that are applied on the datasets, i.e. V3C1, V3C2 [18] and the Marine Video Dataset. Their outcomes are stored in a database, except for those of the Text to Video Matching module, which runs only on-the-fly. The second layer consists of the various services that accept queries and return the top results. The third layer is the Web Application that allows users to formulate and send queries, connects to the corresponding services and displays the results.

Fig. 1. The VERGE framework
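To illustrate the layering described above, the following minimal sketch shows how a second-layer service might answer a query either from precomputed results stored in the database or, for free-text queries, by invoking the on-the-fly text-to-video module. All names and the data layout are hypothetical, as the actual service API is not described here.

```python
# Hypothetical sketch of a second-layer service; the function, database and
# field names are illustrative assumptions, not VERGE's actual implementation.

def handle_query(query, db, text_to_video_model, top_k=1000):
    """Return the top-k shot identifiers for a user query."""
    if query["type"] == "free_text":
        # The Text to Video Matching module is the only one that runs on-the-fly.
        return text_to_video_model.rank(query["text"], top_k=top_k)
    # All other modalities are precomputed offline and stored in the database.
    rows = db.fetch(modality=query["type"], label=query["value"])
    ranked = sorted(rows, key=lambda r: r["score"], reverse=True)
    return [r["shot_id"] for r in ranked[:top_k]]
```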

3 Retrieval Modalities

3.1 Concept-Based Retrieval

This module annotates each keyframe of a shot with labels from a pool of concepts, which comprises 1000 ImageNet concepts, a selection of 298 concepts from the TRECVID SIN task [14], 500 event-related concepts, 365 scene classification concepts, 580 object labels, as well as 22 sports classification labels. To obtain the annotation scores for the 1000 ImageNet concepts, we used an ensemble method, averaging the concept scores from two pre-trained models that employ different DCNN architectures, i.e. EfficientNet-B4 [20] and InceptionResNetV2. To obtain scores for the subset of the TRECVID SIN task, we trained and employed a model based on the EfficientNet-B4 architecture on the official SIN task dataset. For the event-related concepts, we used the pre-trained model of EventNet [7]. Regarding the extraction of the scene-related concepts, we utilized the publicly available VGG16 model, fine-tuned on the Places365 dataset. Object detection scores were extracted using models pre-trained on the MS COCO and Open Images V4 datasets, with 80 and 500 detectable objects, respectively. To label sports in video frames, we constructed a custom dataset of Web images depicting sports and used it to train a model of the EfficientNet-B3 architecture. Finally, to offer a cleaner representation of the concept-based annotations, we employed the sentence-BERT [17] text encoding framework to measure the textual similarity between all concept labels. After inspecting the results, we manually formed groups of very similar concepts, created a common label for each group and assigned it the maximum score of its members.
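As a minimal sketch of how scores are combined in this module, assuming the per-model concept scores are already available as arrays and that the groups of near-duplicate labels have been defined beforehand (both assumptions for illustration):

```python
import numpy as np

# Ensemble for the 1000 ImageNet concepts: average the scores of the two
# pre-trained models (EfficientNet-B4 and InceptionResNetV2) per keyframe.
def ensemble_scores(scores_effnet, scores_incresnet):
    return (np.asarray(scores_effnet) + np.asarray(scores_incresnet)) / 2.0

# Merging of very similar concepts (identified via sentence-BERT similarity
# and manual inspection): each group receives a common label and the maximum
# score of its members. `groups` maps a common label to member concept indices.
def merge_similar_concepts(scores, groups):
    return {label: max(scores[i] for i in members) for label, members in groups.items()}
```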

3.2 Spatio-temporal Activity Recognition

The activity detection and recognition module extracts human-related activities for each shot in order to enrich the filtering functionalities with activity labels. A list of 400 pre-defined human-related activities and the corresponding scores were extracted for each shot using a 3D-CNN architecture. Specifically, the 3D-ResNet architecture with 152 layers, configured according to [8], was used, with pre-trained weights learned on the Kinetics-400 dataset [3]. During inference, the input shots fed to the model were pre-processed and downscaled to the model’s input dimensions of \(16 \times 112 \times 112 \times 3\).
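A hedged sketch of this inference-time preprocessing, assuming uniform temporal sampling and bilinear resizing (both illustrative choices that are not specified above):

```python
import numpy as np
from PIL import Image

def preprocess_shot(frames, num_frames=16, size=112):
    """Downscale a shot to the model's input dimensions of 16 x 112 x 112 x 3.

    `frames` is assumed to be a sequence of (H, W, 3) uint8 frames; uniform
    temporal sampling and PIL-based resizing are illustrative choices.
    """
    # Uniformly sample 16 frames across the shot.
    idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    sampled = [
        np.asarray(Image.fromarray(frames[i]).resize((size, size)))
        for i in idx
    ]
    return np.stack(sampled)  # shape: (16, 112, 112, 3)
```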

3.3 Visual Similarity Search

The visual similarity search module takes as input the visual features of each shot and retrieves the most similar content using DCNNs. These features are the output of the last pooling layer of a fine-tuned GoogleNet architecture [15] and are used to represent the images globally. To allow fast and efficient indexing, an IVFADC index is built from these vectors [10].
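A minimal sketch of building such an index with the FAISS library, whose IndexIVFPQ implements the IVFADC scheme; the parameter values below are illustrative defaults rather than the ones used by VERGE.

```python
import faiss  # FAISS provides an IVFADC implementation via IndexIVFPQ

def build_ivfadc_index(features, nlist=256, m=16, nbits=8):
    """Build an IVFADC index over the pooling-layer features.

    `features` is an (n_shots, d) float32 array; nlist, m and nbits are
    illustrative parameter choices.
    """
    d = features.shape[1]
    quantizer = faiss.IndexFlatL2(d)               # coarse quantizer
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
    index.train(features)
    index.add(features)
    return index

# Querying: distances, ids = index.search(query_features, k)
# returns the k most visually similar shots for each query vector.
```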

3.4 Human and Face Detection

The human and face detection module aims to detect and count humans and human faces in each keyframe of a shot, so that the user can easily distinguish the results of single-human or multi-human activities. Both human silhouettes (bodies) and human faces (heads) were detected using a DCNN architecture, namely YoloV4 [1]. The model’s initial weights were learned on the MS COCO [12] dataset and fine-tuned on the CrowdHuman dataset [19], which consists of crowd-centric scenes where partial occlusions among humans or between humans and objects are likely to occur. During inference, the total number of humans and human heads is counted, considering only detections whose bounding box area is larger than a predefined threshold.
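A minimal sketch of the counting step under the stated area threshold, assuming the detector output has already been parsed into a simple list of labeled bounding boxes (the data layout is an assumption):

```python
def count_detections(detections, min_area):
    """Count detected bodies and faces whose bounding box area exceeds a threshold.

    `detections` is assumed to be a list of dicts with a 'class' key
    ('body' or 'face') and a 'box' key holding (x1, y1, x2, y2) pixel
    coordinates; `min_area` is the predefined area threshold.
    """
    counts = {"body": 0, "face": 0}
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        area = max(0, x2 - x1) * max(0, y2 - y1)
        if area > min_area:
            counts[det["class"]] += 1
    return counts
```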

3.5 Text to Video Matching Module

The text-to-video matching module takes as input a complex free-text query and retrieves the most relevant video shots from a large set of available video shots. We utilize the T\(\times \)V model, a cross-modal video retrieval method presented in [6]. The T\(\times \)V model combines multiple textual and visual features with multiple textual encoders to build multiple cross-modal common latent spaces. The network is trained using video-caption pairs and learns to transform these pairs into the multiple common latent spaces, so that a video and a caption can be directly compared in every individual space. Using a multi-loss-based training approach, the network learns the overall similarity by optimizing the individual similarities. For training, a combination of four large-scale video caption datasets is used, i.e. MSR-VTT [22], TGIF [11], ActivityNet [2] and Vatex [21], and the improved marginal ranking loss [4] is used to train the entire network. As initial video shot representations, we utilize three different trained networks: i) the ResNet-152 deep network trained on the ImageNet-11k dataset, ii) the ResNeXt-101 network, pre-trained by weakly supervised learning on web images and fine-tuned on ImageNet [13], and iii) the CLIP model (ViT-B/32) [16]. As textual encoders, the network utilizes i) a feedforward encoder operating on the CLIP-generated embeddings, and ii) the textual sub-network (ATT) presented in [5].
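At query time, the similarities computed in the individual latent spaces have to be combined into one overall text-video score. The sketch below sums per-space cosine similarities, which is one straightforward choice and an assumption here rather than the exact fusion used by the T\(\times \)V model.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def overall_similarity(text_embeddings, video_embeddings):
    """Combine per-space similarities into one overall text-video score.

    Both arguments map each common latent space to the corresponding
    embedding; summing per-space cosine similarities is an illustrative
    assumption.
    """
    return sum(
        cosine(text_embeddings[space], video_embeddings[space])
        for space in text_embeddings
    )
```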

3.6 Concept-Based Late Fusion

The concept-based late fusion module returns a list of shots, where each shot contains all the queried concepts. Specifically, the method takes two or more visual concepts (Sect. 3.1) and produces a sorted list of shots via a late fusion method. First, an independent list of shot probabilities is retrieved for each concept, and the intersection of the concepts at shot level is computed, so that every shot in this intersection is associated with a set of probabilities \(P = \{p_i\}_{i=1}^{n}\), one per queried concept. The shots are then sorted using the objective function

$$\begin{aligned} {f(P) = \sum _{i = 1}^{|P|} e^{-p_i} + \sum _{i,j = 1, i \ne j}^{|P|} e^{-|p_i - p_j|}}. \end{aligned}$$
(1)

This function follows the assumption that the higher the concept probabilities of a shot are, the more relevant the shot is. It is computed by applying the inverse exponential function to the individual concept probabilities, as well as to the differences of the probabilities over all pairs of concepts.
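For illustration, a direct implementation of Eq. (1), where `P` holds the probabilities of the queried concepts for a single shot:

```python
import math

def fusion_score(P):
    """Objective function of Eq. (1) for one shot.

    P: list of concept probabilities p_i for this shot, one per queried concept.
    """
    single = sum(math.exp(-p) for p in P)
    pairwise = sum(
        math.exp(-abs(P[i] - P[j]))
        for i in range(len(P))
        for j in range(len(P))
        if i != j
    )
    return single + pairwise
```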

3.7 Temporal Late Fusion

The temporal late fusion module returns a list of tuples of shots, where each tuple contains shots of the same video that cover all the queried concepts in the given order. In particular, the module takes two or more visual concepts (Sect. 3.1) and produces a sorted list without duplicates via a late fusion method. First, a list of shot probabilities is produced for each concept. Next, the intersection of the concepts at video level is computed and, for each video, the first tuple of shots that respects the ordering of the concepts is kept. The tuples are sorted using an objective function that follows the same assumptions as the concept-based late fusion method (Sect. 3.6).
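A hedged sketch of the tuple selection step, assuming each queried concept contributes, per video, a temporally sorted list of shot indices where it appears (the data layout is an illustrative assumption):

```python
def first_ordered_tuple(per_concept_hits):
    """Return, per video, the first tuple of shots respecting the concept order.

    `per_concept_hits` is a list with one entry per queried concept, in the
    given order; each entry maps video_id to a sorted list of shot indices.
    """
    common_videos = set.intersection(*(set(h) for h in per_concept_hits))
    tuples = {}
    for video in common_videos:
        chosen, last_shot = [], -1
        for hits in per_concept_hits:
            # Pick the earliest shot for this concept after the previous concept's shot.
            nxt = next((s for s in hits[video] if s > last_shot), None)
            if nxt is None:
                break
            chosen.append(nxt)
            last_shot = nxt
        if len(chosen) == len(per_concept_hits):
            tuples[video] = tuple(chosen)
    return tuples
```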

4 User Interface and Interaction Modes

The VERGE UI (Fig. 2) is a Web application that allows a user to easily create and run queries on the dataset and view the top results returned by the modalities described in the previous sections. The user can also watch the corresponding video and submit the appropriate data during the VBS competition. The goal of the VERGE UI is to provide a user-friendly, compact, effective and fast tool for searching in image collections.

Fig. 2. The VERGE user interface (Color figure online)

The UI has two main parts: the menu on the left and the results panel on the right. At the top of the menu there is a timer that counts down the remaining time for submission during a VBS task. Below it, there is a slider where the user can adjust the size of the images on the results panel, an undo button for restoring the previous results, and a rerank button for reranking the current results based on the next query, followed by the various search modules. The first module is the free-text search (Sect. 3.5), where the user can type anything in the form of free text, and the second one allows the user to search within a list of pre-extracted concepts and activities (Sects. 3.1, 3.2). Multiple selection is also supported, for late fusion (Sect. 3.6) as well as for temporal fusion (Sect. 3.7) if the corresponding checkbox is checked. Search by the color of the image is possible by coloring a 3\(\,\times \,\)3 grid. Finally, there are search options based on the number of people or faces (Sect. 3.4) visible in an image.

The results panel contains the top results in a grid view. When an image is clicked, a pop-up panel appears showing all the available shots of the corresponding video. When hovering over an image, three buttons appear: one on the bottom-left corner that, when clicked, returns visually similar images (Sect. 3.3), one on the bottom-right corner for submitting the shot, and one on the top-right corner that plays the respective video. Under the video player there is a button for directly submitting the current time of the video.

To demonstrate the features of VERGE, we briefly describe three use cases. For an AVS task that asks for shots of a single person playing guitar, the user can select the concept “playing guitar” and rerank the results by requiring only one person to appear. For a KIS-V query that searches for a video showing grassland at the bottom and white sky at the top, the user can utilize the search by color, painting the first row white and the third row green (Fig. 2). Lastly, for a KIS-T query that asks for a video showing “a man behind bars taking a cookie from a tray held”, the exact words can be used verbatim in the free-text search.

5 Future Work

Every year we try to improve the retrieval performance by reducing the response time and increasing the effectiveness of the algorithms, as well as to make the VERGE UI more intuitive, user-friendly and fast. The whole system will be evaluated in the VBS 2023 competition, and the participation experience will drive the future steps regarding both the search methodologies and the UI.