1 Introduction

While video portals like YouTube can easily grow by hundreds of hours of video content per minute, there is still a shortage of effective video retrieval models providing access to such huge volumes of content. Whereas for domain-specific data it is possible to collect large training datasets and train an effective model, for general-domain datasets it is hard to build a universal search approach. Nevertheless, there are attempts to train models with huge volumes of training pairs (item; text description) [1, 36, 45, 62]. Despite these efforts, it is worth noting that even with a perfect cross-modal text-video search model there are other limitations affecting search effectiveness. Three important factors are the limitations of human memory (i.e., the ability to remember all details), language skills, and the density of items in the dataset (i.e., the ability to identify the correct item within a larger cluster of similar ones).

The analysis of annual reports from TRECVID [57], Video Browser Showdown (VBS) [31], or Lifelog Search Challenge [25] shows that there still exist many practical situations where state-of-the-art retrieval approaches do not provide sufficient results even for far smaller datasets. Hence, it is essential to continue with the development of new video retrieval models/tools as well as with evaluation efforts providing important performance insights.

This paper presents a thorough analysis of the 11\(^{th}\) instance of the Video Browser Showdown competition with participating tools combining interactive search interfaces and ranking approaches based on deep machine learning models. The competition setting consists of a known large video dataset (V3C [66]) and an evaluation methodology allowing a fair comparison of participating systems. During the competition, all the teams have access to a distributed evaluation server [64], where competition tasks are presented to participating teams at the same time and with the same time limit. In particular, three task categories are evaluated [52]:

  • Visual known-item search (KIS-v), where teams observe a target video segment from the collection. No meta-data is provided, no cameras are allowed.

  • Textual known-item search (KIS-t), where teams receive a text description of a target video segment. The text is gradually extended.

  • Ad-hoc video search (AVS) tasks introduced with a short text description, where teams have to submit as many correct shots (matching the description) as possible.

A known-item search task is considered solved by a team once the team submits a correct frame/shotID from the target segment. Incorrect submissions result in a penalty deducted from the score. The evaluation server knows the target segment and thus evaluates submissions in known-item search tasks automatically. In the ad-hoc search category, teams submit all items that team members believe match the description. Since the ground truth is unknown for the whole collection, live judges are necessary to assess the submissions. Teams receive points for each correct submission (merging temporally close submissions into ranges counted only once), independent of whether other teams also found the same segment. The scoring is described in more detail in [31].

With these settings, the Video Browser Showdown hosted sixteen teams during the 2022 International Conference on Multimedia Modeling in Vietnam, where several teams (or members) participated remotely. All teams introduced a unique video search tool and some teams also implemented logging mechanisms. Hence, we collected a non-trivial amount of data from the competition which allows us to present the following key contributions, each in a separate section:

  • A broad survey of multimedia search models and approaches participating in the 11th VBS.

  • An overall summary of the competition results, showing success rates, submission times, and numbers of submissions.

  • A thorough result log analysis of selected teams, revealing performance insights as well as query statistics.

  • An analysis of the ad-hoc search category, showing timeline statistics and presenting a revised version of the task category.

  • A query specification methodology in connection with a qualitative study.

The last section concludes the paper and envisions future settings of the Video Browser Showdown.

2 Related work used by participating systems

VBS 2022 hosted many participating systems, each implementing different ranking models and search methods. A general overview of the systems and the approaches they employed is presented in Table 1. This section further summarizes important or unique methods used by each participant. For additional detail about any system, please see the corresponding publication referenced beside the system name in the overview table.

Table 1 List of participating teams and video search approaches

2.1 Concept search

This section summarizes the concept-based search approaches utilised by participating systems, including concepts detected for the whole image as well as localized information obtained from object detectors or semantic segmentation.

VISIONE [4], as in previous years [2, 3, 5], supports queries by object location appearing in a target scene. That is done by drawing simple diagrams on a canvas to specify objects (including their spatial locations). The object detection technique of VISIONE is based on three pre-trained DCNN models (i.e., VfNet [80], Mask R-CNN [27], Faster R-CNN [22]) for a total of 1460 object classes. Similarly, VERGE employs three different DCNN models (EfficientNet-B3 and EfficientNet-B5 [74] for label-based search of video shots, and InceptionResNetV2 [73]) to enrich keyframes with concept [56] and object [79] annotations.

IVIST [42], AVSeeker [41], and diveXplore [43] use MS-COCO [46]. In particular, IVIST adopts an HTC [14] object detection model pre-trained on MS-COCO and offers an object query function to filter out frames that do not contain the queried object categories. In contrast, AVSeeker [41] indexes object concepts on all keyframes into an Elasticsearch node, using categories from MS-COCO (detected by YOLOv4). This allows users to search for concepts using advanced query formulations such as customized AND/OR operators, fuzzy matching, and negation, provided by Elasticsearch's "Query String Query". Similarly, diveXplore [43] provides object search for object categories from MS-COCO (detected by YOLOv5).
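To make this style of object search more concrete, the following is a minimal, hypothetical sketch of a query string search over per-keyframe object labels; the index name, field name, and elasticsearch-py 8.x client are assumptions for illustration, not AVSeeker's actual configuration.

```python
# Hypothetical sketch of a "query string" search over per-keyframe object
# labels, in the spirit of AVSeeker's Elasticsearch index. The index name,
# field name, and elasticsearch-py 8.x client are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# AND/OR operators, negation, and fuzzy matching are part of the
# query_string syntax itself, e.g. "dog AND (frisbee OR ball) NOT cat".
query_text = "dog AND (frisbee OR ball) NOT cat"

resp = es.search(
    index="keyframes",                   # assumed index of keyframe documents
    query={
        "query_string": {
            "query": query_text,
            "default_field": "objects",  # assumed field with detected labels
        }
    },
    size=100,
)

for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```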

VERGE [6] involves spatio-temporal human activity recognition using a 3D-CNN architecture. This approach relies on a three-step pipeline [24] (object detector, object tracker, and activity recognizer) to identify human-related activities effectively.

As before, VISIONE [4] supports queries by color location [2, 3, 5]. A user can draw simple diagrams on a canvas corresponding to colors appearing in a target scene. For the color annotation, two chip-based color naming techniques [11, 76] are employed. vitrivr maintains similar color-based sketches, as in previous iterations of the system [29, 30, 65, 67].

VIREO [55] supports color sketch queries using a grid of 48 cells (6x8), which users can individually fill with colors corresponding to their search target. Similarly, VERGE [6] maintains color-based queries using a grid of 9 cells (3x3). Video clips matching the colors in these positions are then assigned a higher rank.

IVIST [42] provides a color query function that finds frames whose top three dominant colors are included in the query colors. VideoFall [60] also offers color search based on specific dominant colors. The dominant-color annotation in VideoFall, covering 12 basic colors, is obtained by applying a k-means clustering algorithm to the set of frame-pixel values.
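As an illustration of the dominant-color idea, the following is a minimal sketch of extracting a frame's dominant colors with k-means over its pixel values; the value of k, the downsampling, and the color space are assumptions rather than VideoFall's actual settings.

```python
# Minimal sketch of dominant-color extraction with k-means over frame pixels,
# in the spirit of VideoFall's description. The value of k, the downsampling,
# and the RGB color space are illustrative assumptions.
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

def dominant_colors(image_path, k=3):
    img = Image.open(image_path).convert("RGB").resize((64, 64))  # downsample for speed
    pixels = np.asarray(img, dtype=np.float32).reshape(-1, 3)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    # Order cluster centers by the number of pixels they cover (most dominant first).
    counts = np.bincount(km.labels_, minlength=k)
    order = np.argsort(-counts)
    return [tuple(int(c) for c in km.cluster_centers_[i]) for i in order]
```

A frame could then be mapped to a fixed palette (e.g., 12 basic colors) by nearest-neighbor lookup of these centers, so that a color query matches frames whose dominant colors lie close to the queried color.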

vitrivr [28] supports semantic-based sketches, as in previous iterations of the system. Semantic sketches are based on a DeepLab segmentation model [15] utilized as described in [65].

diveXplore [43] provides concept-based search for concepts in ImageNet-1000 [18] and Places365 (detected by EfficientNet-B2 [74]). Also, Exquisitor and ViRMA both support concept search for 12,988 ImageNet concepts, which were extracted for each keyframe using a pre-trained DCNN ResNet model [59]. To support its browsing and data model in VR, ViRMA further organises these concepts into a hierarchical structure using semantic relationships derived from WordNet [19]. Finally, in addition to the ImageNet concepts, Exquisitor also maintains search for activity concepts of Kinetics-700, extracted from the video shots using a pre-trained 3D-ResNet model [26].

VIREO [55] allows users to perform search using a bank of 16,263 concepts. These are extracted by the concept decoder of the dual task model [78].

VideoFall [60] provides users with textual search; the visible textual information in the frames is extracted using the Google Vision API.

2.2 Embedding

In this section, we discuss joint embedding approaches, which combine text and image/video processing architectures with the objective of mapping the same semantic information to similar vectors.

The top three scoring systems (vibro, CVHunter, and VISIONE), as well as a number of other systems (AVSeeker, V-FIRST, VNUHCM, and AIClub@UIT), all use networks derived from CLIP [62]. Text query phrases are transformed into a joint text-image vector space, where cosine similarity is used for matching. Specifically, VISIONE uses TERN [58] for text-image retrieval.

In addition to this, VISIONE uses CLIP2video [20] for text-video retrieval. Similarly, IVIST and VideoFall also use CLIP-based networks [62], where input text queries are matched with videos in a joint text-video vector space.
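For readers unfamiliar with this retrieval style, the following sketch illustrates CLIP-style text-to-image ranking with cosine similarity, using the publicly available Hugging Face CLIP implementation; the checkpoint and keyframe paths are generic assumptions, not the exact models used by the teams.

```python
# Illustrative sketch of CLIP-style text-to-image retrieval with cosine
# similarity, using the Hugging Face CLIP implementation. The checkpoint and
# the keyframe paths are generic assumptions, not the teams' actual models.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

keyframes = [Image.open(p) for p in ["frame_001.jpg", "frame_002.jpg"]]  # assumed paths
query = "a man riding a red bicycle along a beach"

with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=keyframes, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=[query], return_tensors="pt", padding=True))

# L2-normalize so that the dot product equals cosine similarity.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

scores = (txt_emb @ img_emb.T).squeeze(0)           # one score per keyframe
ranking = scores.argsort(descending=True).tolist()  # best-matching keyframes first
```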

VERGE’s [6] text-video matching module translates a complex textual query and the videos into a joint latent space for direct comparison, utilizing the attention-based dual encoding network [21]. In contrast, VIREO [55] uses the dual-task model [78] for the same text-video retrieval task.

vitrivr and vitrivr-VR both rely on a custom visual-text co-embedding model [72] inspired by approaches like W2VV++ [44]. In comparison to CLIP-based approaches, the embedding models are much simpler, resulting in lower hardware requirements.

2.3 Temporal querying

Since VBS tasks can comprise longer target video sequences (up to 20 s), some systems can address multiple items in the target sequence at once using a temporal query.

vibro [33] employs a two-tab system in order to enable temporal queries. Each tab can hold a query in any of the supported modalities and produces an individual ranking of keyframes. If both tabs contain a query and a result list, consecutive sequences of keyframes from a single video are ranked according to the probability that the sequence contains content from the first tab's result list followed by content from the second tab's within an adjustable time range.

CVHunter [49] supports two options to address a sequence of video segments: a context-aware ranker that supports unordered specification of target segments, and its special case, temporal query [50], where query parts are ordered in the same way as the searched sequence of segments. Both approaches require distances from all query parts to all selected frames. However, it is worth noting that based on the VBS log analysis, the context-aware ranker was rarely used.

VISIONE [4] uses a time quantization approach to support temporal queries, where each video is divided into intervals of 7 s. Given two queries, the temporal search is performed in two steps. First, the two queries are processed independently, and for each query, just the result with the maximum score is kept for each time interval. Second, the results of the two queries that are temporally close are then combined into pairs, and just a sample of distinctive pairs is kept in the final result list.
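The following is a simplified sketch of temporal fusion in the spirit of the time-quantization idea described above; the score combination and the maximum allowed gap between intervals are illustrative assumptions, not VISIONE's exact parameters.

```python
# Simplified sketch of temporal fusion via time quantization: keep the best
# score per 7 s interval for each sub-query, then combine interval pairs that
# are temporally close. The score combination (sum) and maximum gap are
# illustrative assumptions, not VISIONE's exact parameters.
from collections import defaultdict

def best_per_interval(results, interval_len=7.0):
    """results: iterable of (video_id, time_sec, score) for one sub-query."""
    best = defaultdict(float)  # (video_id, interval_index) -> max score
    for video_id, t, score in results:
        key = (video_id, int(t // interval_len))
        best[key] = max(best[key], score)
    return best

def temporal_fusion(results_q1, results_q2, max_gap_intervals=3):
    b1, b2 = best_per_interval(results_q1), best_per_interval(results_q2)
    pairs = []
    for (vid, i1), s1 in b1.items():
        for gap in range(1, max_gap_intervals + 1):
            s2 = b2.get((vid, i1 + gap))
            if s2 is not None:
                # the interval matching the first query precedes the second one
                pairs.append((s1 + s2, vid, i1, i1 + gap))
    return sorted(pairs, reverse=True)
```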

V-FIRST [75] simply allows the user to input two separate queries, then uses a weighted sum of the two queries to generate ordered pairs of images in a video and returns them for the user to browse.

VERGE [6] limits temporal queries to concepts; namely, the user is able to query for two concepts that should appear in subsequent shots of the same video. For each concept, a separate list of shot probabilities is created, then the intersection of concepts per video is computed, and finally, shots are re-ranked through an objective function.

vitrivr’s [28] temporal queries are formulated using two or more blocks and upon presentation of the results, users have to switch to a dedicated temporal query result view [30]. In contrast, vitrivr-VR’s [71] temporal queries are formulated by grabbing and ordering small representations of query terms in virtual space. Temporal scoring is performed as described in [28]. The results are then presented to the user as stacks of temporally aligned segments that are relevant to the query.

VIREO [55] first measures the cosine similarity for the two successive queries independently. The two keyframe score distributions are then aggregated using a sliding window to produce the final probability.

Exquisitor’s [40] temporal queries are defined by the user training two relevance feedback models, focusing on different aspects of the desired shot. Once the models are defined, the results of each model highlight the shots that come from the same video. In addition, temporal constraints can be utilized to further specify the desired target shot [37].

2.4 Relevance feedback

Once results are displayed in a video retrieval system, relevance feedback tools enable users to provide feedback in the form of positive or negative examples. Compared to kNN-based browsing, this feedback updates the model or the current score rather than issuing a new independent query.

vibro [33] only uses relevance feedback for AVS tasks, where all presented results have to be marked as positive or negative by the user after an initial query, and all positive keyframes are used to produce a subsequent result list.

CVHunter [49] implements a Bayesian-like approach [16] to accumulate relevance scores for each representative frame in the dataset. A temporal variant [53] of this relevance feedback approach was successfully tested in the system as well.

V-FIRST [75] has an optional pseudo-relevance feedback feature, where it assumes the top-k (with \(k = 10\)) initial results are relevant and reformulates the query by taking their centroid. This can be useful for clustering a small set of correct answers at the top ranks.
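A minimal sketch of this pseudo-relevance feedback step is shown below; whether the original query vector is blended back in (Rocchio-style weighting) is an assumption of the sketch, not necessarily V-FIRST's exact reformulation.

```python
# Minimal sketch of pseudo-relevance feedback: assume the top-k results are
# relevant and reformulate the query as their centroid. The optional blending
# weight alpha (Rocchio-style) is an assumption of this sketch.
import numpy as np

def pseudo_relevance_feedback(query_vec, item_vecs, k=10, alpha=0.0):
    """item_vecs: (N, d) matrix of item embeddings; query_vec: (d,)."""
    sims = item_vecs @ query_vec / (
        np.linalg.norm(item_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    top_k = np.argsort(-sims)[:k]
    centroid = item_vecs[top_k].mean(axis=0)
    # alpha = 0 reproduces the pure-centroid reformulation described above.
    new_query = alpha * query_vec + (1.0 - alpha) * centroid
    return new_query / (np.linalg.norm(new_query) + 1e-9)
```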

Exquisitor [40] uses relevance feedback as its primary interactive approach for search, where it trains a linear SVM model to construct a hyperplane to retrieve the most relevant items [38]. With multiple modalities involved, an SVM model for each modality is used to get candidates, which are then fused using rank aggregation. For VBS 2022, Exquisitor uses two visual modalities, semantic concepts from ImageNet and actions from Kinetics-700.
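The following sketch illustrates the general SVM-based relevance feedback idea with scikit-learn; the solver, parameters, and the omitted multi-modal rank aggregation are simplifications rather than Exquisitor's actual implementation.

```python
# Illustrative sketch of SVM-based relevance feedback: fit a linear SVM on
# user-marked positives/negatives and rank the collection by signed distance
# to the hyperplane. scikit-learn and top_k=42 (Exquisitor's display size) are
# illustrative choices; the actual solver and multi-modal fusion differ.
import numpy as np
from sklearn.svm import LinearSVC

def relevance_feedback_rank(pos_vecs, neg_vecs, collection_vecs, top_k=42):
    X = np.vstack([pos_vecs, neg_vecs])
    y = np.concatenate([np.ones(len(pos_vecs)), np.zeros(len(neg_vecs))])
    svm = LinearSVC(C=1.0).fit(X, y)
    # Larger decision values mean "more like the positive examples".
    scores = svm.decision_function(collection_vecs)
    return np.argsort(-scores)[:top_k]
```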

2.5 Query by example

Many VBS systems allow query reformulation, where users select an example item from the currently displayed candidate set. The essential part of this method is a similarity model assigning a score to a pair of items.

For visual similarity search, vibro [33] uses a Swin [48] architecture that has been fine-tuned for content-based image retrieval. The final embedding was binarized and reduced to 1024 bits per vector.
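As an illustration of how such binary codes can be searched, the sketch below binarizes embeddings by sign thresholding and ranks items by Hamming distance; the thresholding scheme is an assumption, since vibro's actual binarization method is not detailed here.

```python
# Hedged sketch of binary-code similarity search: embeddings are binarized
# (here by simple sign thresholding, which is an assumption) and compared
# with Hamming distance over the packed bit codes.
import numpy as np

def binarize(embeddings):
    """Map float embeddings (N, 1024) to packed binary codes (N, 128 bytes)."""
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

# Popcount lookup table for all byte values.
_POPCOUNT = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(1)

def hamming_search(query_code, codes, top_k=100):
    # XOR the packed codes, then count differing bits per item.
    distances = _POPCOUNT[np.bitwise_xor(codes, query_code)].sum(axis=1)
    return np.argsort(distances)[:top_k]
```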

For visual similarity of two items, CVHunter, AVSeeker, V-FIRST, VideoFall and AIClub@UIT all use the same CLIP feature vectors [62] as was used for text search.

VISIONE [4] supports both visual similarity search, where the user can use an image as a query to search for video keyframes visually similar to it, and a semantic similarity search, where an image can be used to retrieve video keyframes or video clips that are semantically similar to it. The visual similarity search is based on comparing GEM [63] features. For the semantic similarity CLIP2Video [20] and TERN [58] are used for searching video clips and video keyframes, respectively.

In VERGE’s [6] system, the visual similarity search module enables the retrieval of visually similar content starting from a query image and considers feature vectors produced from a fine-tuned GoogleNet architecture [61] and an effective IVFADC indexing structure [35].

vitrivr [28] provides two modalities for query-by-example. One enables users to simply upload a sample image to find visually similar items, and the other operates via a “more-like-this” button positioned next to results.

vitrivr-VR [71] allows querying by frames of already retrieved videos through a similarity search. The feature used for this more-like-this search can be configured and was set to simple color and edge features for VBS 2022.

VIREO [55] calculates the cosine similarity of the dual-task model’s [78] embedding feature and indexes the KNNs for visual similarity search.

diveXplore [43] provides content similarity search with GoogleNet neural codes from ImageNet-1000, using the Manhattan distance to the selected example image.

2.6 Other

This section describes features and approaches which do not fall directly under any previous categories.

CVHunter [49] allows a fast inspection of the top-k items from each video in the result set by pressing a number key (defining the k) on the numeric keypad. For AVS tasks, the tool supports fast selection of all visible items as well as selection of a database subset distinct from that of the other team member.

IVIST [42] exploits a scene-text search function, based on PixelLink [17] and ASTER [70], that retrieves frames containing the query text in the depicted scene, so that users can find frames where specific scene texts exist. VNUHCM and AIClub@UIT also follow the approach of using textual information in video frames for retrieval.

VideoFall [60] introduces a method of submitting results in its video retrieval system that involves two distinct interfaces prior to submission. The first interface is designed for users to input queries and explore the keyframe collection as usual, whereas the second interface is designed for users to verify the data received from the first interface and subsequently submit the final frame to the competition server.

VERGE [6] utilises a human and face detection module that aims to count the number of individuals in each frame by identifying their silhouettes and heads using the CrowdHuman dataset [69] and the YOLOv4 [12] deep neural network.

Both vitrivr and vitrivr-VR employ a novel query-by-pose approach [28, 71], allowing the specification of poses seen in the target clip. This pose-based query mode uses key-points extracted from segment keyframes using OpenPose [13]. vitrivr allows users to specify a pose by dragging the key-points on a 2D canvas, while vitrivr-VR allows posing the key-points in 3D space, which are then projected with perspective onto a camera-like canvas.

diveXplore [43] supports search for texts in OCR results detected with CRAFT [7, 8].

The ViRMA [19] prototype system employs a novel VR interaction approach by utilising the M\(^3\) data model [23], which takes the media objects from the VBS dataset and maps them into a multidimensional media space based on their metadata. Users can then visualise the video data by filtering and dynamically projecting the multidimensional media space to the more familiar 3D space and then can explore this visualisation using virtual reality [19]. This type of 3D visualisation is effective at browsing and summarising a collection, but is less effective at search, which is likely why the ViRMA system did not perform well in VBS 2022.

2.7 Browsing

Table 1 contains four popular browsing approaches applied at VBS. The Ranked List simply refers to any visualization of the ranked result set. Video Summary refers to displaying a list of selected frames from a video. Video Player/Preview also refers to displaying frames, but at a higher frame rate (not just representative frames). Finally, Temporal Context refers to visualizations of result frames together with their temporal neighborhood. Systems that have notable variations on these four browsing approaches are discussed in this section.

vibro [33] allows browsing of result lists by displaying the 4000 most relevant keyframes to the current query in a list or on a 2D sorted map. Additionally, the entire keyframe collection can be explored with the help of a hierarchical graph [32]. A single click on a presented keyframe opens the corresponding video in the video section of the UI, where all keyframes are listed in chronological order, and the video can be viewed with a video player. Double-clicking a keyframe creates a new result list and jumps to the keyframe's location in the exploration graph section.

VISIONE [4] groups the results by video so that one row (containing up to 20 frames) for each video is displayed in the browsing interface; the video rows and the frames within them are sorted according to the score given by the retrieval model. There is a menu on each frame that allows the user to run similarity searches, see the entire video starting from the selected frame, or see a preview of the video in a neighborhood of the selected frame.

IVIST [42] displays the top-100 keyframes at once, organised into pages. Short video clips (\(<5\) s) containing each keyframe are displayed as GIFs to provide temporal context. A keyframe can then be selected to open a video player.

AVSeeker [41] generates a ranked list of the top 2,048 keyframes that best match the query and groups them by video. The videos are then ranked by the average score of their top 3 best-matching keyframes to generate the final ranked list. These highest-scoring keyframes are also used as the preview of their corresponding video in the final result. Once a preview is clicked, a menu pops up that allows the user to see all matched keyframes, all keyframes of the video, and the video itself.

To expedite the process of elimination, V-FIRST [75] groups results by video up to a specific number of frames per video. Frames with high similarity are also removed to increase the variety of results.

VNUHCM [54] allows users to control the number of frames that are displayed. For each frame that is selected, a small video player of the corresponding timestamp is shown for users to interact with.

To allow fast and visually aided browsing within videos, vitrivr-VR provides a multimedia drawer view [71]. This video segment view, which resembles a VR drawer containing the most representative frames of the segments of a video, allows users to browse the segments of a video in 3D space simply by moving their hand through the drawer.

Exquisitor [40] displays the top 42 keyframes from its ranked list to the user. The user can either interact with the displayed keyframes to update the relevance feedback model, which produces a new ranked list to draw items from, or continue going through the current ranked list one keyframe at a time or fetch the next 42 keyframes. For this VBS, Exquisitor does not provide a video player for the shots. Instead, it displays a shot with 1 to 5 frames depending on its length. In addition, the frames of the next two shots are also displayed below the selected shot [40]. Aside from the shot summary, a timeline browser for the video is available either as a vertical grid or a horizontal slider [39].

3 Overall results

Fig. 1 Overall scores per team and task type

In this section, we will discuss the final results of the VBS 2022 competition in detail. For this purpose, we analyze all three task types separately: KIS-v, KIS-t, and AVS (see Sect. 1).

We start with the overall scores, which are shown in Fig. 1 for all teams. We can identify three major groups of teams. The first group consists of the four top teams, who achieved more than 230 points. Among them is the vibro team, who collected the maximum score of 100pts in each of the three sessions: KIS-v, KIS-t, and AVS. vibro is closely followed by CVHunter and VISIONE, who reached similar scores for KIS-t (100pts and 90pts) and KIS-v (96pts and 100pts), but got fewer points for AVS (81pts and 74pts). The IVIST team, the last in this group, was also able to get the maximum score for AVS but achieved substantially fewer points in KIS-t (59pts), while scoring well in KIS-v (90pts).

The second group contains the teams that scored between 137 and 210 points (AVSeeker, V-FIRST, VideoFall, VERGE, vitrivr, VNUHCM, VIREO, AIClub, and vitrivr-VR). For these teams, we can see a much lower, roughly linearly decreasing score, with different difficulties per team. For example, the vitrivr, VIREO, and AIClub teams were challenged by KIS-t, where they achieved only 30pts, 30pts, and 21pts, respectively. VIREO also had difficulties with KIS-v, where it achieved only 31pts.

Finally, in the third group, there are teams that were only able to collect up to 77 points: diveXplore, Exquisitor, and ViRMA. While Exquisitor still performed reasonably in KIS-v (40pts), ViRMA could only score in KIS-t (9pts) and AVS (13pts). diveXplore scored in all three sessions but only with a low number of points (28pts, 23pts, and 24pts for KIS-v, KIS-t, and AVS, respectively).

Fig. 2 Distribution of correct and incorrect submissions for known-item search tasks per team

From the number of submissions for KIS (Fig. 2), we can see that the teams in the first group were able to correctly solve 13 tasks in KIS-v, with only one wrong submission from CVHunter. The situation was different for KIS-t, where vibro submitted six wrong results, CVHunter one, VISIONE three, and IVIST two.

The teams in the second group were also very successful with KIS-v tasks but had substantially more wrong submissions (e.g., V-FIRST solved 12 KIS-v tasks correctly, but also had six wrong submissions). The AIClub team is an exception for KIS-v in this group: they solved 12 tasks correctly, with only two wrong submissions. For KIS-t, however, the situation was much worse than in the first group: for many teams, the number of wrong submissions is higher than the number of correct ones (the exception being VERGE, who solved six KIS-t tasks correctly, with only one wrong submission).

In the last group of teams, the number of correct KIS submissions is generally low, except for Exquisitor, who could correctly solve six KIS-v tasks. It seems that ViRMA had serious difficulties with KIS-t, for which they submitted seven wrong submissions, while only one task could be solved correctly.

Fig. 3 Distribution of correct, incorrect, and undecidable submissions for Ad-hoc video search tasks per team

When looking at the AVS tasks (Fig. 3), it is obvious that the IVIST team submitted the most correct results (1851), closely followed by vibro with 1568 submissions, while CVHunter and VISIONE submitted only 1095 and 1038 correct results, respectively. Interestingly, most teams in the second group also found many correct items for AVS, most notably vitrivr and VIREO, who submitted 1310 and 1437 correct AVS items, respectively. The most wrong submissions were made by the top scorer in this session (IVIST, with 425 wrong submissions). The number of undecidable submissions was generally low (at most 62), which is not only evidence of strong team performance in general, but also of the confidence of the AVS judging team.

Fig. 4 Distribution of time until the first submission per team and task type

It is worth noting that the submission time distribution provides deeper insight into how proficiently each team's system performed during the real-time competition: the faster a system locates the target, the more efficient it is. As can be seen from Fig. 4, most teams had the shortest search times for the AVS tasks, regardless of correctness, followed by the KIS-t and KIS-v tasks.

Fig. 5 Distribution of time until the (first) correct submission per team and task type

Figure 5 illustrates the distribution of time until the first correct submission across all teams and task types, excluding unsolved attempts. For the AVS tasks, the time is almost identical to the time to the first submission, meaning that many early submissions are correct. In contrast, the amount of time needed to find the correct answer for the other two task types varies.

4 Analysis of KIS logs

As in previous years, during the competition, each team was asked to log the user interactions and the result sets of their queries for each task. Each team was given the choice of logging this data locally or sending it directly to the DRES competition server using a specific log format. In this section, we present the analysis of these logs to better understand the ranking performance of each system during KIS tasks.

The logs are in JSON format and each comprises the team identifier (in some cases, also the user identifier), timestamp, query description, and the list of top-ranked items that were retrieved for the query at hand. We report the analysis of result and query logs only for a subset of six teams (namely vibro, CVHunter, VISIONE, VERGE, vitrivr, vitrivr-VR). Unfortunately, the other teams did not log results in the common format or had incomplete/missing logs.

4.1 Log pre-processing

Of the six teams with logs, three saved the logs locally (vibro, CVHunter, VISIONE), while the others (VERGE, vitrivr, vitrivr-VR) sent the logs to DRES. We normalized the event timestamps of locally saved logs to the UNIX timestamp format used in DRES. Concerning possible clock shifts between DRES and local clocks, we mostly rely on the synchronization performed by the OS. However, we carefully checked shifts in the submission timestamps and, according to our analysis, there might be only small differences (about 1 s). Therefore, we conclude that the presented times are consistent, and the slight shift does not affect the following analysis. We filtered the data to contain only log entries that fall within the task duration interval and also removed all logs that come after the correct submission of the respective team.

Note that the set of collected logs may be incomplete (due to external circumstances or team logging choices), and thus our analysis represents an approximation of all interactions and results of the tools. For example, VERGE experienced network problems during the competition and some of its logs were sent but not received by the DRES server, as evidenced by the fact that there are no logs of this team for an entire task (T6). Moreover, different teams used different logging parameters and units of retrieval. vibro logged only the top-1000 results for each query, while other teams logged the top-10,000 results. In our analysis we considered only the top-1000 results to keep the logging scale the same for all teams. Concerning the logging unit, vibro, CVHunter, and VISIONE logged frames; VERGE and vitrivr logged segments (predefined shots and custom shots with time intervals, respectively); vitrivr-VR logged both frames and predefined segments. In order to standardize the units of retrieval, we transform each of them into a temporal form. Specifically, if a frame is given, we convert the frame number into the corresponding physical time using the frame rate metadata associated with each video, and check whether it falls inside the ground-truth interval. If a shot id has been logged, we convert the shot id to the corresponding temporal endpoints inside the video using the provided shot metadata, and check whether the middle time of the submitted interval is inside the ground-truth interval. We note that during the competition a live judge was allowed to manually accept submissions from the same shot just outside (less than 3 s from) the KIS ground-truth segment boundary. However, these cases are rare, and for the analysis of result log item correctness, the original official ground truth was used.
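A minimal sketch of this unit normalization is given below; the argument layout is assumed for illustration.

```python
# Sketch of the unit normalization described above: logged frames are mapped
# to a time via the video's frame rate, logged shots to their temporal
# midpoint, and both are checked against the task's ground-truth interval.
# The argument layout is assumed for illustration.

def frame_is_correct(frame_number, fps, gt_start, gt_end):
    t = frame_number / fps               # frame number -> seconds
    return gt_start <= t <= gt_end

def shot_is_correct(shot_start, shot_end, gt_start, gt_end):
    mid = (shot_start + shot_end) / 2.0  # middle time of the logged shot
    return gt_start <= mid <= gt_end
```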

In the following analysis, it is also important to capture submissions not only at the level of the whole team but, first and foremost, at the level of the specific user who used the tool. This is important because collapsing the statistics of both users of a team, as if they were a single user, may introduce inconsistencies.

Throughout the following analysis, for the systems that logged the user ID (vibro, CVHunter, VISIONE, vitrivr-VR) and for each task, we labeled the users as best and other, where we define the best user as the one of the two that, for that particular task, obtained—ordered by decreasing importance—(i) the best shot rank, (ii) the best video rank, (iii) the shortest time to retrieve the best shot, (iv) the shortest time to retrieve the best video. Each metric serves as a tie-break in case all the previous ones are equal between the two users. In this formulation, the shot rank has primary importance: if the shot ranks differ, the best user is simply the one with the lower shot rank. If any of these metrics is missing for a specific user and task, we set it to its maximum value (\(10^3\) in the case of ranks and a time longer than the task duration in the case of times).
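The selection can be expressed as a lexicographic comparison, as in the following sketch (field names are illustrative assumptions):

```python
# Sketch of the "best user" selection: users are compared lexicographically by
# (best shot rank, best video rank, time of best shot, time of best video),
# with missing values replaced by their maxima. Field names are illustrative.
MAX_RANK = 10**3

def best_user(per_user_stats, task_duration):
    """per_user_stats: {user_id: dict with optional keys
    'shot_rank', 'video_rank', 'shot_time', 'video_time'}."""
    def key(user_id):
        s = per_user_stats[user_id]
        return (
            s.get("shot_rank", MAX_RANK),
            s.get("video_rank", MAX_RANK),
            s.get("shot_time", task_duration + 1),
            s.get("video_time", task_duration + 1),
        )
    return min(per_user_stats, key=key)
```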

Table 2 Textual KIS tasks

4.2 Comparison of retrieval models

Fig. 6 Best rank of correct items appearing in result logs

One area of interest in comparing the system retrieval models is analyzing whether a correct item (frame or shot) of the searched video segment appeared in the top positions of the retrieved results. In this respect, Fig. 6 shows, for each system, the best-achieved rank of a correct item (frame or shot) before submission across all 23 KIS tasks (ten of which are textual KIS, reported in Table 2). The distribution of the minimum achieved rank by the best users reflects the overall team scores. For example, the best rank is below 100 in about 87% of cases for vibro and CVHunter, 78% for VISIONE, 68% for VERGE, 65% for vitrivr, and 30% for vitrivr-VR. However, the minimum achieved rank of the other user is below 100 in about 52% of cases for vibro, 35% for CVHunter, 48% for VISIONE, and 9% for vitrivr-VR. The considerably worse performance of the other users for some tasks may have two main causes: (i) we do not consider the logs after a team's correct submission; (ii) a particular user may formulate better initial queries for some tasks than the other user.

Fig. 7 The table reports for each tool with logs (i) the best achieved rank of a correct item (frame or video shot); (ii) the time \(t_f\) in seconds when the best-ranked correct item was retrieved; (iii) the best rank of any frame/shot of the correct video (but not necessarily the correct video segment); (iv) the time \(t_v\) in seconds when the best-ranked video frame/shot was retrieved; (v) the time \(t_{cs}\) of the tool's correct submission. Red values are the best detected ranks of searched video frames/shots if the target video segment was not present in the logged results for a task. Green cells show the best achieved correct item with a rank below 100. Yellow cells show the best achieved video item with a rank of at most 10. Red cells indicate browsing failures, where a correct item was in the first 1000 results but was not submitted. Orange cells are other browsing failures, where the correct video was present—but no correct frame or shot was present—and no correct submission was made

In Fig. 7 we report, for each tool and each KIS task,

  • The best achieved rank of a correct item (frame or video shot of the target video segment) in the top logged results of the best user in the considered task (before the team correct submission, if any);

  • The time \(t_f\) when the best rank of a correct item – as described above – was obtained;

  • The best achieved rank of any item (frame or shot) of the correct video by the best user in the considered task (before the team correct submission, if any). Note that, in this case, the item may not overlap the target video segment even if it belongs to the correct video;

  • The time \(t_v\) when the best rank of any item of the correct video was obtained;

  • The time \(t_{cs}\) of the correct submission.

We can observe that the ranks and the overall competition scores of the top three teams roughly match. In fact, vibro, CVHunter, and VISIONE are able to find the correct video in the first 10 results more consistently. In general, browsing failures (red and orange cells) are most evident in the case of textual KIS or, as can be easily guessed, when the best rank of a correct item is high. However, it is also interesting to note that for some tasks and tools, a correct video was in the top 10 results, but it was not correctly identified and submitted (e.g., vibro in task T2, VISIONE in task T11, VERGE in task V10, vitrivr in tasks T1 and T8, vitrivr-VR in task V2).

We also emphasize that the best rank and the correct submission time can come from different team members. Hence, it might happen that the time between the occurrence of the best item and the submission is unrealistically low. For example, in task V10, vitrivr-VR had its best video rank (97) at time 104 s and a submission at time 106 s. On the other hand, long submission delays for both users just confirm issues with browsing.

Regarding vitrivr-VR, please also note that it uses an asynchronous workflow and allows users to browse very easily within multiple result sets at the same time, as well as within entire videos from different queries. The current logging format does not always allow the path from query to submitted result to be determined uniquely, and this explains why correct results submitted by vitrivr-VR do not always appear in the top-ranked logged query results or appear at a very high rank.

Some teams (e.g., VISIONE, vitrivr, and vitrivr-VR) display the search results by grouping together those from the same video and showing a limited number of items for each video in the browsing interface. Therefore, many correct submissions may have been generated from a video-level hit and not by scanning the top results from the highest to the lowest score (i.e., in the order in which they were logged and used in the analysis reported in this section). See, for example, VISIONE in task V7: even though the best correct item rank is rather high (751), the correct video was displayed on the first page of the results (7th row in the browsing interface).

Fig. 8 Relation between the rank of the first occurrence of a shot in the result logs and the time delta to the correct submission, for visual (a) and textual (b) KIS tasks. NCS stands for Non-Correct Submissions and corresponds to all the correct frames found in the result logs that were not correctly submitted (either because of running out of time or incorrect submissions). The blue line is found through linear regression and is accompanied by 95% confidence intervals

4.3 Browsing efficiency

Figure 8 shows the relation between the rank of the first appearance of a correct item (frame or shot) in the logged result set and the elapsed time in seconds between this first appearance and the correct submission, if any, both for visual KIS (left-hand graph) and textual KIS (right-hand graph) tasks. Note that these graphs give an approximation of the real browsing time because (i) it is possible that a correct submission was made through inspecting the video rather than the top-ranked frames/shots, and (ii) the team user who first retrieved a correct item may not be the same one who submitted the final correct answer (information on which team member made the correct submission is not available for all teams). Nevertheless, these graphs give some insight into how long it took users to find a correct item once it was present in the result set. This time clearly also depends on the specific system's browsing capabilities and on user behavior (e.g., some users may prefer to check just a limited number of results and eventually reformulate the query instead of exhaustively inspecting the result set). Overall, we observed that—as expected—the time between the first appearance of a correct item and the submission tends to increase with the rank of the item. However, in textual KIS tasks, it happens more often that the rank of the first appearance is low (even 1), but the operator takes a long time before submitting a correct result (which, we recall, he/she has never seen before and knows only from a textual description). For example, in the graph in Fig. 8b, we can see five cases where the rank was less than 25, but the operator took more than two minutes to make a correct submission, and in one case, no correct result was submitted at all. These outliers are less frequent in the case of visual tasks—only one team in a single task, VERGE in V10, had a correct item in the 15\(^{th}\) position (obtained only 20 s after the start of the task) but did not submit any correct result in the end. More generally, it is interesting to note that the variance of the time delta increases with the rank, as different strategies may be used by the team members, e.g., exhaustive inspection of a result set, query reformulation, or video-level browsing, to name a few.

4.4 Querying modalities

In this section, we aim to provide a more in-depth look at the query modalities the individual teams actually used during the competition. To do so, we divided the query logs into six categories by summarizing the underlying analysis methods of the different teams. The resulting categories are Text, Image, ODLS, OCR, ASR, and Color. Text includes joint-embedding queries for most of the logging teams as well as VERGE's concept search. Image groups methods such as query-by-example through visual similarity search and relevance feedback with global image embeddings. ODLS stands for object detection, localisation, and segmentation, and includes queries that specified a number of objects, or objects and their positions in an image. Since ODLS queries of the VISIONE system were often used in combination with the text modality, we also keep multi-modal combinations in the ODLS category. Examples are VISIONE's concept search and VERGE's number-of-objects filter. Although OCR (optical character recognition) and ASR (automatic speech recognition) searches were formulated with text, the underlying analysis methods are fundamentally different from the other methods, which is why we assigned them two additional categories. The last query type, Color, groups methods where color alone was used for a search. Please note that >1000 in tables/graphs may also mean that the searched target was filtered out.

Table 3 Percentages of query type use across all KIS tasks for each individual team

Table 3 depicts per-team relative usage of individual query categories. Throughout all teams, pure text queries are by far the most used variant, while image queries are the second most used. Obviously, these results are affected by what modalities are actually implemented by individual teams, but nevertheless, text queries dominate considerably.

Fig. 9 Comparison of query type use across all teams for KIS-v and KIS-t tasks. Results are grouped by query type and task type

Additionally, Fig. 9 compares query modality usage across the two task types, KIS-t and KIS-v. Even though text is still the most frequently used category for both task types, query modalities such as Image and ASR gain popularity in the KIS-v tasks, where audio-visual information is presented.

Nonetheless, a question arises whether text queries are also effective, apart from being popular. To clarify this, Table 4 depicts the ranking of correct shots/frames per team and query type. In most cases, the performance of text queries is similar to or better than that of other query modalities. One notable exception is CVHunter, where Image queries achieved 43% of results within the top 100, while only 19% of text queries were within the top 100. In this case, however, CVHunter often utilized relevance feedback queries, which incrementally refine the previous text search results, so the performance of Image queries is in a sense pre-conditioned by the performance of text queries.

4.5 Querying density

Table 4 KIS tasks query statistics per team and query type

In this section, we focus on how individual teams divide their time between querying and other activities (e.g., browsing). First, we examined whether the querying intensity changes over the course of the task duration. We divided each task into 1-minute intervals and counted the volume of per-team queries in each interval.
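A minimal sketch of this per-minute bucketing is shown below (the data layout is an illustrative assumption):

```python
# Sketch of the per-minute query counting used here: query timestamps
# (seconds from task start) are bucketed into 1-minute intervals.
# The data layout is an illustrative assumption.
from collections import Counter

def queries_per_minute(query_times_sec, task_duration_sec):
    buckets = Counter(int(t // 60) for t in query_times_sec
                      if 0 <= t <= task_duration_sec)
    n_minutes = int(task_duration_sec // 60)
    return [buckets.get(m, 0) for m in range(n_minutes)]
```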

The count of per-team queries decreases from approx. 3.6 in the first minute to approx. 2.1 in the last minute. This may indicate that in later stages teams focus more on browsing, while earlier they tend to reformulate their searches more often. Nevertheless, the differences are not substantial, so we can also examine per-team querying density in general. Table 4 contains the mean volume of queries per team per minute of their active participation. Here, the two main outliers are VISIONE, who made on average 5.4 queries per minute, and vitrivr-VR, who logged only 1.9 queries per minute. In the case of vitrivr-VR, the main cause is the tool design itself, which is much more focused on browsing than its competitors. In the case of VISIONE, the cause is that queries are evaluated “on-the-fly” at any user interaction with the search interface (even just moving or resizing an object on the canvas), without the need to explicitly click a “search” button.

Fig. 10 Time to initiate the search with the first query. Results are grouped by team and task type

Finally, we also measured users’ reaction times, i.e., how fast they constructed the first query. Figure 10 depicts this statistic per team and task type. Notably, vitrivr-VR experienced significantly higher times for their initial queries than the rest of the teams. This is not unexpected, particularly considering the rise of text-based, cross-modal retrieval, since text entry in VR is still much slower than using conventional keyboards.

We also assumed that a textual description is faster to process and that initial query times would therefore be significantly smaller for textual KIS tasks. While this was true for some teams (vibro, VISIONE), the results were not conclusive in general. In fact, it is important to note that certain teams, including VISIONE and vibro, occasionally did not manually type the textual query during textual KIS. Instead, they copied and pasted the query from the DRES visualization interface used to view the tasks. This copy-and-paste action was allowed but not logged, so it is impossible to determine which teams relied on this method and how frequently they did so.

4.6 Analysis of textual queries

Fig. 11 Distribution of the number of words and string lengths for textual queries. Grouped by individual teams and ranks of the correct results

In this subsection, we focus on the properties of the textual queries. Specifically, we evaluated the length of textual queries with respect to the number of words and the number of characters, whose mean values are depicted in Table 4. Note that vitrivr logs do not contain the text of the query, so we exclude them here. We observed some notable differences in the textual querying strategies of individual teams: both VERGE and vitrivr-VR used on average shorter queries (w.r.t. both metrics), while VISIONE usually constructed much more complex queries, on average twice as long as those of the next team.

There are two main reasons for this observation. First, VISIONE provided more extended and detailed textual descriptions of the searched scene compared to other teams. Second, a significant portion (about 83%) of VISIONE's queries were temporal textual queries (descriptions of two different scenes of the same video clip), which were concatenated into a single textual query in this analysis. We also note that the vibro team did not log temporal queries, as temporal fusion was an on-the-fly computed interface option applied to two independent queries.

We were also interested in whether the additional effort of constructing larger queries pays off, i.e., whether better rankings of correct items/videos are achieved. Nonetheless, as teams use different ranking models and per-team querying strategies also differ substantially, we have to resort to per-team comparisons. Figure 11 depicts per-team enhanced boxplots of queries for which the correct shot was within, or outside of, the top-1000 results. In general, the differences were smaller than we expected; even so, for the vibro, CVHunter, and VERGE teams, within-top-1000 queries were significantly larger (w.r.t. both the number of words and the number of characters; t-test p-values <= 0.04) than outside-top-1000 queries. Especially for vibro, the pattern was quite notable. However, just producing larger (more descriptive) queries does not necessarily lead to better results. This is illustrated by VISIONE, whose queries were the largest in general, but whose within-top-1000 and outside-top-1000 query lengths showed no statistically significant difference.
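The length comparison can be reproduced with a standard two-sample t-test, as in the following sketch; the equal-variance default and the data layout are illustrative choices, not the exact analysis configuration.

```python
# Sketch of the length comparison reported above: a two-sample t-test on
# word counts of queries whose correct shot appeared within vs. outside the
# top-1000 results. The equal-variance default of ttest_ind is an assumption.
from scipy.stats import ttest_ind

def compare_query_lengths(queries_within, queries_outside):
    """Each argument is a list of query strings for one team."""
    words_within = [len(q.split()) for q in queries_within]
    words_outside = [len(q.split()) for q in queries_outside]
    return ttest_ind(words_within, words_outside)  # (statistic, p-value)
```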

Fig. 12 Ranking distribution of correct shots w.r.t. query sequence position. “First” and “Second” denote the first text query per task and the next text query (i.e., the first reformulation/extension). “Last” denotes the very last text query logged for a particular user and task. Note that only query sequences with length \(\ge \) 3 are depicted

We also examined how much the initial results could be improved via subsequent text query reformulations. For this, we grouped all queries collected for each user and task and ordered them from first to last. We denote these as query sequences and group queries w.r.t. their position within the sequence. Figure 12 depicts enhanced boxplots of the ranks of correct shots. It can be seen that textual reformulation may lead to some notable improvements. On the other hand, the results also reveal numerous browsing errors, where correct shots were within the top 10 or top 100, but queries were reformulated anyway. This may indicate the need to focus more on the browsing capabilities of individual tools to prevent such oversights.

Table 5 AVS tasks

5 Analysis of AVS tasks

As in previous years, VBS organized another session focusing on ad-hoc search tasks. Specifically, 8 AVS tasks (see Table 5) were performed, where teams were required to submit as many correct shots as possible. Figure 13 shows the shares of correct submissions of all teams in all tasks. It is apparent that no single team dominates all tasks. For example, looking at the top two AVS systems, vibro was considerably more effective than IVIST in task a01, while in task a10 the situation was reversed. Nevertheless, all top-performing AVS systems (vibro, IVIST, VIREO, and CVHunter) show an ability to find a non-trivial share of the multi-set of correct submissions.

Figure 14 shows how the overall number of received correct submissions grows over the course of each AVS task, while Fig. 15 shows the number of submissions in specific time slots.

In both graphs, identical submissions from n different teams count as 1. Hence, the graphs show the progress in the detection of new, unique correct scenes. The trend is similar for all tasks: after an initial slow period (about 40 s, except for task a07), there is continuous growth with occasional peaks.

Figures 16 and 17 present the number of submissions and the first k submissions by n teams. It is apparent that the overall number of submissions is always higher than the overall number of correct submissions (only unique submissions are counted). Regarding times to first submissions, there is no clear difference between the time until the first submission and the time until the first correct submission. Similarly, except for task a09, the times to the first submissions by 50% of the teams are quite similar to the corresponding times in the correct-submission graph. However, the times to the tenth submission by 50% of the teams become lower than the times to the tenth correct submission by 50% of the teams in tasks a02, a05, and a06. To sum up, there are differences in the complexity of AVS tasks. Some tasks are easy to solve for many teams, while others are considerably more challenging and thus also interesting for VBS-like interactive search evaluations.

The VBS 2021 report [31] presented an observation that in several AVS tasks, many teams were in disagreement with one judge. Since the text query preparation for VBS 2022 was more thorough (see Sect. 6), we also analyzed the agreement/disagreement statistics in Table 6. Compared to the previous year, the data do not reveal a significant level of disagreement across seven tasks. Indeed, apart from a few exceptions (e.g., eleven teams against one judge in task a02), the teams mostly agreed with the judges' decisions. Only in one task, a09, did the teams disagree with the judges more often. However, this might also be caused by the task's difficulty and attempts to submit at least something (the overall numbers are low).

Fig. 13 Share of AVS submissions judged as correct per team and task

Fig. 14 Cumulative correct video submissions over time during an AVS task. Log scale on y axis

Fig. 15 Correct video submissions over time during an AVS task

Fig. 16 Selected AVS metrics per task. Higher y-axis values indicate that teams found it easier to find results to submit for a task

Fig. 17 Selected AVS metrics per task, looking at correct submissions. Higher y-axis values indicate that for a given task, it is easier to find results that judges deem correct

Table 6 Number of teams in agreement/disagreement with judges

5.1 New direction for Ad-hoc search at VBS

For many years, AVS tasks were evaluated at VBS in a similar fashion as at TRECVID. Teams were supposed to submit as many correct shots as possible, often overloading judges with thousands of submissions. The scoring function was designed to provide a high score for precision and (pooled) recall. However, there were also opinions questioning the current way AVS tasks are evaluated.

Although 100% recall with high precision of the found shots is an important goal in various domains (e.g., security, endoscopy), for VBS it is also highly important to first localize the videos containing a correct shot. In other words, ad-hoc search can be divided into two task categories—localization of correct videos and effective search (e.g., advanced browsing) within those videos. From the organizational perspective, requiring just one piece of evidence of video correctness (i.e., only the first correct shot) decreases the workload for judges and also simplifies discussions about a fair scoring function. Therefore, we have decided to focus on the video localization part of AVS tasks at the next VBS events.

A possible new scoring formula for the AVS score per team can follow the objective of finding as many videos as possible, where the team must submit one correct shot from each video (i.e., evidence for each video). We note that the formula should integrate a penalty mechanism preventing floods of unverified submissions. In addition, a maximum limit on submissions per team could be introduced as well (not considered currently). An example approach to defining the score \(f_{t}\) of a team t in an AVS task could be as follows:

$$\begin{aligned} f_{t}&=1000 \cdot \max \Big (\frac{1}{|C|}\sum _{v \in \mathcal {V}_t}\big (c_v-i_v\cdot p\big ),\,0\Big ), \text { where}\\ i_v&:= \text {number of incorrect submissions before the first correct submission for video } v\text {,}\\ &\qquad \text {number of submissions for } v \text { otherwise}\\ c_v&:= 1 \text { if there is a correct submission for } v\text {, } 0 \text { otherwise}\\ \mathcal {V}_t&:= \text {set of videos with a submission by team } t\\ p&:= \text {submission penalty constant (e.g., } 0.2\text {)}\\ C&:= \text {set of distinct correct videos across all teams} \end{aligned}$$
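An illustrative implementation of this scoring function is given below; the submission data layout (per-video lists of judged verdicts in order of arrival) is an assumption for the sketch.

```python
# Illustrative implementation of the proposed AVS scoring formula above. The
# data layout is an assumption: per video, the team's submissions are given
# in order of arrival as booleans (True = judged correct).
def avs_score(team_submissions, num_correct_videos_all_teams, p=0.2):
    """team_submissions: {video_id: [bool, ...]} for one team;
    num_correct_videos_all_teams: |C|, distinct correct videos over all teams."""
    total = 0.0
    for verdicts in team_submissions.values():
        if True in verdicts:
            c_v = 1
            i_v = verdicts.index(True)  # incorrect submissions before the first correct one
        else:
            c_v = 0
            i_v = len(verdicts)         # all submissions for this video count
        total += c_v - i_v * p
    return 1000 * max(total / num_correct_videos_all_teams, 0.0)
```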

Although there are data to estimate the penalty from the VBS 2022 competition, we plan to carefully set this penalty based on more experiments. The reason is that the available data might be biased with respect to the AVS scoring formula used at VBS 2022.

While possibly suffering from the bias of teams optimizing for the scoring function used at VBS 2022, an early analysis of this data shows an average of 1.39 incorrect submissions before the first correct submission per task, team, and video. The maximum number of incorrect submissions before the first correct submission for the corresponding video was 19. With the new scoring function in place, it will be interesting to see whether more careful inspection of submissions can be encouraged.

6 Lessons from text query definition

In the purely physical VBS editions up to 2020, the judges for AVS tasks were seated next to each other in the room, and questions concerning ambiguities in the queries were resolved informally among them. As reported in [31], a briefing with the judges was conducted for the first virtual VBS in order to discuss and refine the AVS queries, but it was found insufficient to ensure consistency of judgements. Thus, the briefing of the judges for VBS 2022 was extended to consist of (i) a session discussing and refining the AVS query texts as in 2021, (ii) a similar session refining the KIS-t queries, and (iii) a dry-run session in which the judges tried to solve the AVS tasks themselves and provided feedback about the query texts. Both the discussion and dry-run sessions were held as web conferences, with six judges participating. More details and the evolution of this process can be found in Bailer et al. [9].

The procedure for creating the textual queries was unchanged from previous editions of VBS (see [51] for details). The queries were provided in a shared document that was made available to the judges in advance of the sessions; the ground truth, however, was not included in the document.

Discussion session. The discussion session covered both AVS and KIS-t queries. For AVS queries, the query text was read together with the judges, who were asked whether they could imagine scenes covered by the query and to request clarification of any alternative interpretations that came to their mind. For proposed changes, a reformulation that found consensus in the group was chosen. Where necessary, additional notes were recorded for later reference by the judges.

For KIS-t queries, the query was read together with the judges and the target clip was shown. As it was unclear which order would be better, both reading the query first and watching the clip first were tried; reading the query first seemed preferable. Then the required changes and clarifications of the queries were discussed, watching the clip again if needed.

Dry-run. For the dry-run, SOMHunter V2 [77] was used as the browsing tool. The existing Docker deployment of the tool (see footnote 4) was modified to run a set of independent instances (one per judge) on the same machine. A startup script took care of modifying the configurations so that the Docker containers required by each instance would use a dedicated set of ports. The containers were hosted on an Amazon Web Services EC2 machine with 64 GB of RAM.
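
As an illustration of the kind of multi-instance setup described above, the following Python sketch starts several independent containers, each bound to its own pair of host ports. The image name, port numbers, and container layout are hypothetical; the actual SOMHunter V2 deployment uses its own Docker configuration and startup script.

```python
import subprocess

# Hypothetical image and base ports; not the actual SOMHunter V2 configuration.
IMAGE = "somhunter-v2:latest"
NUM_JUDGES = 6
BASE_UI_PORT = 8080   # host port for the web UI of the first instance
BASE_API_PORT = 9080  # host port for the backend API of the first instance


def launch_instances():
    """Start one independent container per judge on dedicated host ports."""
    for i in range(NUM_JUDGES):
        name = f"judge-{i}"
        ui_port = BASE_UI_PORT + i
        api_port = BASE_API_PORT + i
        subprocess.run(
            [
                "docker", "run", "-d", "--name", name,
                "-p", f"{ui_port}:8080",   # map host UI port to the container
                "-p", f"{api_port}:9080",  # map host API port to the container
                IMAGE,
            ],
            check=True,
        )
        print(f"{name}: UI on port {ui_port}, API on port {api_port}")


if __name__ == "__main__":
    launch_instances()
```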

After a very brief introduction to the tool, the judges were given up to 10 min to explore and discuss one query. Searching was stopped once a sufficiently large number of results had been collected. The results were analysed in order to understand what type of content could be found for the query, and which ambiguities and border cases might exist. Similar to the discussion session, consensus on the reformulation of queries was found and additional notes were recorded where necessary. All but one of the AVS tasks were solved by the judges in the dry-run session, which already provided a good indication that the tasks would be solvable in the competition.

Query improvements. Both sessions resulted in a number of changes to the originally proposed queries. As shown in Fig. 18, the mean length of the queries increased after each of the sessions as details and clarifications were added. Figure 19 provides details about these changes on the word level, expressed as the number of changes per query. Most changes concerned nouns: more than 1.6 noun additions/changes per query were made for AVS queries, and almost 2 for KIS-t queries. The numbers are slightly lower for adjectives and prepositions, but the pattern is similar to that for nouns. It is worth noting that while the number of changes is similar for both query types, more additions were made for KIS-t queries, which indicates a greater need to add details. Also, changes of words (typically to a more precise term or an easier-to-understand synonym) were only made in the discussion session, while after the dry-run only words were added. Examples were added for some AVS queries, but in fewer than one in five cases.
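
As a rough illustration, word-level additions and changes between two versions of a query can be counted with Python's difflib as sketched below. The tokenization, the example queries, and the omission of part-of-speech tagging (which would be needed to separate nouns, adjectives, and prepositions as in Fig. 19) are simplifications of ours; the actual analysis may have been performed differently.

```python
from difflib import SequenceMatcher


def word_level_changes(before: str, after: str) -> dict:
    """Count words added, removed, and changed between two query versions."""
    a, b = before.split(), after.split()
    counts = {"added": 0, "removed": 0, "changed": 0}
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if op == "insert":
            counts["added"] += j2 - j1
        elif op == "delete":
            counts["removed"] += i2 - i1
        elif op == "replace":
            counts["changed"] += max(i2 - i1, j2 - j1)
    return counts


# Hypothetical query before and after a refinement session
print(word_level_changes(
    "a man rides a bike",
    "a young man rides a red mountain bike on a forest trail",
))
```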

Assessment. In order to assess the effectiveness of the modified judges' briefing, we conducted an online survey among all participating team members in the week after VBS 2022. We received 20 responses, of which 17 respondents stated that they had also participated in 2021. The repeat participants were asked to compare the clarity of the KIS-t and AVS queries as well as their perception of the consistency of the judgements of their AVS submissions on a 5-point scale ranging from much worse to much better. The responses to these questions are shown in Fig. 20. Roughly one third of the respondents did not observe any changes, and one eighth found the KIS-t queries less clear than in the previous year. The majority of the respondents, however, found the clarity of the descriptions as well as the judgement consistency better or much better. It is worth noting that for AVS clarity and judgement consistency, none of the respondents reported a decrease in quality, and "much better" was chosen in some cases (which was not the case for KIS-t). We believe that this is a consequence of the dry-run, which helped both to improve the AVS queries and to ensure judging quality later on.

Fig. 18: Mean lengths of text queries: originally proposed, after discussion of the texts, and after the dry-run (applies only to AVS)

Fig. 19: Word-level changes to AVS and KIS-t queries made as a result of the discussion and of the dry-run. Adding examples only applies to AVS queries

Fig. 20: Comparison of query clarity and judgement consistency between VBS 2021 and 2022

7 Discussion and future challenges

A VBS-like evaluation is a large-scale experiment organized once per year. Its results provide a unique insight into the interactive search performance that can be expected in KIS and AVS challenges with current state-of-the-art models. Based on the observed results, we summarize our findings below and discuss future directions.

Fig. 21: Zipf’s law in a video collection

  • We start with the message that resonated most strongly (and not only within the VBS community [47]): the CLIP model and its variants represent a game-changer in cross-modal search. The approach and its near-future potential (using larger training datasets [68]) may break some assumptions that were made for challenges like VBS. This is illustrated by Fig. 21, which shows Zipf’s law for the distribution of concepts commonly observed in image datasets [81]. For known-item search tasks, there used to be two challenges: concepts with a high number of occurrences often required non-trivial additional interaction, while rare concepts were hard to find with models trained on standard general-purpose training datasets. With CLIP-based models and huge training datasets, rare concepts are now more likely to be known by content-based ranking approaches and thus often become findable with a free-form text query. Hence, one of the key remaining challenges for the future seems to be effective refinement and browsing in large clusters with many similar instances. We add that large clusters can also emerge if users do not actively remember all the details of the searched scene.

  • Based on our survey on textual task quality, the concepts users imagine are sometimes culturally dependent. These issues remain a challenge even with potentially much better ranking models and represent an interesting task for future evaluations. We note that, so far, VBS has mostly collected data from users with a Western cultural background using CLIP-based models, so the alignment of concepts between user and model may be high.

  • Human interaction was still important at this iteration of VBS, where the top-performing teams already used CLIP. Despite the overall good ranking performance of the joint embedding models, one third to one half of the issued text queries (for systems employing the CLIP model) left the searched scene outside the considered ranked results (i.e., at rank > 1000; see Table 4). On average, users were able to improve their textual queries over time (Fig. 12) and also to refine/modify the queries with different modalities (Fig. 9).

  • Another future challenge is the evaluation of the impact of users. Today’s vision-language models are sensitive to differences in wording, so their use in video retrieval systems harms retrieval consistency [10], i.e., the desirable property that a system returns consistent results for similar but differently phrased information needs. We are planning future evaluations with more users per team and more controlled settings (e.g., a predefined start query) to further analyse the impact of users versus tools.

  • Another interesting challenge for future evaluations is to focus on different types of visual data. Interactive search in various visualizations of other data types (e.g., images derived from human motion data in RGB color space) would test the generalization of the presented approaches and systems to different domains, especially in situations where a good initial text query is not available.

8 Conclusions

The paper presents findings from the eleventh iteration of the Video Browser Showdown, where 16 teams participated with their interactive video search systems. A wide panorama of video analysis and retrieval approaches was used, as described in the related work section. The results confirm the effectiveness and dominance of joint-embedding approaches, where CLIP-based models demonstrate impressive performance. The top three systems according to the VBS ranking, vibro, CVHunter, and VISIONE, were able to solve all visual known-item search tasks as well as almost all textual known-item search tasks. Considering the size of the video collection, this is a great achievement compared to the previous several years. The result logs of six teams revealed that, with multiple attempts to formulate a (mostly textual) query, the teams were able to find known items at good ranks. However, several browsing/visualization issues also emerged where teams overlooked a correct item despite its good rank. The analysis of AVS tasks did not reveal a clear winner, although the average performance of the vibro, IVIST, and VIREO teams was impressive. The analysis of agreement/disagreement with the judges revealed a positive effect of the new query preparation process, which was supported by an online survey as well. For future VBS evaluations, we plan to make visual known-item search tasks harder (e.g., shorter target segments or domain-specific collections) and to reconsider AVS tasks (searching for videos with a correct item).