1 Introduction

With the rapid growth of the Internet, the number of videos on the web is increasing at an unprecedented rate. Analyzing video spatiotemporal data allows us to exploit useful information and knowledge in a timely manner, which may further improve the effectiveness, reliability and efficiency of various video understanding tasks [1,2,3,4]. Over the past few years, much work has been devoted to video content recommendation applications based on action recognition [5, 6] or action retrieval techniques [7]. Temporal action or event discovery aims to detect or retrieve a potential action or event from an untrimmed video, yet that action must belong to a pre-defined action set or match a specific query. Recently, with the development of computer vision and natural language processing, video moment retrieval has emerged as a hot topic. Given a textual query, the goal of video moment retrieval is to locate a video segment (with starting and ending timestamps) that corresponds to the semantics of the query. Compared with temporal action detection, video moment retrieval requires models to understand the video and the query holistically and simultaneously. It has to process multi-modal spatiotemporal data and build cross-modal interaction models efficiently, and it poses many challenges and has broad application scenarios.

Many existing works [1, 8,9,10] have been proposed to tackle video moment retrieval, including proposal-based and proposal-free methods. In video moment retrieval, a proposal is a segment candidate that may correspond to the target ground truth. In the early years, video moment retrieval was typically treated as a matching problem, and proposal-based approaches tackled it in a propose-then-rank manner. These methods usually generate proposals with pre-defined sliding windows or anchors, then compute the semantic similarity between the query and each proposal. The proposal with the highest score is taken as the query result. Liu et al. [8] proposed an attentive model to emphasize the importance of the query. However, these proposal-based approaches are sensitive to the set of sliding windows or anchors. In addition, proposal evaluation requires large amounts of memory and computation. Inspired by the fast proposal generation in temporal action detection [11], Yuan et al. [12] proposed a stacked convolution block to build dense proposals with the help of semantic conditioned dynamic modulation.

To avoid the cost of enumerating candidates in proposal-based methods, many proposal-free methods have emerged. Proposal-free methods typically attempt to predict the start and end timestamps directly, without any proposal generation and ranking. The temporal modeling of video moments proposed by Yuan et al. [13] has become increasingly popular in recent years. The main difference among existing works lies in the design of the multi-modal fusion module. For example, Mun et al. [9] proposed a local-global video-text interaction approach that deeply models the semantics of phrases and video clips along the timeline. However, the challenges of spatiotemporal modeling and video context understanding remain to be explored.

In this paper, we focus on the spatiotemporal modeling and context understanding of videos. As shown in Figure 1a, given a language query, video moment retrieval aims to locate the action corresponding to the query. As shown in Figure 1b, we investigate the effectiveness of contrastive learning for video moment retrieval. Given a sentence query as the anchor, we first perform multi-modal interaction, then sample positive and negative features from the set of multi-modal features. Finally, we apply contrastive learning to narrow the semantic distance between the positive sample and the anchor, and to enlarge the semantic distance between the negative samples and the anchor. As a result, the target temporal features become more distinguishable. The main contributions of our method are as follows:

  • We propose a novel framework spatiotemporal contrastive network (STCNet) for video moment retrieval, which aims to enhance the feature representation of target temporal locations.

  • We propose a Boundary Matching (BM) sampling module for dense negative sampling. Given a query, we treat the temporal region of the ground truth as positive and sample the adjacent regions, which are not aligned with the query, as negative samples. Moreover, we use a Gaussian filter to calculate the sampling mask. We use contrastive learning to refine the discrimination of target temporal features, and it incurs no extra cost at inference.

  • We propose the Local-Global Temporal Context Module (LGTCM) to perform local-global context modeling of multi-modal features. Specifically, we use 1D convolutional layers to model the local context, and a non-local network to build long temporal dependencies for the global context.

  • Extensive experiments are conducted on three benchmark datasets, Charades-STA, ActivityNet Captions and TACoS, and demonstrate the effectiveness of the proposed method. Ablation studies and qualitative visualizations also verify the contribution of each component.

Figure 1

(a) An illustration of the video moment retrieval task. Given a video and a language query, the target is to locate a moment that semantically corresponds to the query. (b) The core idea of the proposed contrastive learning

2 Related work

In this section, we review related works on video moment retrieval, focusing on the proposal-free methods with which our method is mainly compared. Subsequently, we introduce contrastive learning and how it can be used in the task of video moment retrieval.

2.1 Video moment retrieval

Video moment retrieval (VMR) [7, 14], also called video grounding [9, 10], aims to retrieve the temporal moments in an untrimmed video that semantically correspond to a linguistic query. It plays a crucial role in the field of video understanding [10, 15,16,17]. The main solutions can be divided into the following two types.

Proposal-based methods

Gao et al. [1] first put forward the task of temporal activity localization via natural language query, which has since evolved into grounding actions and objects in videos by language. Early works scan videos with various sliding windows to generate candidate proposals, compare the proposals with the sentence query, and take the best-matched proposal as the result [1, 8, 18]. Following this manner, researchers consider the semantic correlation between the video and the query to be a vital part of the task; thus, they employ various attention mechanisms to strengthen the interaction learning between the video and the sentence query. For example, the moment alignment network (MAN) model [19] explicitly models moment-wise temporal relations as a structured graph and designs an iterative reasoning scheme to learn the relationships among candidate proposals. The Temporal GroundNet (TGN) model [20] captures fine-grained frame-by-word interactions between the video and the sentence query and generates the final grounding result by integrating the contextual information of the temporal sequence. In addition, a 2D Temporal Map [21] has been proposed to describe all possible proposals. It tackles the problem well since it takes temporal dependence into account rather than considering temporal moments individually. In other words, the 2D Temporal Map is able to predict the scores of all possible proposals.

Proposal-free methods

However, the above-mentioned proposal-based methods are restricted by computational expense and limited efficiency. Yuan et al. [13] first proposed a proposal-free approach to solve the VMR problem. This method performs direct temporal boundary regression, achieving multi-modal interaction through simple concatenation, and solves the temporal sentence localization problem from a global perspective through an attention-based location regression (ABLR) approach.

To eliminate the imbalance of training samples in boundary regression, Lu et al. [22] obtain dense positive samples by predicting offsets from the ground-truth moment center. Another direction of proposal-free methods is boundary probability regression, which aims to predict the probability curves of start and end positions over each temporal position. Considering that the variation between consecutive video frames is small and adjacent words in a query may have different meanings, the video span localizing network (VSLNet) model [23] uses contextual query attention to perform fine-grained multi-modal interactions and employs two conditional span predictors to predict the start and end boundaries of answer spans.

Beyond these two basic proposal-free frameworks, further improvements on this task can be achieved by modifying the multi-modal interaction module or introducing other advanced semantic understanding modules. For example, Gao et al. [7] replace the cross-modal interaction module with a cross-modal common space to achieve fast video moment retrieval. The adversarial bi-directional interaction network (ABIN) [24] designs an auxiliary adversarial discriminator network to generate coordinate and frame-dependent distributions for moment boundary refinement. By comparison, our work aims to enhance the discrimination of target temporal features in moment retrieval by using contrastive learning. As a result, the model locates the target segment more accurately.

2.2 Contrastive learning

Contrastive learning has been successfully applied to unsupervised representation learning in recent years. It gives the backbone network the ability to distinguish relevant and irrelevant samples by maximizing the difference between positive pairs (a sample in the input modality and its corresponding sample in the target modality) and negative pairs (randomly selected samples from the input and target modalities). Most current contrastive learning methods follow the paradigm of self-supervised feature learning. Lorre et al. [25] propose a self-supervised video representation learning method based on Contrastive Predictive Coding (CPC), which learns the long-term relationships behind the original signal sequence and predicts the latent representations of future clips in the video. He et al. [26] propose Momentum Contrast (MoCo), arguing that using more negative samples is what makes contrastive learning work better, because more negative samples are needed to effectively cover the underlying data distribution; by increasing the number of negative samples, MoCo even surpasses supervised ImageNet-pretrained counterparts on object detection in the field of visual representation learning. Most current video contrastive learning methods have similar loss functions; the difference lies in how positive and negative pairs are constructed. Sun et al. [27] propose the Contrastive Bidirectional Transformer (CBT), which uses a video clip and its masked version as a positive pair. Han et al. [28] propose dense predictive coding (DPC), which uses the predicted autoregressive features and the ground-truth features at the same spatio-temporal location as positive pairs.

3 Proposed method

The overall architecture of STCNet is shown in Figure 2. STCNet consists of feature encoding, multi-modal fusion, temporal learning and attentive regression. In this section, we first formulate the problem of video grounding in Section 3.1. Then, we use two feature encoders to obtain the video and text features in Section 3.2. Next, the outputs of the feature encoders are fused into contextual semantics by local-global context modeling in Section 3.3. The core contrastive learning is introduced in Section 3.4, followed by a regular regression module in Section 3.5, where the obtained multi-modal features are turned into vectors to predict the starting and ending times (ts, te) corresponding to the query. Finally, the loss function used in STCNet is presented in Section 3.6.

Figure 2

Overview of the proposed framework, which consists of feature encoders, multi-modal feature fusion, local-global temporal context modeling, contrastive learning and self-attentive regression

3.1 Problem formulation

Given a query sentence, the task of video grounding aims to localize the temporal moment with starting and ending times (ts, te) in the queried video that semantically corresponds to the query. We denote the visual features of a video with T segments as \(V=\{v_{1}, v_{2}, v_{3},\ldots ,v_{T}\}\in \mathbb {R}^{d_{v} \times T}\), where vi is the i-th segment feature with dimension dv. The textual features are denoted as \(Q=\{q_{1}, q_{2}, q_{3},\ldots , q_{L}\}\in \mathbb {R}^{d_{q}\times L}\), where L is the length of the query and dq is the dimension of the textual features. Given video V and query Q, we aim to learn a deep model \(\mathcal {F}\), and the corresponding queried video segment (i.e., starting time ts and ending time te) is predicted by:

$$(t^{s}, t^{e}) = \mathcal{F}(V, Q, {\Theta}),$$
(1)

where Θ is a set of parameters of the model \(\mathcal {F}\).

3.2 Feature encoder

Since this is a multi-modal understanding task, we have to handle the feature encoding of both the video and the query.

Video Encoder

Given an untrimmed video V, we first sample segments at equal intervals and extract segment-level features of a fixed length T using a pre-trained 3D network, denoted as \(F^{v}\in \mathbb {R}^{d_{v}\times T}\). Following common practice in this field, we use the feature encoder in QANet [29] to further embed the visual features. Specifically, the feature encoder for videos is composed of Positional Encoding (PE), Multi-head Self-attention (MHA), a Feed-forward Network (FFN) and LayerNorm (LN) operations. The calculation of the feature encoder is as follows:

$$\begin{aligned} F^{v^{\prime}} &=\text { VideoEncoder }\left(F^{v}\right), F^{v} \in \mathbb{R}^{d_{v} \times T}, \\ &=\left\{\begin{array}{c} \widetilde{F^{v}}=\text{PE}(\text{FC}(F^{v})); \\ \widehat{F^{v}}=\text{LN}(\text{MHA}(\widetilde{F^{v}}) + \widetilde{F^{v}}); \\ F^{v^{\prime}}=\text{LN}(\text{FFN}(\widehat{F^{v}})+\widehat{F^{v}}), \end{array}\right. \end{aligned}$$
(2)

where PE denotes the positional encoding function stated in [30], and d denotes the dimension of the feature encoding. We thus obtain the advanced visual features \(F^{v^{\prime }}\in \mathbb {R}^{d\times T}\).
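
For concreteness, the following PyTorch-style sketch shows one way to realize the video encoder in (2); the input feature dimension, the number of attention heads and the learnable positional encoding are our own assumptions rather than the exact configuration of STCNet.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """A minimal sketch of the QANet-style video encoder in (2).
    Layer sizes and the number of attention heads are assumptions."""
    def __init__(self, d_v=1024, d=512, n_heads=8, max_len=128):
        super().__init__()
        self.fc = nn.Linear(d_v, d)
        # learnable positional encoding (the paper follows [30]; sinusoidal PE would also work)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d))
        self.mha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)

    def forward(self, f_v):                                 # f_v: (B, T, d_v) segment features
        x = self.fc(f_v) + self.pos[:, : f_v.size(1)]       # PE(FC(F^v))
        attn, _ = self.mha(x, x, x)                         # multi-head self-attention
        x = self.ln1(attn + x)                              # residual + LayerNorm
        return self.ln2(self.ffn(x) + x)                    # FFN + residual + LayerNorm

# e.g. VideoEncoder()(torch.randn(2, 128, 1024)) -> shape (2, 128, 512)
```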

Query Encoder

For a query sentence with L words, we encode it with GloVe embeddings [31] and represent the word-level textual features as \(Q=\{q_{1}, q_{2}, q_{3},\ldots , q_{L}\}\in \mathbb {R}^{d_{q}\times L}\), where dq denotes the word embedding dimension. Then, we use a bi-directional LSTM [32] to encode Q into a sentence-level vector.

$$q = BiLSTM(FC(Q))\in \mathbb{R}^{d},$$
(3)

where d denotes the feature dimension.

Multi-modal Feature Fusion

To integrate the above multi-modal features of video and query, we perform a segment-level modality fusion by using the Hadamard product. The whole process is summarized as follows:

$$F^{f} = W_{f}(W_{v} F^{v^{\prime}} \odot W_{t} q)\in \mathbb{R}^{d \times T},$$
(4)

where \(W_{v}, W_{t}, W_{f}\in \mathbb {R}^{d\times d}\) are learnable embedding matrices for multi-modal feature fusion, and ⊙ is the Hadamard product operator.
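
The query encoding in (3) and the Hadamard-product fusion in (4) could be implemented roughly as below; the use of the final BiLSTM hidden states as the sentence vector and the exact layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class QueryVideoFusion(nn.Module):
    """A rough sketch of the query encoding (3) and Hadamard fusion (4)."""
    def __init__(self, d_q=300, d=512):
        super().__init__()
        self.fc_q = nn.Linear(d_q, d)
        self.bilstm = nn.LSTM(d, d // 2, batch_first=True, bidirectional=True)
        self.w_v = nn.Linear(d, d, bias=False)   # W_v
        self.w_t = nn.Linear(d, d, bias=False)   # W_t
        self.w_f = nn.Linear(d, d, bias=False)   # W_f

    def forward(self, f_v, q_words):
        # f_v: (B, T, d) encoded video features, q_words: (B, L, d_q) GloVe embeddings
        _, (h_n, _) = self.bilstm(self.fc_q(q_words))
        q = torch.cat([h_n[0], h_n[1]], dim=-1)             # (B, d) sentence-level vector
        fused = self.w_v(f_v) * self.w_t(q).unsqueeze(1)    # Hadamard product per segment
        return self.w_f(fused)                              # F^f: (B, T, d)
```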

3.3 Local-global temporal context modeling

The task of video moment retrieval aims to localize a temporal segment along the temporal dimension. Based on the obtained feature sequence \(F^{f} \in \mathbb {R}^{d \times T}\), we refine it for the final prediction. To be specific, we propose a Local-Global Temporal Context Module (LGTCM). The LGTCM first learns the local context through 1D convolutional layers, then learns the global context via a non-local block [3]. As shown in Figure 3, the local contextual modeling is formulated as follows:

$$\begin{array}{@{}rcl@{}} \widetilde{F^{f}}&=&ResBlock([{F_{1}^{f}},\cdots,{F_{T}^{f}}]);\\ &=&Conv (LN([{F_{1}^{f}},\cdots,{F_{T}^{f}}])), \end{array}$$
(5)

where \({F_{i}^{f}}\) denotes the i-th feature in Ff and ResBlock is a residual block [9] consisting of two temporal convolution layers in our work. \({F_{1}^{f}},\cdots\), and \({F_{T}^{f}}\) share the same ResBlock parameters in this calculation.

Figure 3

The network architecture of Local-Global Temporal Context Module (LGTCM)

Then, the global contextual modeling is formulated as follows:

$$\begin{array}{@{}rcl@{}} F^{h}&=&NLBlock(\widetilde{F^{f}}); \widetilde{F^{f}} \in \mathbb{R}^{d \times T}; \\ &=&\widetilde{F^{f}}+(W_{rv}\widetilde{F^{f}})softmax(\frac{(W_{rq}\widetilde{F^{f}})^{T}(W_{rk} \widetilde{F^{f}} )}{\sqrt[]{d}})^{T}, \end{array}$$
(6)

where Wrv, Wrq, \(W_{rk}\in \mathbb {R}^{d\times d}\) are learnable matrices, and NLBlock(⋅) denotes the non-local neural networks [3]. Finally, \(F^{h} \in \mathbb {R}^{d \times T}\) is the final feature sequence output by the LGTCM module.

To summarize, we first use 1D convolutions to learn the relationships between adjacent moments (local features) in the fused multi-modal features, and then use multi-head self-attention in the non-local block to model the global context. The parameter N denotes that the 1D convolution is stacked N layers, and the parameter M denotes that the LGTCM is stacked M times.
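
A minimal sketch of one LGTCM block is given below, assuming single-head non-local attention and ReLU activations; it only illustrates the local-then-global structure of (5) and (6), not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LGTCM(nn.Module):
    """An illustrative sketch of one LGTCM block: a residual 1D-conv stack for
    local context (5) followed by a non-local block for global context (6)."""
    def __init__(self, d=512, kernel_size=15, n_conv=2):
        super().__init__()
        self.ln = nn.LayerNorm(d)
        self.convs = nn.ModuleList(
            [nn.Conv1d(d, d, kernel_size, padding=kernel_size // 2) for _ in range(n_conv)]
        )
        self.w_q = nn.Linear(d, d, bias=False)   # W_rq
        self.w_k = nn.Linear(d, d, bias=False)   # W_rk
        self.w_v = nn.Linear(d, d, bias=False)   # W_rv
        self.d = d

    def forward(self, f):                         # f: (B, T, d) fused multi-modal features
        # local context: residual temporal convolutions (ResBlock in (5))
        x = self.ln(f).transpose(1, 2)            # (B, d, T) for Conv1d
        for conv in self.convs:
            x = F.relu(conv(x))
        local = f + x.transpose(1, 2)
        # global context: non-local block as scaled dot-product self-attention (6)
        q, k, v = self.w_q(local), self.w_k(local), self.w_v(local)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.d ** 0.5, dim=-1)
        return local + attn @ v                   # F^h: (B, T, d)
```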

3.4 Spatiotemporal contrastive learning

In this section, we propose spatiotemporal contrastive learning to enhance the features at the target temporal location. We first introduce boundary matching sampling, and then introduce how contrastive learning is used in the task of video moment retrieval.

3.4.1 Boundary matching (BM) sampling

Since the multi-modal features Fh contain rich information for target moment prediction, we impose an effective contrastive restriction on the features to refine them. First, we have to select positive and negative samples. In our work, given a query q as the anchor, we treat the ground-truth temporal region \([t_{s}^{gt}, t_{e}^{gt}]\) as the positive sampling range. Here, we mainly discuss the negative sampling strategy.

Inspired by the boundary matching network [33], we enlarge the possible candidate proposals around \([t_{s}^{gt}, t_{e}^{gt}]\) and propose a Boundary Matching (BM) sampling module to split these possible proposals. An instance visualization of BM sampling is given in Figure 4. To be specific, given a query q and its ground-truth temporal region \([t^{gt}_{s}, t^{gt}_{e}]\) in Figure 4(a), the temporal interval is \(d_{q}=t^{gt}_{e}- t^{gt}_{s}\). We use a sampling hyperparameter α to define two boundary windows, \([t_{s}^{gt}-\alpha \cdot d_{q}, t_{s}^{gt}]\) and \([t_{e}^{gt},t_{e}^{gt}+\alpha \cdot d_{q}]\), which are close to the ground truth but not aligned with query q. Next, we sample Nsam points uniformly in \([t^{gt}_{s} - \alpha \cdot d_{q}, t^{gt}_{s}]\) and perform the same sampling in \([t_{e}^{gt},t_{e}^{gt}+\alpha \cdot d_{q}]\). Thus, there are 2Nsam negative sampling points in total.

Figure 4

(a) Illustration of the positive and negative sampling strategy in contrastive learning. (b) Illustration of sampling mask (weight mask) \(M^{neg}_{l}\in \mathbb {R}^{N_{sam}\times T}\). (c) Illustration of the Boundary Matching (BM) sampling module

As shown in Figure 4(b), taking the n-th sampling as an example, we obtain the sampling timestamp tn and create a mask vector \(m_{n}^{neg} \in \mathbb {R}^{T}\). We use a classical Gaussian filter to diffuse the mask and obtain the weight mask \(m^{neg}_{n} \in \mathbb {R}^{T}\), which is formulated as follows:

$$m_{n}^{neg}(t)= Gaussian(t_{n}, \sigma) = \begin{cases} \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(1-dec(t_{n}))^{2}}{2\sigma^{2}}} & \text{if } t = floor(t_{n}), \\ \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{dec(t_{n})^{2}}{2\sigma^{2}}} & \text{if } t = floor(t_{n})+1, \\ \frac{1}{\sigma \sqrt{2\pi}} & \text{otherwise}, \end{cases}$$
(7)

where floor(⋅) denotes rounding down, dec(tn) denotes the fractional part of tn, σ denotes the standard deviation of the Gaussian kernel with a default value of 2.0, and n ∈ [1, Nsam]. Thus, by BM negative sampling, we obtain the weight mask \(M_{l}^{neg}=[m_{l,1}^{neg}, \cdots , m_{l,N_{sam}}^{neg}]\in \mathbb {R}^{N_{sam}\times T}\) for the left boundary window \([t^{gt}_{s} - \alpha \cdot d_{q}, t^{gt}_{s}]\) and \(M_{r}^{neg}=[m_{r,1}^{neg}, \cdots , m_{r,N_{sam}}^{neg}]\in \mathbb {R}^{N_{sam}\times T}\) for the right boundary window \([t_{e}^{gt},t_{e}^{gt}+\alpha \cdot d_{q}]\).
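
The following sketch illustrates how a BM sampling mask for one boundary window could be built with the Gaussian kernel of (7); for simplicity it leaves all positions other than the two neighbouring segments at zero, and the function name and interface are hypothetical.

```python
import math
import torch

def bm_sampling_mask(t_start, t_end, T, n_sam=32, sigma=2.0):
    """A simplified sketch of BM sampling for one boundary window [t_start, t_end]
    (in segment units): sample n_sam points uniformly and diffuse each point into a
    weight mask over the T segments with the Gaussian kernel of (7)."""
    points = torch.linspace(t_start, t_end, n_sam)           # N_sam sampling timestamps
    mask = torch.zeros(n_sam, T)
    coef = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    for n, t_n in enumerate(points):
        lo = int(math.floor(t_n.item()))
        dec = t_n.item() - lo                                 # fractional part dec(t_n)
        if 0 <= lo < T:
            mask[n, lo] = coef * math.exp(-((1.0 - dec) ** 2) / (2.0 * sigma ** 2))
        if 0 <= lo + 1 < T:
            mask[n, lo + 1] = coef * math.exp(-(dec ** 2) / (2.0 * sigma ** 2))
    return mask                                               # (N_sam, T) weight mask M^neg

# e.g. left window for a ground truth [40, 60] (d_q = 20 segments) with alpha = 0.25:
# mask_l = bm_sampling_mask(40 - 0.25 * 20, 40, T=128)
```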

3.4.2 Contrastive learning

In this work, we perform contrastive learning at the level of positive and negative regions. To facilitate the calculation, we aggregate the features of each positive and negative region.

For the positive sample, given the target segment label \([t_{s}^{gt}, t_{e}^{gt}]\), we directly aggregate the multi-modal features \(F^{h}\in \mathbb {R}^{d\times T}\) within \([t_{s}^{gt}, t_{e}^{gt}]\) by element-wise summation. Thus, for each query, we obtain the feature of the sole positive sample \(v^{pos}\in \mathbb {R}^{d}\).

For the negative samples, as shown in Figure 4(c), there are two negative sampling windows, \([t_{s}^{gt}-\alpha \cdot d_{q},t_{s}^{gt}]\) and \([t_{e}^{gt},t_{e}^{gt}+\alpha \cdot d_{q}]\), which are semantically unaligned with the query qi. For each window, we sample Nsam times, so we obtain 2Nsam negative features in total. For the left negative sampling window \([t_{s}^{gt}-\alpha \cdot d_{q},t_{s}^{gt}]\), the negative features are calculated as follows:

$$\begin{array}{@{}rcl@{}} M^{neg}_{l}&=& BM Sampling([t_{s}^{gt} -\alpha \cdot d_{q}, t_{s}^{gt}]) \in \mathbb{R}^{N_{sam}\times T} \end{array}$$
(8a)
$$\begin{array}{@{}rcl@{}} V^{neg}_{l}&=& M^{neg}_{l}\cdot F^{h}\in \mathbb{R}^{d\times N_{sam}}, \end{array}$$
(8b)

where Nsam is the sampling number and α is the control parameter of sampling area.

Analogously to (8a) and (8b), the right negative features are calculated as follows:

$$\begin{array}{@{}rcl@{}} M^{neg}_{r}&=& BM Sampling([t_{e}^{gt}, t_{e}^{gt}+\alpha \cdot d_{q}]) \in \mathbb{R}^{N_{sam}\times T} \end{array}$$
(9a)
$$\begin{array}{@{}rcl@{}} V^{neg}_{r}&=& M^{neg}_{r}\cdot F^{h}\in \mathbb{R}^{d\times N_{sam}}, \end{array}$$
(9b)

In a nutshell, we obtain the sole positive sample and 2Nsam negative samples, and the loss optimization is introduced in Section 3.6.
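
A simplified sketch of this region-level aggregation for a single query is given below; the batch-free, time-first tensor layout and the integer ground-truth indices are assumptions made for readability.

```python
import torch

def aggregate_contrastive_features(f_h, mask_l, mask_r, t_s, t_e):
    """A hedged sketch of the feature aggregation in Section 3.4.2.
    f_h: (T, d) multi-modal features for one query, mask_l / mask_r: (N_sam, T)
    BM sampling masks, [t_s, t_e]: ground-truth segment indices (assumed integers)."""
    v_pos = f_h[t_s : t_e + 1].sum(dim=0)              # (d,) element-wise sum over GT region
    v_neg_l = mask_l @ f_h                             # (N_sam, d), M^neg_l applied to F^h as in (8b)
    v_neg_r = mask_r @ f_h                             # (N_sam, d), M^neg_r applied to F^h as in (9b)
    v_neg = torch.cat([v_neg_l, v_neg_r], dim=0)       # 2*N_sam negative features
    return v_pos, v_neg
```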

3.5 Self-attentive regression

Thanks to the above contrastive sampling paradigm, the multi-modal features are further constrained and refined. In this part, we apply a self-attentive regression module on the multi-modal features Fh to predict the temporal boundaries. Specifically, a two-layer MLPtemporal computes the attention weights m of the cross-modal features over the temporal dimension. We use m to obtain the fused vector Z. After that, a two-layer MLPreg predicts the starting and ending timestamps (ts, te) corresponding to the query. The whole process is formulated as follows:

$$\begin{array}{@{}rcl@{}} m&=&Softmax(MLP_{temporal}(F^{h}))\in \mathbb{R}^{T} ; \end{array}$$
(10a)
$$\begin{array}{@{}rcl@{}} Z&=&{\sum\limits_{i=1}^{T}}m_{i}\,{F_{i}^{h}} \in \mathbb{R}^{d}; \end{array}$$
(10b)
$$\begin{array}{@{}rcl@{}} (t^{s}, t^{e}) &=& MLP_{reg}(Z) \in \mathbb{R}^{2}. \end{array}$$
(10c)
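
The regression head of (10a)-(10c) could look roughly as follows; the hidden widths of the two MLPs are assumptions.

```python
import torch
import torch.nn as nn

class SelfAttentiveRegression(nn.Module):
    """A minimal sketch of the self-attentive regression head in (10a)-(10c)."""
    def __init__(self, d=512):
        super().__init__()
        self.mlp_temporal = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))
        self.mlp_reg = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 2))

    def forward(self, f_h):                                             # f_h: (B, T, d)
        m = torch.softmax(self.mlp_temporal(f_h).squeeze(-1), dim=-1)   # (B, T) attention weights
        z = torch.einsum('bt,btd->bd', m, f_h)                          # attention-pooled vector Z
        ts_te = self.mlp_reg(z)                                         # (B, 2) predicted (t^s, t^e)
        return ts_te, m                                                 # m is also supervised by (13)
```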

3.6 Loss optimization

To optimize the proposed model, we design a multi-task loss \({\mathcal{L}}\). The total objective function is:

$$\mathcal{L} = \mathcal{L}_{reg} + \mathcal{L}_{att} + \mathcal{L}_{ctr},$$
(11)

where \({\mathcal{L}}_{reg}\) denotes the regression loss term, which directly evaluates the predicted starting and ending timestamps (ts, te); \({\mathcal{L}}_{att}\) denotes the attention loss term, which aligns the self-attention pooling weights m with the location label along the temporal dimension; and \({\mathcal{L}}_{ctr}\) denotes the InfoNCE loss used for contrastive learning.

Concretely, the loss function of \({\mathcal{L}}_{reg}\) is calculated as follows:

$$\mathcal{L}_{reg} = SL_{1}(t_{s}^{gt}-t_{s})+SL_{1}(t_{e}^{gt}-t_{e}),$$
(12)

where SL1 is Smooth L1 loss, \((t_{s}^{gt}, t_{e}^{gt})\) denote the ground-truth starting and ending timestamps, and (ts,te) denote the predicted starting and ending timestamps.

The loss function \({\mathcal{L}}_{att}\) is formulated as follows:

$$\mathcal{L}_{att} = -\frac{{\sum}_{i=1}^{T}\hat{m}_{i} \log(m_{i})}{{\sum}_{i=1}^{T}\hat{m}_{i}},$$
(13)

where mi is the i-th element of m in (10a). Here, \(\hat {m}\in \mathbb {R}^{T}\) is the ground-truth label over the temporal dimension, with the value of 1 inside the target temporal interval and 0 otherwise.

The loss function \({\mathcal{L}}_{ctr}\) is formulated as follows:

$$\mathcal{L}_{ctr} = -\frac{1}{B}\sum\limits_{i=1}^{B}\log \frac{e^{s(q_{i}, v^{pos}_{i})/\tau}}{{e^{s(q_{i}, v^{pos}_{i})/\tau}}+{\sum}_{n=1}^{2N_{sam}}{e^{s(q_{i}, v^{neg}_{i, n} )/\tau}}},$$
(14)

where s(⋅,⋅) denotes cosine similarity, B denotes the batch size, i indexes the queries in the batch, and τ is the temperature hyperparameter. qi represents the sentence-level vector of the i-th query. For the i-th query, \(v^{pos}_{i}\) represents the aggregated feature of the positive sample in the target location interval, and \(v^{neg}_{i,n}\) denotes the n-th negative feature coming from \(V^{neg}_{l}\) and \(V^{neg}_{r}\).
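
As an illustration, the contrastive term in (14) can be computed with a standard InfoNCE formulation, e.g. as in the sketch below; the tensor shapes and the default temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, v_pos, v_neg, tau=0.1):
    """An illustrative InfoNCE loss following (14).
    q: (B, d) sentence vectors, v_pos: (B, d) positive region features,
    v_neg: (B, 2*N_sam, d) negative region features; tau is the temperature."""
    s_pos = F.cosine_similarity(q, v_pos, dim=-1) / tau                  # (B,)
    s_neg = F.cosine_similarity(q.unsqueeze(1), v_neg, dim=-1) / tau     # (B, 2*N_sam)
    logits = torch.cat([s_pos.unsqueeze(1), s_neg], dim=1)               # positive pair at index 0
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)   # = -log softmax of the positive pair, averaged over B
```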

4 Experiments

Extensive experiments have been conducted on three benchmark datasets to evaluate the proposed method and to test the effectiveness of contrastive learning in this task. In this section, we first introduce the experimental setup, including datasets, evaluation metrics and implementation details. Then, we compare and analyze against state-of-the-art methods. We also present an ablation study to investigate the contribution of each component in the proposed framework.

4.1 Experimental setup

4.1.1 Datasets

Following previous works [1, 12, 13], we experiment on three public benchmark datasets: Charades-STA [1], ActivityNet Captions [34] and TACoS [35]. Charades-STA [1] contains 6,672 daily-life videos with an average duration of 30.59 seconds. Each video has around 2.4 annotated moments, and the average duration of a moment is 8.2 seconds. The dataset contains 16,128 query-clip pairs and is split into training and testing parts with 12,408 and 3,720 pairs, respectively.

ActivityNet Captions (abbreviated as ANet-Captions) [34] is a benchmark dataset originally proposed for dense video captioning, built upon the ActivityNet [36] dataset. Compared with Charades-STA [1], ANet-Captions is more challenging due to two properties: it contains longer videos, and its queries are often more complicated. The ANet-Captions dataset consists of 20K videos along with 100K language queries; on average, each video is annotated with 2.5 queries. Since the "test" set is not publicly released, in this paper we adopt the setting of [37]: "train" for training, "val 1" for validation, and "val 2" for testing. Thus, the dataset is split into training/validation/testing sets of 37,421, 17,505 and 17,031 query-clip pairs, respectively.

TACoS [35] consists of 175 videos collected from cooking scenarios. The average duration of each video is 4.79 minutes, and each video has 178 queries on average. Compared with the Charades-STA and ANet-Captions datasets, TACoS has much denser queries per video, which makes it more challenging. The TACoS dataset consists of 10,146, 4,589 and 4,083 query-clip pairs for training, validation and testing, respectively. The detailed statistics of the three datasets are listed in Table 1.

Table 1 The statistics of three public video grounding datasets. #Anns denotes the number of query-moment pairs, Lvid denotes the average length of videos, Lquery denotes the average length of queries, and Lmoment denotes the average length of queried moments

4.1.2 Evaluation metrics

Following previous works [1, 10] on video moment retrieval, we adopt "R@N, IoU@θ" as the evaluation metric. The metric "R@N, IoU@θ" [1, 13] calculates the percentage of samples for which at least one of the top-N predicted segments has a temporal Intersection over Union (tIoU) with the ground truth larger than the threshold θ. The higher the value, the more accurate the prediction of the model. Since the proposed method is proposal-free, we report all results at R@1 and abbreviate the metric as "IoU@θ" in the following tables. Besides, "mIoU" denotes the average tIoU over all test queries. The thresholds θ are set to {0.5, 0.7} on Charades-STA and ANet-Captions, and {0.3, 0.5} on TACoS.

$$R@N, IoU@\theta=\frac{1}{N_{q}}\sum\limits_{i=1}^{N_{q}} r(N, \theta, q_{i}),$$
(15)

where qi is the i-th test query, r(N, θ, qi) indicates whether any of the top-N predicted segments for qi has a tIoU [1] with the ground truth larger than θ, Nq is the number of test queries, and θ is the pre-set threshold.
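
For clarity, the sketch below computes "R@1, IoU@θ" following the definition above; the function interface is hypothetical.

```python
def recall_at_1(predictions, ground_truths, theta=0.5):
    """A sketch of "R@1, IoU@theta": the percentage of queries whose top-1 prediction
    has temporal IoU above theta. predictions / ground_truths: lists of (start, end) in seconds."""
    def tiou(p, g):
        inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
        union = max(p[1], g[1]) - min(p[0], g[0])
        return inter / union if union > 0 else 0.0
    hits = sum(tiou(p, g) > theta for p, g in zip(predictions, ground_truths))
    return 100.0 * hits / len(predictions)

# mIoU would be the mean of tiou(p, g) over all test queries.
```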

4.1.3 Implementation details

For fair comparison, we use the C3D network [38] to extract visual features for the ANet-Captions dataset, the C3D [38] and I3D [5] networks for the Charades-STA dataset, and the VGG network [39] for the TACoS dataset. To facilitate model training, we uniformly sample segments from each video with a fixed length T = 128. As for language features, we first transform all words in each query to lowercase and extract GloVe word embeddings [31] with a dimension of 300. Both the visual and textual features are linearly mapped into 512-dim vectors. For training, we use the Adam optimizer [40] with a learning rate of 1e − 4 and a batch size of 100. For contrastive learning, the number of sampling points Nsam in BM sampling is set to 32. For the LGTCM, the kernel size of the 1D convolutional layers is set to 15, and the parameters N and M are both set to 2. The temperature parameter τ in the contrastive loss term is set to 1e − 7. Note that contrastive learning is only engaged in the training process and consumes no extra memory at inference.
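
For reference, the hyperparameters described above can be summarized in a single configuration, e.g. as in the sketch below; the field names are our own and the values simply restate the text.

```python
# A compact summary of the training configuration described above
# (field names are assumptions; values follow the text).
config = dict(
    num_segments=128,        # fixed video length T
    word_dim=300,            # GloVe embedding size
    hidden_dim=512,          # shared feature dimension d
    optimizer="Adam",
    learning_rate=1e-4,
    batch_size=100,
    n_sam=32,                # BM sampling points per boundary window
    conv_kernel_size=15,     # 1D convolution kernel in LGTCM
    n_conv_layers=2,         # N
    n_lgtcm_blocks=2,        # M
    tau=1e-7,                # temperature in the contrastive loss
)
```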

4.2 Comparison with state-of-the-arts

We compare the proposed STCNet with the following state-of-the-art methods: 1) Proposal-based methods: CTRL [1], MCN [41], ACRN [8], MAC [42], CMIN [43], SCDM [12], TGN [20], CBP [44]; 2) Proposal-free methods: ABLR [13], LGVTI [9], BPNet [45], CPNet [10], PMI-LOC [46].

The experimental results on the Charades-STA [1], ANet-Captions [34] and TACoS [35] datasets are listed in Tables 2, 3 and 4, where the best result in each column is highlighted in bold. From Table 2 on the Charades-STA dataset, we can see that with C3D features, although the results of our method are not the best, they are competitive with the current state-of-the-art methods; the same holds for the results with I3D features. The results on the ANet-Captions dataset are summarized in Table 3: our method outperforms CPNet by 0.22% and 0.38% at R@0.5 and mIoU, respectively, and the superiority in mIoU is more obvious than that in IoU. Table 4 compares the performances on the TACoS dataset, whose videos are collected from cooking scenarios. Compared with VSLNet [23], our method achieves improvements of 1.39%, 9.23% and 2.14% at R@0.5, R@0.3 and mIoU, respectively.

Table 2 Comparison results with state-of-the-art methods on Charades-STA dataset
Table 3 Performance comparison with state-of-the-art methods on ANet-Captions dataset
Table 4 Comparison results with state-of-the-art models on TACoS dataset

4.3 Ablation study

4.3.1 Main components of STCNet

In this subsection, we conduct an ablation study of each component of STCNet. We test several variants of our model: 1) STCNet: our complete model with both local-global context modeling and contrastive learning, including all loss terms; 2) w/o contrastive: STCNet without the contrastive learning module and contrastive loss term; 3) w/o local: STCNet without the local contexts in the local-global context modeling module; 4) w/o global: STCNet without the global contexts in the local-global context modeling module. The ablation studies are conducted on the ANet-Captions dataset.

The results of the ablation experiments are shown in Table 5. Compared with the full STCNet, the R@0.3, R@0.5, R@0.7 and mIoU of "w/o contrastive" decrease by 4.88%, 3.76%, 2.93% and 3.16%, respectively. This severe performance degradation demonstrates the importance of the contrastive learning module in STCNet: the contrastive learning of positive and negative samples better enhances the target temporal representation for answer prediction.

Table 5 Ablation studies of main components in our approach on ANet-Captions dataset

For the ablation study of local context modeling, the performance of "w/o local" drops by a large margin (e.g., mIoU from 41.03 to 38.10). These results show that it is not enough to merely rely on the global contexts of multi-modal features; the fine-grained relationships between multi-modal features must also be modeled to achieve accurate moment retrieval. For global context modeling, the performance of "w/o global" also decreases significantly (e.g., mIoU from 41.03 to 37.83). These results show that a model relying only on local context modeling pays more attention to local information and thus ignores the overall semantics. In a nutshell, the results of "w/o local" and "w/o global" prove the effectiveness of the proposed LGTCM module, which first uses 1D convolutions to learn the relationships between adjacent moments (local features) in the multi-modal features and then uses multi-head self-attention to model the global context.

4.3.2 Ablation study for contrastive learning

We also analyze the role of two hyperparameters in the contrastive learning module: the temperature hyperparameter (τ) and the contrastive sampling hyperparameter (α). Tables 6 and 7 show their impacts on the ANet-Captions dataset. Table 6 lists the results for τ ∈ {0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2}, from which we can see that each metric shows an upward trend as τ increases. This is because τ controls the discrimination capability of the model between positive and negative samples, as stated in [56]. α is the control parameter of the sampling area, and its impact is shown in Table 7. There is no clear linear trend, so we set an empirically optimal value of α for each dataset. In this way, the positive and negative samples around the target temporal interval are exploited more reasonably, which further improves the performance of the model.

Table 6 Ablation results of the temperature hyperparameter τ in the contrastive loss on ANet-Captions dataset
Table 7 Ablation results of the contrastive sampling hyperparameter α on ANet-Captions dataset

4.4 Qualitative results

Figure 5 shows two examples selected from the ANet-Captions dataset. Compared with several variants of our method, the full STCNet predicts the starting timestamp (ts) and the ending timestamp (te) of the queried moment more accurately. For example, in example Q1, "w/o contrastive" predicts ts and te 5.26 seconds and 2.39 seconds earlier than STCNet, respectively; in example Q2, its predictions are 11.05 and 18.17 seconds earlier, respectively. As mentioned previously, our contrastive strategy focuses on representation learning between similar but different instances and enlarges the representational difference between non-similar instances. In addition, the performance drops noticeably without local or global context modeling: the model then ignores the adjacent temporal changes or the whole story-line and cannot predict accurate boundaries. For example, the video in example Q2 shows a man in a blue shirt with no obvious scene changes, so it is hard for the model to identify where the queried action begins; it fails to locate the target and only predicts a segment almost as long as the video.

Figure 5

Qualitative results on ANet-Captions dataset

5 Conclusion

In this paper, we propose a spatiotemporal contrastive learning approach named STCNet for video moment retrieval. Based on the feature encoding and fusion of the video and query, we first perform local-global contextual modeling of the multi-modal features, and then use a spatiotemporal contrastive learning module to enhance the target temporal feature representation. Experiments on Charades-STA, ActivityNet Captions and TACoS validate the effectiveness of our approach.