Introduction

Because the ambient noise cross-correlation does not depend on the location of the source and can be collected continuously, related methods are widely used in Earth science research. They have achieved many meaningful results in ambient noise tomography (Shapiro et al. 2005; Lin et al. 2008; Chen et al. 2018; Zhang et al. 2020a, b), temporal variations of surface wave velocity (Grêt et al. 2006; Wegler and Sens-Schönfelder 2007; Brenguier et al. 2008) and source location (Xie et al. 2020). The travel time of the ambient noise cross-correlation plays an important role in these studies. In particular, when noise cross-correlations are used to verify seismic instrument performance, accurate travel time picking is crucial (Stehly et al. 2007; Gouédard et al. 2014; Ye et al. 2018). There are many methods to measure travel time (Stehly et al. 2007; Tsai 2009; Djebbi and Alkhalifah 2014), but these methods are either difficult to apply to data with low signal-to-noise ratio (SNR), rely on assumptions that do not hold in many realistic situations, or require a large amount of complex computation. For example, the dispersion curve of the surface wave can be used to obtain the travel time of the ambient noise cross-correlation in most cases, but it is difficult to pick an accurate travel time when the frequency band is narrow and the frequency is high. This problem becomes even more prominent when the SNR of the ambient noise is low. For a correlation with high SNR, the travel time is the position of the maximum amplitude, but for low SNR data this peak can be disturbed or even submerged by noise. Stacking short-term noise cross-correlation functions over long time periods can improve the SNR, but the longer the stacking period, the lower the temporal resolution (Yang et al. 2022). The methods proposed by Djebbi and Alkhalifah (2014) and Tsai (2009) use explicit formulas to obtain accurate travel times, but depend on assumptions that can fail in many realistic situations. A more general and effective method that can pick travel times accurately is therefore needed.

In recent years, deep learning has been widely used in seismology. Many deep learning methods have been applied to the study of seismic waves (Perol et al. 2018; Viens and Van Houtte 2020; Zhang et al. 2020a, b; Song et al. 2021), but there are few models for ambient noise. In particular, convolutional neural networks (CNN) and recurrent neural networks (RNN) are widely used in earthquake location (Huang et al. 2018; Mousavi et al. 2019), seismic wave denoising (Zhu et al. 2019; Novoselov et al. 2022) and seismic phase picking (Zhou et al. 2019; Zhu et al. 2019; Chai et al. 2020). Among these applications, seismic phase picking is the most similar to our task, but phase-picking models are not suitable for travel time measurement. Unlike earthquake seismic waves, ambient noise consists mostly of elastic surface waves. Seismic phase picking extracts the first arrival of the body wave, whereas travel time picking targets the arrival of the surface wave. Compared with body waves, surface waves travel more slowly through material at the Earth's surface and have lower frequencies, larger amplitudes and longer wavelengths. Besides the differences between seismic waves and ambient noise cross-correlations, there are also problems with applying CNNs and RNNs to seismic data. For a CNN, the receptive field is limited by the kernel size, so most CNN models stack many layers to expand the receptive field, which increases the number of parameters. RNNs are difficult to train when applied to long sequences; therefore, many RNNs for phase picking aggregate multiple points into one to shorten the seismic waveform, which causes a loss of information. The Transformer proposed by Vaswani et al. (2017), which relies entirely on the self-attention mechanism, overcomes the long-range dependence problem of RNNs, and self-attention can be regarded as a kind of global receptive field. We therefore combine a CNN and a Transformer to pick the travel time. When a sequence is fed into a Transformer, each time step should be a vector rather than a scalar. Thus, a CNN module can be used to increase the number of channels of the cross-correlation, turning each time step from a scalar into a vector, and it can also appropriately shorten the sequence. The core of the whole model is still the Transformer.

Our basic idea is to train a combined CNN and Transformer model on a large dataset to pick the travel time of the ambient noise cross-correlation. There are two CNN blocks: the first expands the number of channels and shortens the sequence, and the second restores the sequence length; both also perform preliminary feature extraction. To provide more initial information, we add a second channel to the original sequence. After the positive and negative cross-correlation lags are input into the model, the output is a vector of probabilities, where each value between 0 and 1 indicates the probability that the corresponding sample in the cross-correlation is the travel time. The position of the maximum value in the probability vector is the travel time. We expect our model to be applicable to cross-correlations from different regions, to pick travel times that are submerged by noise, and to yield positive and negative travel times that agree for most cross-correlations. To investigate the applicability of our method, we apply the trained model to another ambient noise cross-correlation dataset, Global Empirical Greens Tensors, and it shows good generalization ability.

Data and method

Data

We used an ambient noise cross-correlation dataset from the Incorporated Research Institutions for Seismology (IRIS), ANCC-CIEI (see Data and Resources), which contains data from 622 USArray Transportable Array stations west of 105°W longitude (Fig. 1) between 2005/1/1 and 2010/12/31. There are 171,120 ambient noise cross-correlation waveforms (in SAC format) in the dataset. Four traditional ambient noise data processing procedures were used to produce these correlations: (1) single station data preparation, (2) cross-correlation and temporal stacking, (3) measurement of dispersion curves and (4) quality control. The sampling interval of the cross-correlations is 1.0 s, the number of samples is 7201, the lag length is 3600 s, and the bandpass is the 15–50 s period band. We define SNR as the ratio of the peak signal within the signal window to the rms noise in the trailing noise window (Bensen et al. 2007). With this definition, we calculated the average SNR of each cross-correlation and found that 57.9% had SNR larger than 30, 29.9% had SNR between 15 and 30, and 12.2% had SNR less than 15.
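As an illustration, the sketch below computes this SNR definition for a single one-sided lag. The signal window bounds, derived from an assumed group-velocity range (vmin, vmax), and the trailing noise window length are placeholders, not values taken from the original processing.

```python
import numpy as np

def snr(lag, dt, dist, vmin=2.0, vmax=5.0, noise_len=500):
    """Peak amplitude in the signal window divided by the rms of a
    trailing noise window (after Bensen et al. 2007).

    lag        : one-sided cross-correlation lag (numpy array)
    dt         : sampling interval in seconds
    dist       : interstation distance in km
    vmin, vmax : assumed surface wave group-velocity bounds in km/s (placeholders)
    noise_len  : length of the trailing noise window in samples (placeholder)
    """
    t = np.arange(lag.size) * dt
    signal_win = (t >= dist / vmax) & (t <= dist / vmin)   # expected arrival window
    peak = np.abs(lag[signal_win]).max()                   # peak signal amplitude
    noise = lag[-noise_len:]                               # trailing noise window
    rms = np.sqrt(np.mean(noise ** 2))
    return peak / rms
```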

Fig. 1

Map of 622 USArray virtual array stations west of 105° W for which cross-correlations are calculated

We randomly selected 80,000 cross-correlations from the dataset, of which 60,000 had SNR larger than 15 and the other 20,000 had SNR less than 15. For data augmentation, we created 20,000 additional waveforms by adding random Gaussian white noise to 20,000 of these correlations. Among the resulting 100,000 waveforms, 95,000 randomly chosen ones are used for training and the remaining 5000 for validation. We selected 2000 ambient noise cross-correlations from another dataset (Global Empirical Greens Tensors, see Data and Resources) to test the performance of the model after training. The final dataset therefore contains 102,000 cross-correlations, split into a training set (93%), a validation set (5%) and a test set (2%).

Data with more initial information can help the model perform better, so we calculated the short-term average/long-term average (STA/LTA) ratios of the cross-correlations. The recursive STA/LTA algorithm, which is faster to compute and more sensitive to signal onsets than the traditional STA/LTA algorithm, is used here. The recursive STA/LTA ratio is calculated as follows:

$${\text{CF}}_{i} = A_{i}^{2} - A_{i - 1} \cdot A_{i + 1}$$
(1)
$${\text{STA}}_{i} = {\text{STA}}_{i - 1} + \frac{{\text{CF}}_{i} - {\text{STA}}_{i - 1}}{N_{s}}$$
(2)
$${\text{LTA}}_{i} = {\text{LTA}}_{i - 1} + \frac{{\text{CF}}_{i - N_{s} - 1} - {\text{LTA}}_{i - 1}}{N_{l}}$$
(3)
$$R_{i} = \frac{{\text{STA}}_{i}}{{\text{LTA}}_{i}},$$
(4)

where \(A_{i}\) is the amplitude of the \(i\)th sample, \({\text{CF}}_{i}\) is the characteristic function at the \(i\)th sample, and \(N_{s}\) and \(N_{l}\) are the short-term and long-term averaging windows in numbers of samples, respectively. Here, \(N_{s}\) is 100 and \(N_{l}\) is 1200.
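A minimal NumPy sketch of Eqs. (1)–(4) with \(N_{s}=100\) and \(N_{l}=1200\) is given below; the handling of the earliest samples (left at zero, consistent with the normalization discussion that follows) is an implementation assumption.

```python
import numpy as np

def recursive_sta_lta(a, ns=100, nl=1200):
    """Recursive STA/LTA ratio following Eqs. (1)-(4).

    a  : amplitude series (numpy array)
    ns : short-term window length in samples (N_s)
    nl : long-term window length in samples (N_l)
    Returns R with the first nl samples left at zero, as stated in the text.
    """
    n = a.size
    cf = np.zeros(n)
    cf[1:-1] = a[1:-1] ** 2 - a[:-2] * a[2:]                      # Eq. (1)
    sta = np.zeros(n)
    lta = np.zeros(n)
    r = np.zeros(n)
    for i in range(nl, n):
        sta[i] = sta[i - 1] + (cf[i] - sta[i - 1]) / ns           # Eq. (2)
        lta[i] = lta[i - 1] + (cf[i - ns - 1] - lta[i - 1]) / nl  # Eq. (3)
        r[i] = sta[i] / lta[i] if lta[i] > 0 else 0.0             # Eq. (4)
    return r
```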

Because our method picks the travel times of the positive and negative correlation lags separately, we select only the positive or the negative part of each cross-correlation as training data. The positive parts of 50,000 cross-correlations are selected, and the negative parts of the other 50,000 are selected. The STA/LTA curves corresponding to the negative and positive parts should represent the same portion of the whole curve, so the starting positions of the STA/LTA calculation differ for the two parts. The STA/LTA ratio of the positive part is calculated from the right side of the cross-correlation, and we take its right half (the part between the two dashed lines in Fig. 2b). The STA/LTA ratio of the negative part is calculated from the left side of the cross-correlation, and we take its left half (the part between the two dashed lines in Fig. 2c). With this calculation and selection, both the positive and negative parts correspond to the first half of the overall STA/LTA curve. Since \(N_{l}\) is 1200, the STA/LTA ratios of the first 1200 samples are zero by default. To eliminate the influence of these zero values during data normalization, we remove them together with the corresponding samples of the positive or negative part (Fig. 2a). We then combine the positive or negative correlation lag and its corresponding STA/LTA ratio into two-channel data.

Fig. 2

a The cross-correlation, b the STA/LTA ratio corresponding to the positive correlation time, c the STA/LTA ratio corresponding to the negative correlation time. For the positive part used for training, take the part between the middle dashed line and the right dashed line in a, and the corresponding STA/LTA ratio is the part between the two dashed lines in b. For the negative part used for training, take the part between the middle dashed line and the left dashed line in a, and the corresponding STA/LTA ratio is the part between the two dashed lines in c

We use a triangular form to label the travel times. In this form, the probability at the travel time is set to 1 and decreases linearly to 0 over the 20 samples before and the 20 samples after the travel time (Fig. 3c). Finally, there are 100,000 waveforms for training, and each has two channels: (1) the positive or negative correlation lag (Fig. 3a); (2) the corresponding recursive STA/LTA ratio (Fig. 3b).
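For clarity, a small sketch of this triangular labeling is given below; the function name and the sample-index input are illustrative only.

```python
import numpy as np

def triangular_label(n_samples, travel_time_idx, half_width=20):
    """Triangular probability label: 1 at the travel time sample, decreasing
    linearly to 0 over `half_width` samples on each side, 0 elsewhere."""
    label = np.zeros(n_samples)
    for k in range(-half_width, half_width + 1):
        i = travel_time_idx + k
        if 0 <= i < n_samples:
            label[i] = 1.0 - abs(k) / half_width
    return label
```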

Fig. 3

Data used for training and the label. a the first channel of data: the positive or negative part of the cross-correlation, b the second channel of data: the corresponding recursive STA/LTA ratio, c the label

Method

A CNN is a kind of feedforward neural network with a deep structure. CNNs, consisting of convolutional layers, activation layers and pooling layers, have powerful feature extraction capabilities and are mainly used for multi-channel data processing. Compared with fully connected networks, CNNs use a parameter-sharing scheme, which greatly reduces the number of trainable parameters and thus allows CNNs to go deeper to learn the complex relationship between inputs and outputs. Many classic CNNs (such as VGG and ResNet) are mainly used for image recognition and object localization and have achieved excellent performance (Simonyan and Zisserman 2014; He et al. 2016).

The Transformer proposed by Vaswani et al. (2017) is the state-of-the-art sequence model in the natural language processing (NLP) field. It relies entirely on the self-attention mechanism, whose role is to find the dependencies between different samples (time steps) in a sequence. Self-attention assigns a weight to each time step, so that the model can pay more attention to the key parts of the data. The Transformer avoids the long-range dependence and training difficulties of RNNs, and its performance can match or even exceed them. The Transformer consists of an encoder and a decoder, designed for discriminative tasks and generative tasks, respectively. The encoder has three identical inputs: Query (Q), Key (K), and Value (V). The output is the self-attention computed from Q, K, and V as follows:

$${\text{Attention}}\left( Q, K, V \right) = {\text{softmax}}\left( \frac{QK^{T}}{\sqrt{d_{k}}} \right)V,$$
(5)

where \(d_{k}\) is the dimension of the vector at each time step.

The encoder and decoder are composed of multiple identical layers, and each layer has two sub-layers. The first sub-layer is a multi-head self-attention mechanism, which performs different linear transformations on Q, K, and V through h heads, and then concatenates the h attention heads:

$${\text{MultiHead}}\left( Q, K, V \right) = {\text{Concat}}\left( {\text{head}}_{1}, \ldots, {\text{head}}_{h} \right)W^{O},$$
(6)

in which \({\text{head}}_{i} = {\text{Attention}}\left( QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V} \right)\). The second sub-layer is a fully connected feedforward network, which applies a nonlinear transformation to the multi-head attention output.
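The following hedged PyTorch sketch illustrates Eqs. (5) and (6). The number of heads (4) and the token dimension (32, matching the encoder input described below) are assumptions for the example; in practice, PyTorch's nn.MultiheadAttention implements the multi-head computation directly.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Eq. (5): softmax(Q K^T / sqrt(d_k)) V for tensors of shape
    (batch, seq_len, d_k). Returns the output and the attention weights."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)        # attention matrix
    return weights @ v, weights

# Multi-head attention (Eq. 6) via the built-in layer; batch_first=True
# expects inputs of shape (batch, seq_len, d_model).
mha = torch.nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
x = torch.randn(8, 600, 32)                    # (batch, tokens, channels)
out, attn = mha(x, x, x)                       # self-attention: Q = K = V = x
```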

Our model combines the powerful feature extraction capability of CNNs with the sequence processing capability of the Transformer. The CNN is used to expand the number of channels of the sequence and to shorten or restore the length of the cross-correlation. The Transformer processes the sequence and uses the self-attention mechanism to make the model pay more attention to the travel time part of the cross-correlation. We use only the encoder of the Transformer because its output (the self-attention representation) is what we need. Our model consists of four parts: a downsampling block (CNN), an attention block (the Transformer encoder), an upsampling block (CNN) and an output block (fully connected network). The specific parameters and changes of data shape are shown in Table 1. The input shape of the model is (n, 2, 2400), corresponding to n inputs with two channels and 2400 samples per channel. The first channel is the positive or negative correlation lag, and the second channel is the corresponding recursive STA/LTA ratio. The output of the model is a probability vector of length 2400, indicating the probability that each sample is the travel time. We now describe the functions of the first three blocks in detail.

Table 1 Parameters of the network and changes of data shape

The main purpose of the first CNN block (the downsampling block) is to prepare the input for the following attention block: it expands each sample into a fixed-length vector and shortens the data. An original sequence has only two channels, that is, the vector dimension of each time step is 2. Such a low-dimensional vector is not suitable as input to the Transformer, so we use the CNN to extend the number of channels. The downsampling block contains four convolution layers. The size of each convolution layer is shown in Fig. 4; the padding scheme is 'same'; and only the first two convolution layers are followed by pooling layers. Although pooling layers add some translation invariance to the model, using too many of them would lose useful feature information. Although the downsampling block performs preliminary information extraction on the input data, it is mainly used to shorten the data length (from 2400 to 600) and increase the number of data channels (from 2 to 32). When discussing the Transformer, we further explain why the length is shortened and the vector dimension of each time step is expanded.

Fig. 4

Network architecture. The red rectangles are 1D convolutional layers. The convolutional layers read as (number of kernels) kr (kernel size). Maxpooling/2 means that the convolutional layer is followed by a pooling layer to shorten the data to 1/2 the original length. 2 Upsampling means that the convolutional layer is followed by the upsampling layer, and the data length is expanded to twice the original length. The purple rectangle is the attention block. The blue rectangle is the FC output block

In NLP tasks, the input of the encoder is not each word in a sentence, but the vector obtained for each word (time step) after embedding and positional encoding. To obtain a similar input, we need to transpose the output of the downsampling block. The output shape of the downsampling block is (n, 32, 600), meaning that each waveform has 32 channels and the length of each channel is 600. We transpose the output from (n, 32, 600) to (n, 600, 32), indicating that there are n sequences, each with 600 time steps (called tokens in NLP tasks), and the vector dimension of each time step is 32. In the original Transformer, the vector dimension of each token is much larger than 32, such as 512 or 1024. However, a dimension of 32 is sufficient for us, because seismic data is not particularly complicated compared with language, and many purely CNN-based seismic networks also expand the number of channels to 32 (Zhu and Beroza 2019; Zhou et al. 2019). After that, positional encoding is added to the vector of each time step to generate the input of the attention block.

Without the constraints of memory and computing power, the Transformer can theoretically encode infinitely long sequences. However, the computation of attention is expensive: its complexity scales as O(n²) with the sequence length, so as the sequence grows, the memory and computation increase rapidly. That is why, in the previous block, we use two pooling layers to reduce the sequence length to 600; in other words, each time step in the sequence output by the downsampling block corresponds to four samples in the original sequence. However, each time step in the output sequence cannot correspond to too many samples of the original sequence: if the input sequence is shortened excessively, its features may be lost, which is why we shorten the original sequence only to a quarter of its length. The output of the attention block is transposed back to the shape (n, 32, 600) to serve as the input of the next CNN upsampling block.

The main purpose of the second CNN block (the upsampling block) is to restore the time series to its original length and to further extract information after the attention distribution of the sequence has been obtained. After the downsampling block and the Transformer block, the length of the sequence is 600, while our output is a probability sequence with the same length as the original data, so we use the CNN to restore the sequence to its original length. The upsampling block also contains four convolution layers. The size of each convolution layer is shown in Fig. 4; the padding scheme is 'same'; and the last two convolution layers are preceded by upsampling layers that restore the data to the original length of 2400.

The output block is a fully connected network with a sigmoid activation function. It outputs a probability sequence in which each value represents the probability that the corresponding sample in the cross-correlation is the travel time. A minimal sketch of the full architecture is given below.
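The following PyTorch sketch assembles the four blocks as described above. The exact kernel sizes, intermediate channel widths, number of encoder layers and attention heads are specified in Table 1 and Fig. 4; the values used here are placeholders, and the positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class TravelTimePicker(nn.Module):
    """Minimal sketch of the CNN + Transformer-encoder picker.
    Kernel sizes, intermediate widths, encoder depth and head count are
    placeholders (see Table 1 / Fig. 4 for the actual configuration)."""

    def __init__(self, d_model=32, n_heads=4, n_layers=2, dropout=0.3):
        super().__init__()
        # Downsampling block: four conv layers, the first two followed by
        # max-pooling; channels 2 -> 32, length 2400 -> 600.
        self.down = nn.Sequential(
            nn.Conv1d(2, 8, 7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(8, 16, 7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, 7, padding=3), nn.ReLU(),
            nn.Conv1d(32, d_model, 7, padding=3), nn.ReLU(),
        )
        # Attention block: Transformer encoder on (batch, 600, 32) tokens.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=128,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # Upsampling block: four conv layers, the last two preceded by
        # upsampling layers; length 600 -> 2400.
        self.up = nn.Sequential(
            nn.Conv1d(d_model, 32, 7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 16, 7, padding=3), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv1d(16, 8, 7, padding=3), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv1d(8, 8, 7, padding=3), nn.ReLU(),
        )
        # Output block: two fully connected layers with dropout and sigmoid.
        self.out = nn.Sequential(
            nn.Linear(8, 8), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(8, 1), nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (n, 2, 2400)
        h = self.down(x)                         # (n, 32, 600)
        # Positional encoding is omitted in this sketch.
        h = self.encoder(h.transpose(1, 2))      # (n, 600, 32)
        h = self.up(h.transpose(1, 2))           # (n, 8, 2400)
        p = self.out(h.transpose(1, 2))          # (n, 2400, 1)
        return p.squeeze(-1)                     # probability sequence (n, 2400)
```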

When picking the travel time of an ambient noise cross-correlation, some simple preparatory work is needed to generate data suitable for the network input. First, we calculate the recursive STA/LTA ratios starting from the left (negative part) and the right (positive part) of the cross-correlation, respectively. Then, we scale the amplitude of the cross-correlation to the range −1 to 1 and the amplitudes of the two recursive STA/LTA ratio curves to the range 0 to 1. Because the travel times of the negative and positive correlation lags are picked separately, we split the cross-correlation into its positive and negative parts. For the negative part, we take the left half of the recursive STA/LTA ratio calculated from the left of the cross-correlation (Fig. 2c) and combine it with the negative part to form two-channel negative data. For the positive part, we take the right half of the recursive STA/LTA ratio calculated from the right of the cross-correlation (Fig. 2b) and combine it with the positive part to form two-channel positive data. The positive and negative two-channel data (the first channel is the positive or negative part, as shown in Fig. 3a, and the second channel is its corresponding recursive STA/LTA ratio, as shown in Fig. 3b) are then input into the model, which outputs a probability sequence with the same length as the input. The location of the maximum value in the sequence is the travel time. A minimal sketch of this picking step is given below.
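Assuming the recursive_sta_lta function and the TravelTimePicker sketch introduced above (both hypothetical names), picking one already-prepared lag might look like the following; the normalization details are simplified.

```python
import numpy as np
import torch

def pick_lag_travel_time(lag, ratio, model, dt=1.0):
    """Pick the travel time of one (positive or negative) correlation lag.

    lag   : 2400-sample waveform of the lag (numpy array)
    ratio : its recursive STA/LTA curve, prepared as described above
    model : trained TravelTimePicker (sketch above)
    """
    lag = lag / np.abs(lag).max()                                  # scale to -1 ~ 1
    ratio = (ratio - ratio.min()) / (ratio.max() - ratio.min())    # scale to 0 ~ 1
    x = np.stack([lag, ratio])[None].astype(np.float32)            # (1, 2, 2400)
    with torch.no_grad():
        prob = model(torch.from_numpy(x))[0].numpy()               # probability sequence
    return prob.argmax() * dt                                      # travel time in seconds
```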

Result and discussion

For convolutional and fully connected layers, all weight parameters were initialized with a Xavier uniform initializer and biases were set to 0. The optimizer is adaptive moment estimation with weight decay (AdamW), the loss function is the binary cross entropy used for binary classification problems, the number of epochs is 100, and the learning rate is 1e-4 for the first 60 epochs and 1e-5 thereafter. We adopt three measures to avoid overfitting: (1) data augmentation: random Gaussian white noise is added to 20% of the training data, and for each cross-correlation function we randomly select its positive or negative part; (2) dropout: there is a dropout layer with a dropout probability of 0.3 between the two fully connected layers in the output block; and (3) the AdamW optimizer: AdamW combines adaptive moment estimation (Adam) with decoupled weight decay, and weight decay, like L2 regularization, helps prevent overfitting. The model took 11 h to train on a TITAN Xp GPU under the PyTorch framework (Paszke et al. 2019). Figure 5 shows the evolution of the training process. In the first few epochs, the loss decreases rapidly; from about the 15th epoch the decrease slows, and after ~70 epochs the model converges to the optimal solution (Fig. 5). The losses of the training and validation sets are very close throughout the process. After 100 epochs, the losses of the training set and validation set are 0.030 and 0.032, respectively, and the model achieves good performance.
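A hedged sketch of this training configuration is shown below; the data loader is a dummy stand-in, and the weight-decay coefficient is an assumed value rather than one reported in the text.

```python
import torch
import torch.nn as nn

model = TravelTimePicker()                      # architecture sketch from above
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)  # weight decay assumed
# Drop the learning rate from 1e-4 to 1e-5 after epoch 60.
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[60], gamma=0.1)
loss_fn = nn.BCELoss()                          # binary cross entropy

# Dummy stand-in for the real training data (shapes as described in the text).
x = torch.randn(64, 2, 2400)
y = torch.zeros(64, 2400)
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(x, y), batch_size=16, shuffle=True)

for epoch in range(100):
    model.train()
    for xb, yb in train_loader:
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
    sched.step()
    # ... evaluate on the validation set here ...
```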

Fig. 5

Loss curves of training set and validation set. a the training loss against epoch number, b the validation loss against epoch number

To show that our model is superior to traditional models and methods, we compare it with a single CNN model (without the attention block) and with another travel time picking method (maximum amplitude selection). We compare with a CNN because it is a traditional neural network architecture for seismic phase picking. Five metrics are used to evaluate the performance of our model: precision, recall, F1 score, and the mean (μ) and standard deviation (σ) of the time residuals (Δt). The time residual is the difference between the picked travel time and the true travel time. Precision, recall and F1 score are standard performance metrics, defined as:

$${\text{Precision}} = P = \frac{T_{p}}{T_{p} + F_{p}}$$
(7)
$${\text{Recall}} = R = \frac{T_{p}}{T_{p} + F_{n}}$$
(8)
$$F1 = 2\frac{P \times R}{P + R},$$
(9)

where \(T_{p}\) is the number of true positives, \(F_{p}\) is the number of false positives and \(F_{n}\) is the number of false negatives. Travel time residuals (Δt) smaller than 2 s are counted as true positives, and larger residuals are considered false positives. The F1 score balances P and R. 500 cross-correlations with high SNR (SNR ≥ 15) and 500 cross-correlations with low SNR (SNR < 15) were used to compare our model with the single CNN model. The results are shown in Table 2. Since maximum amplitude selection always picks a travel time for each cross-correlation, its precision, recall and F1 score are equal. For high SNR data, the result of our method is slightly better than that of the CNN, but for low SNR data our method achieves significant improvements, and overall our model is markedly better than the traditional method. We can draw three conclusions from Table 2: (a) the attention block is important; (b) the method is effective for low SNR data; and (c) the method outperforms the traditional model and method.
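The sketch below shows how these metrics and the residual statistics can be computed under the 2 s tolerance; treating cross-correlations without any pick as false negatives is an assumption of the example, not a rule stated in the text.

```python
import numpy as np

def evaluate_picks(picked, true, tol=2.0):
    """Precision, recall, F1 (Eqs. 7-9) and residual statistics.

    picked : picked travel times in seconds (np.nan where no pick was made)
    true   : reference travel times in seconds
    tol    : residual tolerance in seconds (2 s in the text)
    """
    picked, true = np.asarray(picked, float), np.asarray(true, float)
    has_pick = ~np.isnan(picked)
    residual = picked[has_pick] - true[has_pick]
    tp = np.sum(np.abs(residual) < tol)     # residual < 2 s  -> true positive
    fp = np.sum(np.abs(residual) >= tol)    # larger residual -> false positive
    fn = np.sum(~has_pick)                  # missed pick     -> false negative (assumed)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, residual.mean(), residual.std()
```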

Table 2 Evaluation metrics on the test dataset

We applied our model to Global Empirical Greens Tensors (see Data and Resources), another ambient noise cross-correlation dataset from IRIS. The dataset contains cross-correlations at two different scales: global and continental. Cross-correlations at the global scale are extracted from seismic data recorded by GSN, GEOFON and other broadband stations, while cross-correlations at the continental scale come from selected broadband regional networks and temporary deployments. The time series are normalized in the time and frequency domains with a frequency-time normalization method (Shen et al. 2012). The continental-scale region we chose is North America. At the global scale, the sampling interval is 1.0 s, the number of samples is 14,401, the lag length is 7200 s, and the bandpass is the 20–600 s period band. At the continental scale, the sampling interval is 1.0 s, the number of samples is 7201, the lag length is 3600 s, and the bandpass is the 8–300 s period band.

Figure 6 illustrates the travel time picking results for randomly chosen cross-correlations from the test dataset. Our model performs well on cross-correlations at both the continental scale (Fig. 6a–c) and the global scale (Fig. 6d–h), which also shows that the model has good generalization ability. For cross-correlations with low SNR, the model can still pick the travel times (Fig. 6e, h). When the SNR is very low, although the probability curve fluctuates on both sides of the travel time, it still has clear peaks (the positive part in Fig. 6h). The negative part in Fig. 6b, the positive part in Fig. 6g and the positive part in Fig. 6h, whose travel times are submerged by noise, also show that our model does not simply pick the maximum amplitude.

Fig. 6

The performance of the model on cross-correlations at different scales and with different SNR. a–c are cross-correlations at the continental scale (North America), d–h are cross-correlations at the global scale

Surface wave velocity varies significantly between regions because of geological structure and other factors. To assess the accuracy of the travel times picked by our model, we selected cross-correlations in a small area (maximum interstation distance less than 1000 km) as test data. First, the model was used to pick the travel times of 100 cross-correlations with high SNR (SNR > 50) in the area. We calculated the surface wave velocity of each cross-correlation from the interstation distance and the picked travel time, and then computed the average surface wave velocity (3.106 km/s). After that, we used the model to pick the travel times of 500 cross-correlations with low SNR (10 < SNR < 30). If the absolute difference between the average surface wave velocity and the velocity calculated from the interstation distance and the picked travel time exceeds 0.5 km/s, the pick is considered incorrect. We found that 97.2% of the travel times of the 500 low SNR cross-correlations were picked accurately.
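A small sketch of this consistency check, assuming a reference velocity estimated beforehand from the high-SNR picks:

```python
import numpy as np

def velocity_check(distances_km, travel_times_s, v_ref, tol=0.5):
    """Fraction of picks whose apparent velocity (distance / travel time)
    lies within `tol` km/s of the reference velocity `v_ref` estimated
    from high-SNR cross-correlations in the same area."""
    v = np.asarray(distances_km, float) / np.asarray(travel_times_s, float)
    return np.mean(np.abs(v - v_ref) <= tol)

# e.g. v_ref = 3.106 km/s, the average from 100 high-SNR picks in the area
```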

We then looked for the lowest average SNR of a cross-correlation for which the model could pick the travel time. The specific steps are as follows: first, select a low benchmark SNR to narrow the search, then choose a cross-correlation with an SNR lower than the benchmark and observe whether the model can accurately pick the travel time. If the travel time can be picked and the probability curve on both sides of the travel time does not fluctuate greatly, continue selecting cross-correlations with lower SNR until the model can no longer pick the travel time effectively. The picking results for cross-correlations from high to low SNR are shown in Fig. 7. Rounding the SNR to one decimal place, we found that the lowest SNR for which the travel time can be accurately picked is 9.3 (Fig. 7e). When the SNR is lower than 9.3, the performance of the model deteriorates (Fig. 7f). As most studies only use cross-correlations with SNR larger than 10 (Luo et al. 2020), the performance of our model is sufficient. The travel times picked by the network in Fig. 7b, e, f demonstrate that our model does not simply pick the maximum amplitude.

Fig. 7

Search for the lowest average SNR of a cross-correlation for which our model can pick the travel time. The SNR decreases continuously from a to f. The SNR of the cross-correlation in e is 9.3, the lowest we found. The SNR of the cross-correlation in f is 8.4, for which the model performs poorly

The travel time of an ambient noise cross-correlation is affected by the noise source, geological structure and instrument response (Stehly et al. 2007): (1) a physical change in the medium results in either a faster or slower travel time measured in both the positive and negative correlation lags; (2) a clock error in one of the two stations or a change of the phase response of one of the sensors produces a time-shift of the whole cross-correlation, resulting in a larger apparent travel time in the positive lag and a smaller one in the negative lag, or vice versa; (3) a change of the spatial distribution of noise sources affects the positive and negative correlation times independently, because they are sensitive to noise sources located in different regions. These factors can cause the positive and negative travel times of a cross-correlation to differ, so we modified a cross-correlation to simulate them. The data are changed by filtering and shifting, and the picking results are shown in Fig. 8. The original cross-correlation is filtered in the 200–400 s period band. Because ambient noise in different period bands has different sources, we bandpass filtered the cross-correlation to the 8–20 s and 20–40 s period bands, respectively (Fig. 8b, c). There are two kinds of translation transformations. The first is a left or right translation of the whole waveform (Fig. 8d, e), and the second moves the positive and negative lags closer to or away from zero time. For the second transformation, we added random noise to fill in the gaps created by the translation (Fig. 8f, g). We then used the model to pick the travel times, and it still performed well.

Fig. 8

The performance of the model on the transformed cross-correlations. a the original cross-correlation, b, c cross-correlations after bandpass filtering, d, e the cross-correlation function is shifted left and right as a whole, f, g the positive and negative lags are moved closer and away from the zero time

In the case of a perfectly isotropic distribution of sources, the cross-correlation between two stations is symmetric. If the density of sources is larger on one side than on the other, the cross-correlation is not symmetric in amplitude, but whether or not the noise source density is the same, the positive and negative travel times are identical (Stehly et al. 2006). Although the previously mentioned factors can make the positive and negative times of cross-correlations different, for most cross-correlations the positive and negative travel times are identical. Therefore, identical positive and negative travel times are also an important indication of the accuracy of the model's picks. We tested our model on 1000 cross-correlations for the difference between the picked travel times of the positive and negative parts. Among these 1000 cross-correlations, the minimum interstation distance is 36.25 km and the maximum is 2223.16 km; the minimum SNR is 12.3 and the maximum is 248.6. We counted the number of cross-correlations with different positive and negative travel times as functions of interstation distance and SNR, and the results are shown in Fig. 9a and b. The occurrence of different travel times of the two lags is only weakly related to SNR but strongly related to interstation distance: when the interstation distance exceeds 1100 km, the number of cross-correlations with different positive and negative travel times increases significantly (Fig. 9b). We attribute this to the fact that ANCC-CIEI, used for model training, is a small-scale dataset with stations in the area shown in Fig. 1. When the model is applied to cross-correlations at a small scale, it therefore performs better, but at a large scale the number of cross-correlations with different travel times of the two lags rises. As seen in Fig. 9d, 83.1% of the cross-correlations used for model training have an interstation distance of less than 1300 km, while most of the cross-correlations used for testing have an interstation distance greater than 1000 km (74%, Fig. 9e). Even so, the cross-correlations with different travel times of the two lags account for only 9.8% of the total. We then counted the distribution of the differences (the absolute value of the positive-lag travel time minus the negative-lag travel time) and found that most differences are within 2 s (Fig. 9c), which is acceptable for cross-correlations with interstation distances greater than 1100 km. The mean of the difference is 0.175 s and the standard deviation is 0.713 s.

Fig. 9

a, b distribution of cross-correlations with different travel times of positive and negative parts in terms of interstation distance and SNR, c distribution of differences, d distribution of interstation distance of the cross-correlations used for training the model, e distribution of interstation distance of the cross-correlations used for testing

To further assess the performance of our model, two well-known neural network (NN) phase pickers, PhaseNet (Zhu et al. 2019) and EQTransformer (Mousavi et al. 2020), were used to compare travel time picking ability with our model. PhaseNet is a U-Net-based NN model, and EQTransformer, which has state-of-the-art earthquake detection and phase picking performance, combines CNN, RNN and Transformer components. We tested 200 ambient noise cross-correlations with different SNRs and interstation distances, feeding the positive and negative lags of each waveform separately into the models. While the two models can pick a phase near the travel time, there was a significant deviation between the picked phase and the travel time, and the lower the SNR, the greater the deviation (the average deviation was 37.88 s). Figure 10 shows the result of picking the travel time of one lag with PhaseNet, EQTransformer and our model, from which it can be seen that the travel times picked by the NN phase pickers deviate from the real travel time. Although EQTransformer has state-of-the-art phase picking performance, on high SNR data the phase picked by PhaseNet was closer to the true travel time. For ambient noise cross-correlations with low SNR, both models identified the data as noise; the lowest SNRs at which EQTransformer and PhaseNet could pick phases were ~11.31 and ~14.07, respectively. In addition, when the interstation distance is small and the travel time is close to zero, the models did not perform well. We attribute these results to the fact that phase pickers are designed to pick body waves, so they do not work well on surface waves.

Fig. 10

The result of picking travel time of a lag with a PhaseNet, b EQTransformer and c our model. The purple dashed line is travel time picked by models, and the red line is the true travel time

The self-attention mechanism of the Transformer encoder is a crucial part of our model; it makes the model pay more attention to the travel time part of the cross-correlation and also reduces the influence of noise to a certain extent. The cross-correlation is first processed by the CNN downsampling block, which shortens the sequence from 2400 to 600, so the attention vector output by the encoder at each time step corresponds to 4 samples in the original data. Figure 11 shows the attention distribution of several cross-correlations after passing through the attention block. For ease of observation, we have enlarged the original values. The darker the color in the bar below the cross-correlation, the more attention that part of the data receives. For different cross-correlations, the dark areas are concentrated around the travel time, indicating that the attention mechanism plays its intended role in the model.

Fig. 11

Attention visualizations for different cross-correlations. Dashed lines indicate the travel times. The darker the color in the bar, the more attention is paid to the corresponding part of the cross-correlation

To illustrate the formation of the self-attention in more detail, Fig. 12 shows the attention map of the 600 × 600 2-D attention matrix for the data in Fig. 11a; the values in the matrix are again enlarged. The darker the color in the attention map, the higher the attention score. Since each time step corresponds to a vector after the downsampling block, a sequence corresponds to a matrix. Multiplying this matrix by its own transpose and then applying a Softmax operation to each row gives the self-attention map in Fig. 12b, and summing each column of the self-attention matrix gives the attention bar in Fig. 12a. Each value in the self-attention map represents the strength of the relationship between the sample of the cross-correlation at its row position and the sample at its column position. The values on the diagonal of the matrix are large because each sample is strongly related to itself. In the self-attention map, the color of the part between the red dashed lines, which corresponds to the travel time, is clearly darker than elsewhere.
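A hedged sketch of how such a map and bar can be derived from the token matrix output by the downsampling block is given below; the \(\sqrt{d_k}\) scaling of Eq. (5) is omitted here, following the description above.

```python
import torch
import torch.nn.functional as F

def attention_map_and_bar(tokens):
    """Visualized self-attention as described above.

    tokens : tensor of shape (600, 32), one 32-dim vector per time step
             output by the downsampling block.
    Returns the 600 x 600 self-attention map (Fig. 12b) and the
    column-sum attention bar (Fig. 12a)."""
    scores = tokens @ tokens.T            # (600, 600) similarity matrix
    attn = F.softmax(scores, dim=1)       # softmax over each row
    bar = attn.sum(dim=0)                 # column sums -> attention bar
    return attn, bar
```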

Fig. 12

a the cross-correlation in Fig. 11a. The dashed line indicates the travel time. b attention map of the cross-correlation

Conclusion

The travel time of the seismic ambient noise cross-correlation is important in several areas of geoscience research. However, it is difficult to accurately extract the travel time of a noise cross-correlation function with low SNR. Based on the widely used CNN and Transformer, we present a deep neural network model to pick the travel time of the ambient noise cross-correlation, which allows us to obtain accurate travel times. After training the model with 100,000 waveforms from the ANCC-CIEI dataset combined with the recursive STA/LTA method, it performs well on the Global Empirical Greens Tensors dataset, which contains ambient noise cross-correlations at the global and continental scales, so travel times of ambient noise cross-correlations at different scales can be picked. For cross-correlations with low SNR, our model can also pick the travel time; the lowest SNR at which travel times can be picked is 9.3, which shows that our method is useful for low SNR data. We also selected some cross-correlations to test the accuracy of the travel times picked by the model, and 97.2% of the picked travel times were accurate. After filtering and shifting the cross-correlation function, the travel time can still be picked accurately. The self-attention mechanism works well: it focuses the model's attention on the travel time part of the cross-correlation. Compared with a single CNN and a traditional method, the model shows a significant improvement on cross-correlations with low SNR. In general, the travel times picked by our model are reliable and accurate. Given these accurate results, our method can be applied to clock error measurement of stations, inversion of the geological structure between two stations, exploration of the noise source distribution around stations, body wave extraction and other applications.

Data and resources

The two ambient noise cross-correlation datasets, ANCC-CIEI (http://ds.iris.edu/ds/products/ancc-ciei/, last accessed June 2022) and Global Empirical Greens Tensors (http://ds.iris.edu/ds/products/globalempiricalgreenstensors/, last accessed June 2022), which are used for training and testing, respectively, can be downloaded from IRIS. We used Pytorch, a deep-learning framework for Python, to train the model (the latest version of Pytorch is available at https://pytorch.org/, last accessed June 2022). Most figures were generated using Matplotlib (Hunter 2007), a comprehensive library for creating visualizations in Python (https://matplotlib.org/, last accessed June 2022).