Abstract
The travel time of ambient noise cross-correlation is widely used in geophysics, but traditional methods for picking it are difficult to apply to data with a low signal-to-noise ratio (SNR), rely on assumptions that often fail in realistic situations, or require complex calculations. Here, we present a neural network that combines convolutional neural networks (CNNs) and a Transformer to pick the travel time of ambient noise cross-correlations. The CNNs expand the vector dimension of each time step to form the Transformer input, the Transformer focuses the model’s attention on the key parts of the sequence, and the model derives the travel time from that attention. A dataset of 102,000 cross-correlations is used to train, validate and test the network. Compared with traditional methods, the approach is easy to use and performs better, especially on low-SNR data. We then test the model on another ambient noise cross-correlation dataset, which contains cross-correlations from different regions and at different scales, and the model performs well on it. The experiments show that the travel time can be picked for ambient noise cross-correlation functions with an average SNR as low as 9.3. Of the picked travel times, 97.2% are accurate, and the positive and negative travel times are identical for most cross-correlations (90.2%). Because it picks travel times accurately, our method can be applied to seismic instrument performance verification, seismic velocity imaging, source location and other applications.
Introduction
Because ambient noise cross-correlation does not depend on the location of the source and the noise can be collected continuously, related methods are widely used in earth science research. They have produced many meaningful results in ambient noise tomography (Shapiro et al. 2005; Lin et al. 2008; Chen et al. 2018; Zhang et al. 2020a, b), temporal variations of the surface wave velocity (Grêt et al. 2006; Wegler and Sens-Schönfelder 2007; Brenguier et al. 2008) and source location (Xie et al. 2020). The travel time of the ambient noise cross-correlation plays an important role in these studies. In particular, when noise cross-correlations are used to verify seismic instrument performance, accurate travel time picking is crucial (Stehly et al. 2007; Gouédard et al. 2014; Ye et al. 2018). There are many methods to measure travel time (Stehly et al. 2007; Tsai 2009; Djebbi and Alkhalifah 2014), but they are difficult to apply to data with a low signal-to-noise ratio (SNR), rely on assumptions that often fail in realistic situations, or require complex calculations. For example, the dispersion curve of the surface wave can be used to obtain the travel time of the ambient noise cross-correlation in most cases, but it is difficult to pick an accurate travel time when the frequency band is narrow and the frequency is high. This problem becomes even more prominent when the SNR of the ambient noise is low. For a correlation with high SNR, the travel time is the position of the maximal amplitude, but for low-SNR data this peak is disturbed or even submerged by noise. Stacking short-term noise cross-correlation functions over long time periods can improve the SNR, but the longer the stacking period, the lower the temporal resolution (Yang et al. 2022).
The methods proposed by Djebbi and Alkhalifah (2014) and Tsai (2009) use formulas to obtain accurate travel times but make assumptions that can fail in many realistic situations. A more general and effective method that can pick the travel time accurately is therefore needed.
In recent years, deep learning has been widely used in seismology. Many deep learning methods have been applied to the study of seismic waves (Perol et al. 2018; Viens and Van Houtte 2020; Zhang et al. 2020a, b; Song et al. 2021), but there are few models for ambient noise. In particular, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are widely used in earthquake location (Huang et al. 2018; Mousavi et al. 2019), seismic wave denoising (Zhu et al. 2019; Novoselov et al. 2022) and seismic phase picking (Zhou et al. 2019; Zhu et al. 2019; Chai et al. 2020). Among these applications, seismic phase picking is the most similar to our task, but models for phase picking are not suitable for travel time measurement. Unlike seismic waves from earthquakes, ambient noise consists mostly of elastic surface waves. Seismic phase picking extracts the first arrival of the body wave, whereas travel time picking extracts the arrival of the surface wave. Surface waves travel more slowly through earth material at the planet’s surface and have lower frequencies, larger amplitudes and longer wavelengths than body waves. In addition to the differences between seismic waves and ambient noise cross-correlations, there are some problems with applying CNNs and RNNs to seismic data. For a CNN, the receptive field is limited by the kernel size, so most CNN models stack many layers to expand the receptive field, which increases the number of parameters. An RNN is difficult to train on long sequences. Therefore, many RNNs for phase picking merge multiple points into one to shorten the seismic record, which loses information. The Transformer proposed by Vaswani et al. (2017), which relies entirely on the self-attention mechanism, handles the long-term dependence that is difficult for RNNs, and the self-attention mechanism can be regarded as a kind of global receptive field.
So, we combine a CNN and a Transformer to pick the travel time. When a sequence is fed into a Transformer, each time step should be a vector rather than a scalar. A CNN module can therefore be used to increase the number of channels of the cross-correlation, which turns each time step from a scalar into a vector. The CNN can also shorten the sequence length appropriately. The core of the whole model is still the Transformer.
Our basic idea is to train a combined CNN and Transformer model on a large dataset to pick the travel time of ambient noise cross-correlations. There are two CNN blocks: the first expands the number of channels and shortens the sequence length, and the second restores the sequence length. Both also perform preliminary feature extraction. To supply more initial information, we add a second channel to the original sequence. After the positive and negative cross-correlation lags are fed into the model, the output is a vector of probabilities, each between 0 and 1, indicating the probability that the corresponding sample in the cross-correlation is the travel time. The position of the maximum value in the probability vector is the travel time. We expect that the model can be applied to cross-correlations from different regions, that travel times submerged by noise can also be picked, and that most of the picked positive and negative travel times do not differ. To investigate the applicability of our method, we apply the trained model to another ambient noise cross-correlation dataset, Global Empirical Greens Tensors, on which it shows good generalization ability.
Data and method
Data
We used an ambient noise cross-correlation dataset from the Incorporated Research Institutions for Seismology (IRIS), ANCC-CIEI (see Data and Resources), which contains data from 622 USArray Transportable Array stations west of 105°W longitude (Fig. 1) recorded between 2005/1/1 and 2010/12/31. The dataset contains 171,120 ambient noise cross-correlation waveforms (in SAC format). Four traditional ambient noise data processing procedures were used to process these correlations: (1) single station data preparation, (2) cross-correlation and temporal stacking, (3) measurement of dispersion curves and (4) quality control. The sampling interval of the cross-correlations is 1.0 s, the number of samples is 7201, the lag length is 3600 s, and the bandpass is 15–50 s period. We define SNR as the ratio of the peak signal within the signal window to the root-mean-square (rms) noise in the trailing noise window (Bensen et al. 2007). Using this definition, we calculated the average SNR of each cross-correlation and found that 57.9% of the cross-correlations had an SNR larger than 30, 29.9% had an SNR between 15 and 30, and 12.2% had an SNR less than 15.
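As a minimal sketch of this SNR definition (not the processing code used to build the dataset), the ratio can be computed as follows. The function name and the window arguments are hypothetical: the text does not state how the signal and noise windows are chosen, so they are passed in explicitly as sample index ranges.

```python
import math

def snr(trace, signal_window, noise_window):
    """Peak absolute amplitude in the signal window over RMS of the noise window.

    signal_window and noise_window are (start, stop) sample indices;
    how they are chosen is an assumption left to the caller.
    """
    s0, s1 = signal_window
    n0, n1 = noise_window
    peak = max(abs(x) for x in trace[s0:s1])
    noise = trace[n0:n1]
    rms = math.sqrt(sum(x * x for x in noise) / len(noise))
    return peak / rms

# Toy trace: one pulse of amplitude 10, followed by unit-amplitude noise.
trace = [0.0] * 50 + [10.0] + [0.0] * 49 + [1.0, -1.0] * 50
print(snr(trace, (0, 100), (100, 200)))
```

In the toy trace the peak is 10 and the trailing noise has an rms of 1, so the ratio is easy to check by hand.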
We randomly selected 80,000 cross-correlations from the dataset, of which 60,000 had an SNR larger than 15 and the other 20,000 had an SNR less than 15. For data augmentation, we added random Gaussian white noise to 20,000 correlations, which gives 100,000 cross-correlations in total. Of these, 95,000 randomly chosen cross-correlations are used for training and the remaining 5000 for validation. We selected 2000 ambient noise cross-correlations from another dataset (Global Empirical Greens Tensors, see Data and Resources) to test the performance of the model after training. The final dataset therefore contains 102,000 cross-correlations, split into a training set (93%), a validation set (5%) and a test set (2%).
Data with more initial information can make the model perform better, so we calculated the short-term average/long-term average ratios (STA/LTA) of cross-correlations. The recursive STA/LTA algorithm, which has faster calculation and higher signal pickup sensitivity than the traditional STA/LTA algorithm, is used here. The recursive STA/LTA ratio is calculated as follows:
where \(A_{i}\) is the amplitude of the \(i\)th sample, \({\text{CF}}_{i}\) is the characteristic function of the \(i\)th sample, and \(N_{s}\) and \(N_{l}\) are the time windows in number of samples used to compute the short-term average and long-term average, respectively. Here, \(N_{s}\) is 100 and \(N_{l}\) is 1200.
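A recursive STA/LTA ratio with these window lengths can be sketched as below. This is an illustration under stated assumptions rather than the exact formula used here: it assumes the characteristic function is the squared amplitude, \({\text{CF}}_{i} = A_{i}^{2}\), and that both averages are updated with the usual recursive rule, for example \({\text{STA}}_{i} = {\text{CF}}_{i} /N_{s} + \left( {1 - 1/N_{s} } \right){\text{STA}}_{i - 1}\). The function name is hypothetical.

```python
def recursive_sta_lta(amplitudes, n_s=100, n_l=1200):
    """Recursive STA/LTA ratio for a list of amplitudes.

    Assumptions: CF_i = A_i**2 and recursive (exponential-style) averages.
    The first n_l ratios are left at 0, as described in the text.
    """
    c_s, c_l = 1.0 / n_s, 1.0 / n_l
    sta, lta = 0.0, 1e-99          # tiny LTA start value avoids division by zero
    ratios = [0.0] * len(amplitudes)
    for i, a in enumerate(amplitudes):
        cf = a * a                 # assumed characteristic function
        sta = c_s * cf + (1.0 - c_s) * sta
        lta = c_l * cf + (1.0 - c_l) * lta
        if i >= n_l:
            ratios[i] = sta / lta
    return ratios

# Quiet background with a stronger burst starting at sample 2000.
signal = [0.1] * 2000 + [1.0] * 200 + [0.1] * 200
ratios = recursive_sta_lta(signal)
print(ratios.index(max(ratios)))   # index of the largest ratio, inside the burst
```

Because the short-term average reacts within about \(N_{s}\) samples while the long-term average reacts within about \(N_{l}\) samples, the ratio rises sharply shortly after the onset of a stronger arrival.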
Because our method picks the travel times of the positive and negative correlation lags separately, we select only the positive or negative part of each cross-correlation as training data. The positive parts of 50,000 cross-correlations and the negative parts of the other 50,000 are selected. The STA/LTA curves of the negative and positive parts should correspond to the same portion of the overall curve, so the STA/LTA calculation starts from different positions for the two parts. For the positive part, the STA/LTA ratio is calculated from the right side of the cross-correlation and the right half is taken (the part between the two dashed lines in Fig. 2b). For the negative part, the STA/LTA ratio is calculated from the left side of the cross-correlation and the left half is taken (the part between the two dashed lines in Fig. 2c). With this calculation and selection, both the positive and negative parts correspond to the first half of the overall STA/LTA curve. Since \(N_{l}\) is 1200, the STA/LTA ratios of the first 1200 samples are 0 by default. To eliminate the influence of these zeros during data normalization, we remove them together with the corresponding samples of the positive or negative part (Fig. 2a). We then combine the positive or negative correlation lag and its corresponding STA/LTA ratio into two-channel data.
We label travel times with a triangular function: the probability at the travel time is set to 1 and decreases linearly to 0 within 20 samples before and 20 samples after the travel time (Fig. 3a). Finally, there are 100,000 samples for training and validation, each with two channels: (1) the positive correlation lag or the negative correlation lag (Fig. 3a); (2) the corresponding recursive STA/LTA ratio (Fig. 3b).
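The triangular label can be written down directly. The sketch below is illustrative; the function name and the example pick index are hypothetical, while the half-width of 20 samples and the length of 2400 follow the text.

```python
def triangular_label(length, pick, half_width=20):
    """Label vector: 1 at `pick`, decreasing linearly to 0 at +/- half_width samples."""
    label = [0.0] * length
    start = max(0, pick - half_width)
    stop = min(length, pick + half_width + 1)
    for i in range(start, stop):
        label[i] = 1.0 - abs(i - pick) / half_width
    return label

label = triangular_label(2400, pick=500)
print(label[490], label[500], label[510])   # 0.5 1.0 0.5
```

Compared with a single-sample label, the triangular shape gives the loss a non-zero target near the pick, so a prediction that is off by a few samples is penalized less than one that is far away.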
Method
CNN is a kind of feedforward neural network with deep structure. CNNs, consisting of convolutional layers, activation layers and pooling layers, have powerful feature extraction capabilities and are mainly used for multi-channel data processing. Compared with fully connected networks, CNNs have a parameter-sharing scheme, which can largely reduce the number of trainable parameters and thus makes CNNs go deeper to learn the complex relationship between inputs and outputs. Many classic CNNs (such as VGG and ResNet) are mainly used for image recognition and object localization, and have achieved excellent performance (Simonyan and Zisserman 2014; He et al. 2016).
The Transformer proposed by Vaswani et al. (2017) is a state-of-the-art sequence model in the field of natural language processing (NLP). It relies entirely on the self-attention mechanism, whose role is to find the dependencies between different samples (time steps) in a sequence. Self-attention assigns a weight to each time step, so that the model can pay more attention to the key parts of the data. The Transformer avoids the long-term dependence problem and the training difficulty of RNNs, and its performance can match or even exceed that of RNNs. The Transformer consists of an encoder and a decoder, which are designed for discriminative tasks and generative tasks, respectively. The encoder has three inputs, Query (Q), Key (K) and Value (V), which are identical in self-attention. The output is the self-attention computed from Q, K and V as follows:
\[{\text{Attention}}\left( {Q,K,V} \right) = {\text{softmax}}\left( {\frac{{QK^{T} }}{{\sqrt {d_{k} } }}} \right)V\]
where \(d_{k}\) is the length of the vector of each time step.
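To make the computation concrete, the following is a small pure-Python sketch of scaled dot-product self-attention on a toy sequence. It is not the implementation used in our model (which relies on a deep learning framework); the function names and the toy input are hypothetical.

```python
import math

def softmax(row):
    """Softmax of one row; the max is subtracted for numerical stability."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for lists of row vectors.

    q, k: (steps, d_k); v: (steps, d_v). Returns (output, weights).
    """
    d_k = len(k[0])
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in k] for q_row in q]
    weights = [softmax(row) for row in scores]
    out = [[sum(w * v_row[c] for w, v_row in zip(w_row, v))
            for c in range(len(v[0]))] for w_row in weights]
    return out, weights

# Self-attention: Q, K and V are the same sequence (3 time steps, d_k = 2).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(x, x, x)
print([round(w, 3) for w in weights[0]])
```

Each row of `weights` sums to 1 and describes how strongly one time step attends to every other time step; the output for that step is the correspondingly weighted average of the value vectors.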
The encoder and decoder are composed of multiple identical layers, and each layer has two sub-layers. The first sub-layer is a multi-head self-attention mechanism, which performs different linear transformations on Q, K, and V through h heads, and then concatenates the h attention heads:
\[{\text{MultiHead}}\left( {Q,K,V} \right) = {\text{Concat}}\left( {{\text{head}}_{1} , \ldots ,{\text{head}}_{h} } \right)W^{O}\]
in which \({\text{head}}_{i} = {\text{Attention}}\left( {{\text{QW}}_{i}^{Q} ,{\text{KW}}_{i}^{K} ,{\text{VW}}_{i}^{V} } \right)\). The second sub-layer is a fully connected feedforward network, which performs a nonlinear transformation on multi-head attention.
Our model combines the feature extraction capability of CNNs with the sequence processing capability of the Transformer. The CNNs expand the number of channels of the sequence and shorten or restore the length of the cross-correlation. The Transformer processes the sequence and uses the self-attention mechanism to make the model pay more attention to the travel time part of the cross-correlation. We use only the encoder of the Transformer because its output (the self-attention matrix) is what we need. The model consists of four parts: a downsampling block (CNN), an attention block (the Transformer encoder), an upsampling block (CNN) and an output block (fully connected network). The specific parameters and the changes in data shape are shown in Table 1. The input shape of the model is (n, 2, 2400), corresponding to n samples with two channels and 2400 time samples per channel. The first channel is the positive or negative correlation lag, and the second channel is the corresponding recursive STA/LTA ratio. The output of the model is a probability vector of length 2400, indicating the probability that each sample is the travel time. The functions of the first three blocks are described in detail below.
The main purpose of the first CNN block (downsampling block) is to prepare the input for the following attention block: it expands each sample to a fixed-length vector and shortens the sequence. An original sequence has only two channels, that is, the vector dimension of each time step is 2. Such a low-dimensional vector is not suitable as input to the Transformer, so we use a CNN to increase the number of channels. The downsampling block contains four convolution layers. The size of each convolution layer is shown in Fig. 4, the padding scheme is same, and only the first two convolution layers are followed by pooling layers. Although pooling layers add some translation invariance to the model, using too many of them would lose useful feature information. The downsampling block can perform preliminary information extraction on the input data, but it is mainly used to shorten the sequence length (from 2400 to 600) and increase the number of channels (from 2 to 32). The discussion of the Transformer below explains further why the length is shortened and the vector dimension of each time step is expanded.
In NLP tasks, the input of the encoder is not each word in a sentence, but the vector obtained by each word (time step) after embedding and positional encoding. In order to obtain a similar input, we need to transpose the output of the downsampling block. The output shape of the downsampling block is (n, 32, 600), representing that each data has 32 channels and the length of each channel is 600. We transpose the output from (n, 32, 600) to (n, 600, 32), indicating that there are n sequences, each sequence has 600 time steps which are called tokens in NLP tasks, and the vector dimension of each time step is 32. In the traditional Transformer, the vector dimension of each token is much larger than 32, such as 512 and 1024. But the vector dimension of 32 is enough for us, because seismic data is not particularly complicated compared with language. Many single CNN seismic neural networks also expand the data channel to 32 (Zhu and Beroza 2019; Zhou et al. 2019). After that, positional encoding is added to the vector of each time step to generate the input of the attention block.
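The transpose and the positional encoding step can be sketched as follows. The text does not state which positional encoding is used, so the sinusoidal encoding of Vaswani et al. (2017) is assumed here; the function name and the all-zero placeholder feature map are hypothetical.

```python
import math

def positional_encoding(n_steps=600, d_model=32):
    """Sinusoidal positional encoding with shape (n_steps, d_model).

    Assumes the encoding of Vaswani et al. (2017); d_model must be even.
    """
    pe = [[0.0] * d_model for _ in range(n_steps)]
    for pos in range(n_steps):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)       # even dimensions
            pe[pos][i + 1] = math.cos(angle)   # odd dimensions
    return pe

# One sequence from the downsampling block: 32 channels of length 600.
channels = [[0.0] * 600 for _ in range(32)]
tokens = [list(step) for step in zip(*channels)]   # transpose to (600, 32)
pe = positional_encoding()
encoder_input = [[t + p for t, p in zip(tok, enc)]
                 for tok, enc in zip(tokens, pe)]
print(len(encoder_input), len(encoder_input[0]))   # 600 32
```

Since self-attention by itself is insensitive to the order of the time steps, adding a position-dependent vector is what lets the encoder distinguish an early arrival from a late one.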
Without the constraints of memory and computing power, the Transformer can theoretically encode infinitely long sequences. However, attention is expensive to compute: its cost grows as \(O(n^{2})\) with the sequence length \(n\), so memory use and computation increase rapidly as the sequence gets longer. That is why the previous block uses two pooling layers to reduce the sequence length to 600. In other words, each time step in the sequence output by the downsampling block corresponds to four samples in the original sequence. However, a time step in the output sequence should not correspond to too many samples in the original sequence: if the input sequence is shortened excessively, its features may be lost, which is why we shorten the original sequence only to a quarter of its length. The output of the attention block is transposed back to the shape (n, 32, 600) to form the input of the following CNN upsampling block.
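The saving from shortening the sequence follows from simple arithmetic, sketched below. The helper names are hypothetical, and the mapping from a downsampled time step to four consecutive original samples assumes non-overlapping pooling windows.

```python
def attention_matrix_entries(seq_len):
    """Number of pairwise attention scores per head for one sequence."""
    return seq_len * seq_len

def samples_for_step(step, factor=4):
    """Original-sample indices covered by one downsampled time step
    (assumes non-overlapping pooling with a total factor of 4)."""
    return list(range(step * factor, (step + 1) * factor))

full = attention_matrix_entries(2400)
reduced = attention_matrix_entries(600)
print(full, reduced, full // reduced)   # 5760000 360000 16
print(samples_for_step(599))            # [2396, 2397, 2398, 2399]
```

Shortening the sequence by a factor of 4 therefore reduces the size of each attention matrix by a factor of 16, at the price of a time resolution of four samples inside the attention block.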
The main purpose of the second CNN block (upsampling block) is to restore the time series to its original length and to extract further information once the attention distribution of the sequence has been obtained. After the downsampling block and the attention block, the length of the sequence is 600, whereas the output must be a probability sequence with the same length as the original data, so we use a CNN to restore the sequence to the original length. The upsampling block also contains four convolution layers. The size of each convolution layer is shown in Fig. 4, the padding scheme is same, and each of the last two convolution layers is preceded by an upsampling layer to restore the data to the original length of 2400.
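The change in sequence length through the two CNN blocks can be traced with a toy example. This sketch only follows the lengths, not the convolutions; the pooling size of 2, the use of max pooling and the nearest-neighbour upsampling with a factor of 2 are assumptions chosen so that two layers give the factor of 4 described above.

```python
def max_pool(seq, size=2):
    """Non-overlapping max pooling along the time axis."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]

def upsample(seq, factor=2):
    """Nearest-neighbour upsampling: repeat each value `factor` times."""
    return [x for x in seq for _ in range(factor)]

channel = [float(i % 7) for i in range(2400)]   # one toy channel
down = max_pool(max_pool(channel))              # 2400 -> 1200 -> 600
up = upsample(upsample(down))                   # 600 -> 1200 -> 2400
print(len(channel), len(down), len(up))         # 2400 600 2400
```

With same padding, the convolution layers leave the length unchanged, so only the pooling and upsampling layers determine the sequence length at each stage.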
The activation function of the fully connected network output block is sigmoid. The output block is used to output a probability sequence. Each value in the sequence represents the probability that the corresponding sample in the cross-correlation is travel time.
To pick the travel time of an ambient noise cross-correlation, some simple preparation is needed to generate data suitable for the network input. First, we calculate recursive STA/LTA ratios starting from the left (negative part) and from the right (positive part) of the cross-correlation. Then, we scale the amplitude of the cross-correlation to the range −1 to 1 and the amplitudes of the two recursive STA/LTA ratio curves to the range 0 to 1. Because the travel times of the negative and positive correlation lags are picked separately, we split the cross-correlation into its positive and negative parts. For the negative part, we take the left half of the recursive STA/LTA ratio calculated from the left of the cross-correlation (Fig. 2c) and combine it with the negative part to form the two-channel negative data. For the positive part, we take the right half of the recursive STA/LTA ratio calculated from the right of the cross-correlation (Fig. 2b) and combine it with the positive part to form the two-channel positive data. We then feed the two-channel positive and negative data (the first channel is the negative or positive part, shown in Fig. 3a, and the second channel is its corresponding recursive STA/LTA ratio, shown in Fig. 3b) into the model. The model outputs a probability sequence with the same length as the input data, and the location of the maximum value in this sequence is the travel time.
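The scaling and the final pick can be sketched as below. The exact scaling formulas are not given in the text, so division by the maximum absolute amplitude and min-max scaling are assumptions, as are the function names. The `offset` argument is a hypothetical parameter for the lag (in seconds) of the first input sample, whose convention is not specified here.

```python
def scale_symmetric(x):
    """Scale to [-1, 1] by the maximum absolute amplitude (assumed scheme)."""
    peak = max(abs(v) for v in x) or 1.0
    return [v / peak for v in x]

def scale_unit(x):
    """Min-max scale to [0, 1] (assumed scheme)."""
    lo, hi = min(x), max(x)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in x]

def pick_travel_time(prob, dt=1.0, offset=0.0):
    """Position of the maximum probability, converted to seconds.

    dt is the sampling interval; offset is the assumed lag of sample 0.
    """
    idx = max(range(len(prob)), key=prob.__getitem__)
    return offset + idx * dt

print(scale_symmetric([2.0, -4.0, 1.0]))     # [0.5, -1.0, 0.25]
print(scale_unit([1.0, 3.0, 2.0]))           # [0.0, 1.0, 0.5]
print(pick_travel_time([0.1, 0.9, 0.3]))     # 1.0
```

The guard `or 1.0` avoids a division by zero for an all-zero or constant input, in which case the values are returned unscaled.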
Result and discussion
For convolutional and fully connected layers, all weights were initialized with a Xavier uniform initializer and biases were set to 0. The optimizer is adaptive moment estimation with weight decay (AdamW), the loss function is binary cross entropy, the standard loss for binary classification problems, the number of epochs is 100, and the learning rate is 1e-4 for the first 60 epochs and 1e-5 afterwards. We adopt three measures to avoid overfitting: (1) data augmentation: random Gaussian white noise is added to 20% of the data used for training, and for each cross-correlation function we randomly select its positive or negative part; (2) dropout: there is a dropout layer with a dropout probability of 0.3 between the two fully connected layers in the output block; and (3) the AdamW optimizer: AdamW combines adaptive moment estimation (Adam) with decoupled weight decay, which, like L2 regularization, penalizes large weights and helps prevent overfitting. Training took 11 h on a TITAN Xp GPU under the PyTorch framework (Paszke et al. 2019). Figure 5 shows the evolution of the training process. In the first few epochs, the loss decreases rapidly; from about the 15th epoch, the decrease slows, and after about 70 epochs the model converges (Fig. 5). The losses of the training set and the validation set are very close throughout the process. After 100 epochs, the losses of the training set and validation set are 0.030 and 0.032, respectively, and the model achieves good performance.
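The learning-rate schedule and the loss can be written out in a few lines. This is a framework-free sketch rather than the training code: the function names are hypothetical, epochs are assumed to be counted from 0, and the clamping constant `eps` is an assumption added for numerical safety.

```python
import math

def learning_rate(epoch):
    """1e-4 for the first 60 epochs (0 to 59), then 1e-5."""
    return 1e-4 if epoch < 60 else 1e-5

def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross entropy between predicted and label probabilities."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)

print(learning_rate(0), learning_rate(60))
print(binary_cross_entropy([0.5, 0.5], [1.0, 0.0]))   # ln 2, about 0.693
```

Because the labels are triangular rather than strictly 0 or 1, the targets passed to the loss are soft probabilities; binary cross entropy handles such targets without modification.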
To show that our model outperforms traditional models and methods, we compare it with a single CNN model (without the attention block) and with a conventional travel time picking method (maximum amplitude selection). We compare with a CNN because it is a traditional neural network architecture for seismic phase picking. Five metrics are used to evaluate performance: precision, recall, F1 score, and the mean (μ) and standard deviation (σ) of the time residuals (Δt). The time residual is the difference between the picked travel time and the true travel time. Precision, recall and F1 score are standard performance metrics, which are defined as:
\[P = \frac{{T_{p} }}{{T_{p} + F_{p} }},\quad R = \frac{{T_{p} }}{{T_{p} + F_{n} }},\quad F1 = \frac{2PR}{{P + R}}\]
where \(T_{p}\) is the number of true positives, \(F_{p}\) is the number of false positives and \(F_{n}\) is the number of false negatives. Picks with a travel time residual (Δt) of less than 2 s are counted as true positives, and picks with larger residuals are counted as false positives. The F1 score balances precision (P) and recall (R). We used 500 cross-correlations with high SNR (SNR ≥ 15) and 500 with low SNR (SNR < 15) to compare our model with the single CNN model. The results are shown in Table 2. Because maximum amplitude selection always picks a travel time for every cross-correlation, its precision, recall and F1 score are equal. For high-SNR data, our method is slightly better than the CNN, but for low-SNR data it achieves a significant improvement. Overall, our model is also markedly better than the traditional method. We can draw three conclusions from Table 2: (a) the attention block is important; (b) the method is effective for low-SNR data; and (c) the method is better than the traditional model and method.
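These metrics can be computed from the time residuals as sketched below. The function names and the example residuals are hypothetical. How false negatives are counted is not spelled out in the text; in the example we assume a picker that always returns a pick, and that each wrong pick is counted both as a false positive and as a false negative, which reproduces the equality of precision, recall and F1 score noted above for maximum amplitude selection.

```python
def count_picks(residuals, tol=2.0):
    """Split picks into true positives (|residual| < tol) and false positives."""
    tp = sum(1 for r in residuals if abs(r) < tol)
    return tp, len(residuals) - tp

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score from the three counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Residuals (picked minus true travel time, in s) for four toy picks.
residuals = [0.5, -1.0, 3.0, 0.0]
tp, fp = count_picks(residuals)
fn = fp   # assumption: a wrong pick also means the true travel time was missed
print(precision_recall_f1(tp, fp, fn))   # (0.75, 0.75, 0.75)
```

For a picker that may return no pick at all, `fn` would also include the traces without a pick, and precision and recall would in general differ.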
We applied our model to Global Empirical Greens Tensors (see Data and Resources), another ambient noise cross-correlation dataset from IRIS. The dataset contains cross-correlations at two different scales: global and continental. Cross-correlations at the global scale are extracted from seismic data recorded by GSN, GEOFON and other broadband stations. Cross-correlations at the continental scale are from selected broadband regional networks and temporary deployments. The time series are normalized in the time and frequency domains with a frequency-time-normalization method (Shen et al. 2012). For the continental scale, we chose cross-correlations from North America. The sampling interval of the cross-correlations at the global scale is 1.0 s, the number of samples is 14,401, the lag length is 7200 s, and the bandpass is 20–600 s period. The sampling interval of the cross-correlations at the continental scale is 1.0 s, the number of samples is 7201, the lag length is 3600 s, and the bandpass is 8–300 s period.
Figure 6 illustrates the travel time picking results for randomly chosen cross-correlations from the test dataset. Our model performs well on the cross-correlations at the continental scale (Fig. 6a–c) and at the global scale (Fig. 6d–h), which shows that it generalizes well. For cross-correlations with low SNR, the model can still pick the travel times (Fig. 6e, h). When the SNR is very low, the probability curve fluctuates on both sides of the travel time but still has peaks (the positive part in Fig. 6h). The negative part in Fig. 6b and the positive parts in Fig. 6g and h, whose travel times are submerged by noise, also suggest that our model does not simply pick the maximal amplitude.
Surface wave velocity varies significantly between regions because of geological structure and other factors. To assess the accuracy of the travel times picked by our model, we selected cross-correlations in a small area (maximum interstation distance less than 1000 km) as test data. First, the model was used to pick the travel times of 100 cross-correlations with high SNR (SNR > 50) in the area. We calculated the surface wave velocity of each cross-correlation from the interstation distance and the picked travel time, and then calculated the average surface wave velocity (3.106 km/s). After that, we used the model to pick the travel times of 500 cross-correlations with low SNR (10 < SNR < 30). A pick is considered incorrect if the surface wave velocity calculated from the interstation distance and the picked travel time differs from the average surface wave velocity by more than 0.5 km/s. We found that the travel times of 97.2% of the 500 low-SNR cross-correlations were picked accurately.
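The velocity criterion amounts to a one-line check per pick, sketched below. The function name and the example distances and travel times are hypothetical; the reference velocity of 3.106 km/s and the tolerance of 0.5 km/s are the values given above.

```python
def velocity_check(distance_km, travel_time_s, reference_kms=3.106, tol_kms=0.5):
    """Apparent surface wave velocity and whether it lies within the tolerance."""
    velocity = distance_km / travel_time_s
    return velocity, abs(velocity - reference_kms) <= tol_kms

# Two toy picks for the same station pair, 600 km apart.
print(velocity_check(600.0, 200.0))   # 3.0 km/s, accepted
print(velocity_check(600.0, 120.0))   # 5.0 km/s, rejected
```

The check is indirect: it verifies that a pick is physically plausible for the area rather than comparing it with an independently measured travel time.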
We then looked for the lowest average SNR at which the model could still pick the travel time. The steps are as follows. First, select a low benchmark SNR so that the lowest SNR can be found more quickly. Then, choose a cross-correlation with an SNR lower than the benchmark and observe whether the model picks the travel time accurately. If the travel time can be picked and the probability curve does not fluctuate greatly on either side of the travel time, continue with a cross-correlation of lower SNR until the model can no longer pick the travel time effectively. The picking results for cross-correlations from high to low SNR are shown in Fig. 7. Rounding the SNR to one decimal place, we found that the lowest SNR at which the model accurately picks the travel time is 9.3 (Fig. 7e). When the SNR is lower than 9.3, the performance of the model deteriorates (Fig. 7f). As most studies use only cross-correlations with SNR larger than 10 (Luo et al. 2020), this performance is sufficient. The travel times picked by the network in Fig. 7b, e, f demonstrate that our model does not simply pick the maximal amplitude.
The travel time of an ambient noise cross-correlation is affected by the noise source, geological structure and instrument response (Stehly et al. 2007): (1) a physical change in the medium would result in either a faster or slower travel time measured in both the positive and negative correlation lags; (2) a clock error in one of the two stations or a change in the phase response of one of the sensors would produce a time shift of the whole cross-correlation, resulting in a larger travel time at positive lags and a smaller apparent travel time at negative lags, or vice versa; (3) a change in the spatial distribution of noise sources in the medium should affect the positive and negative correlation times independently, because they are sensitive to noise sources located in different regions. These factors can cause the positive and negative travel times of a cross-correlation to differ, so we modified a cross-correlation to simulate them. The data were changed by filtering and shifting, and the picking results are shown in Fig. 8. The original cross-correlation is filtered between 200 and 400 s period. Because the sources of ambient noise differ between period bands, we bandpass filtered the cross-correlation to 8–20 s period and 20–40 s period, respectively (Fig. 8b, c). There are two kinds of translation. The first shifts the whole waveform to the left or right (Fig. 8d, e), and the second moves the positive and negative lags closer to or farther from zero time. For the second kind, we added random noise to fill the gaps created by the translation (Fig. 8f, g). We then used the model to pick the travel times, and it still performed well.
In the case of a perfectly isotropic distribution of sources, the cross-correlation between two stations is symmetric. If the density of sources is larger on one side than on the other, the cross-correlation is not symmetric in amplitude. However, whether or not the noise source density is the same, the positive and negative travel times are identical (Stehly et al. 2006). Although the factors mentioned above can make the positive and negative travel times differ, they are identical for most cross-correlations. Therefore, agreement between the positive and negative travel times is also an important indicator of picking accuracy. We tested our model on 1000 cross-correlations to examine the difference between the picked travel times of the positive and negative parts. Among these 1000 cross-correlations, the interstation distance ranges from 36.25 to 2223.16 km and the SNR ranges from 12.3 to 248.6. We counted the cross-correlations whose positive and negative travel times differ in terms of interstation distance and SNR, and the results are shown in Fig. 9a and b. The SNR has little relationship with the occurrence of different travel times for the two lags, whereas the interstation distance has a significant one. When the interstation distance exceeds 1100 km, the number of cross-correlations with different positive and negative travel times increases significantly (Fig. 9b). We attribute this to the fact that the ANCC-CIEI dataset used for model training is a small-scale dataset with data from stations in the area shown in Fig. 1. So, when the model is applied to cross-correlations at a small scale, it performs better.
But when the model is applied to cross-correlations at a large scale, the number of cross-correlations with different travel times of the two lags rises. As can be seen in Fig. 9d, 83.1% of the cross-correlations used for model training have an interstation distance less than 1300 km, while most of the cross-correlations used for testing have an interstation distance greater than 1000 km (74%, Fig. 9e). Even so, the number of cross-correlations with different travel times of the two lags only accounted for 9.8% of the total. Then, we counted the number of different differences (the difference is the absolute value of the travel time of positive part minus the travel time of negative part) and found that most of the differences are within 2 s (Fig. 9c) which is acceptable for cross-correlations with interstation distance greater than 1100 km. The mean of the difference is 0.175 s and the standard deviation is 0.713 s.
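The consistency check between the two lags can be expressed compactly. The sketch below is an illustration under stated assumptions rather than the evaluation code of the study: the function name `lag_consistency`, the tolerance parameter `tol`, the use of the population standard deviation, and the four travel-time pairs are all hypothetical.

```python
from statistics import mean, pstdev


def lag_consistency(t_pos, t_neg, tol=0.0):
    """Summarise how well positive- and negative-lag picks agree.

    t_pos, t_neg: picked travel times in seconds, one pair per cross-correlation.
    tol: differences larger than this count as "different" (assumed threshold).
    """
    diffs = [abs(p - n) for p, n in zip(t_pos, t_neg)]
    n_different = sum(d > tol for d in diffs)
    return {
        "diffs": diffs,
        "fraction_different": n_different / len(diffs),
        "mean_diff": mean(diffs),
        "std_diff": pstdev(diffs),   # population standard deviation (assumption)
    }


# Hypothetical picks for four cross-correlations (not data from the study).
t_pos = [10.0, 20.0, 30.0, 40.0]
t_neg = [10.0, 20.0, 31.0, 40.0]
summary = lag_consistency(t_pos, t_neg)
```

Grouping `diffs` by interstation distance or by SNR before counting would give histograms of the kind described for Fig. 9a and b.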
To further demonstrate the performance of our model, we compared its ability to pick travel time with that of two well-known neural network (NN) phase pickers, PhaseNet (Zhu and Beroza 2019) and EQTransformer (Mousavi et al. 2020). PhaseNet is a U-net based NN model, and EQTransformer, which has state-of-the-art earthquake detection and phase picking performance, is a model combining CNN, RNN and Transformer. We tested 200 ambient noise cross-correlations with different SNRs and interstation distances, and fed the positive and negative lags of each cross-correlation separately into the models. Although the two models can pick a phase near the travel time, there was a significant deviation between the picked phase and the travel time, and the lower the SNR of the data, the greater the deviation (the average deviation was 37.88 s). Figure 10 shows the result of picking the travel time of one lag with PhaseNet, EQTransformer and our model, from which it can be seen that the travel times picked by the NN phase pickers deviated from the real travel time. Although EQTransformer has state-of-the-art phase picking performance, on data with high SNR the phase picked by PhaseNet was closer to the true travel time. For ambient noise cross-correlations with low SNR, both models identified the data as noise; the lowest SNRs at which EQTransformer and PhaseNet could pick phases were ~ 11.31 and ~ 14.07, respectively. In addition, when the interstation distance is small and the travel time is close to zero, the models did not detect the phase well. We think that these results occurred because phase pickers are designed to pick body waves and therefore do not work well with surface waves.
The self-attention mechanism of the Transformer encoder is a crucial part of our model: it makes the model pay more attention to the travel time part of the cross-correlation and also reduces the influence of noise to a certain extent. The cross-correlation is first processed by the CNN downsampling block. The length of the sequence is reduced from 2400 to 600, which means that the attention vector output by the encoder at each time step corresponds to 4 samples in the original data. Figure 11 shows the attention distribution of multiple cross-correlations after they pass through the attention block. For ease of observation, we have scaled up the original values. The darker the color in the bar below the cross-correlation, the more attention the corresponding part of the data has received. For different cross-correlations, the dark areas are concentrated in the travel time part, indicating that the attention mechanism plays a role in the model.
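The correspondence between the 600 encoder time steps and the 2400 original samples can be made explicit with a small sketch. It assumes that each time step maps to one contiguous, non-overlapping block of 4 samples; the exact alignment in the model depends on the kernel sizes and padding of the CNN block, which are not restated here. The helper names `step_to_samples` and `expand_attention` are hypothetical.

```python
def step_to_samples(step, n_samples=2400, n_steps=600):
    """Return the half-open range [start, stop) of original samples
    covered by one encoder time step, assuming contiguous blocks."""
    factor = n_samples // n_steps        # 4 for 2400 -> 600
    start = step * factor
    return start, start + factor


def expand_attention(attn, n_samples=2400):
    """Repeat each attention value so that the attention bar has the same
    length as the original cross-correlation and can be plotted below it."""
    factor = n_samples // len(attn)
    return [a for a in attn for _ in range(factor)]


first_block = step_to_samples(0)
last_block = step_to_samples(599)
bar = expand_attention([0.1, 0.9], n_samples=8)
```

An expanded bar of this kind is what allows the attention values to be drawn directly under the waveform, as in Fig. 11.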
To illustrate the formation of self-attention in more detail, Fig. 12 shows the attention map of the 600 × 600 2-D attention matrix for the data in Fig. 11a; the values in the matrix are again scaled up. The darker the color in the attention map, the higher the attention score. Since each sample corresponds to a vector after passing through the downsampling block, a sequence corresponds to a matrix. Multiplying this matrix by its own transpose and then applying the Softmax operation to each row gives the self-attention map in Fig. 12b, and summing each column of the self-attention matrix gives the attention bar in Fig. 12a. Each value in the self-attention map represents the strength of the relationship between the sample of the cross-correlation at the row position of the value and the sample at its column position. The values on the diagonal of the matrix are large because each sample in the cross-correlation is strongly related to itself. In the self-attention map, the region between the red dashed lines, which corresponds to the travel time part, is clearly darker than the rest.
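The construction just described (matrix times its own transpose, row-wise Softmax, column sums) can be written out in a few lines. This is a simplified sketch of that description only: it omits the learned query, key and value projections, the scaling factor and the multiple heads of a full Transformer encoder, and the function names and the three-step toy input are hypothetical. A practical implementation would use a tensor library rather than nested lists.

```python
import math


def softmax(row):
    """Numerically stable softmax of one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(X):
    """Self-attention map of a sequence X (list of feature vectors):
    softmax applied to each row of X times its transpose."""
    scores = [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]
    return [softmax(row) for row in scores]


def attention_bar(A):
    """Sum each column of the attention map: the total attention
    that every time step receives from all time steps."""
    return [sum(col) for col in zip(*A)]


# Toy sequence of three time steps with two features each; the third
# step has the largest amplitude and should attract the most attention.
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
A = attention_map(X)
bar = attention_bar(A)
```

For a real input the sequence would have 600 time steps, so `A` would be the 600 × 600 matrix and `bar` the 600-element attention bar.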
Conclusion
The travel time of seismic ambient noise cross-correlations is important in several areas of geoscience research. However, it is difficult to accurately extract the travel time from a noise cross-correlation function with low SNR. Based on the widely used CNN and Transformer, we present a deep neural network model to pick the travel time of ambient noise cross-correlations. This model allows us to obtain accurate travel times. After being trained with 100,000 data from the ANCC-CIEI dataset combined with the recursive STA/LTA method, it performs well on the Global Empirical Greens Tensors dataset, which contains ambient noise cross-correlations at the global and continental scales. The travel time of ambient noise cross-correlations at different scales can be picked. The model can also pick the travel time of cross-correlations with low SNR: the lowest SNR among the cross-correlations whose travel times can be picked is 9.3, which shows that our method is useful for low SNR data. We also selected some cross-correlations to test the accuracy of the travel times picked by the model, and 97.2% of the picked travel times were accurate. After the cross-correlation function is filtered and shifted, the travel time can still be picked accurately. For cross-correlations at different scales, our method still performs well. The self-attention mechanism works well and focuses the model's attention on the travel time part of the cross-correlation. Compared with a CNN alone and with a traditional method, the model shows a significant improvement on cross-correlations with low SNR. In general, the travel time picked by our model is reliable and accurate. Because of this accuracy, our method can be applied to clock error measurements of stations, geological structure inversion between two stations, exploration of the noise source distribution around stations, body wave extraction and other applications.
Data and resources
The two ambient noise cross-correlation datasets, ANCC-CIEI (http://ds.iris.edu/ds/products/ancc-ciei/, last accessed June 2022) and Global Empirical Greens Tensors (http://ds.iris.edu/ds/products/globalempiricalgreenstensors/, last accessed June 2022), which are used for training and testing, respectively, can be downloaded from IRIS. We used PyTorch, a deep-learning framework for Python, to train the model (the latest version of PyTorch is available at https://pytorch.org/, last accessed June 2022). Most figures were generated using Matplotlib (Hunter 2007), a comprehensive library for creating visualizations in Python (https://matplotlib.org/, last accessed June 2022).
References
Bensen GD, Ritzwoller MH, Barmin MP, Levshin AL, Lin F, Moschetti MP, Shapiro NM, Yang Y (2007) Processing seismic ambient noise data to obtain reliable broad-band surface wave dispersion measurements. Geophys J Int 169(3):1239–1260. https://doi.org/10.1111/j.1365-246X.2007.03374.x
Brenguier F, Shapiro NM, Campillo M, Ferrazzini V, Duputel Z, Coutant O, Nercessian A (2008) Towards forecasting volcanic eruptions using seismic noise. Nat Geosci 1(2):126–130
Chai C, Maceira M, Santos-Villalobos HJ, Venkatakrishnan SV et al (2020) Using a deep neural network and transfer learning to bridge scales for seismic phase picking. Geophys Res Lett 47:e2020GL088651
Chen KX, Gung Y, Kuo BY, Huang TY (2018) Crustal magmatism and deformation fabrics in northeast Japan revealed by ambient noise tomography. J Geophys Res Solid Earth 123(10):8891–8906
Djebbi R, Alkhalifah T (2014) Traveltime sensitivity kernels for wave equation tomography using the unwrapped phase. Geophys J Int 197(2):975–986
Gouédard P, Seher T, McGuire JJ, Collins JA, van der Hilst RD (2014) Correction of ocean-bottom seismometer instrumental clock errors using ambient seismic noise. Bull Seismol Soc Am 104(3):1276–1288
Grêt A, Snieder R, Scales J (2006) Time-lapse monitoring of rock properties with coda wave interferometry. J Geophys Res Solid Earth. https://doi.org/10.1029/2004JB003354
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp 770–778
Huang L, Li J, Hao H, Li X (2018) Micro-seismic event detection and location in underground mines by using convolutional neural networks (CNN) and deep learning. Tunn Undergr Space Technol 81:265–276
Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9(3):90–95
Lin FC, Moschetti MP, Ritzwoller MH (2008) Surface wave tomography of the western United States from ambient seismic noise: Rayleigh and Love wave phase velocity maps. Geophys J Int 173(1):281–298
Luo Y, Yang Y, Xie J, Yang X, Ren F, Zhao K, Xu H (2020) Evaluating uncertainties of phase velocity measurements from cross-correlations of ambient seismic noise. Seismol Res Lett 91(3):1717–1729
Mousavi SM, Zhu W, Sheng Y, Beroza GC (2019) CRED: A deep residual network of convolutional and recurrent units for earthquake signal detection. Sci Rep 9(1):1–14
Mousavi SM, Ellsworth WL, Zhu W, Chuang LY, Beroza GC (2020) Earthquake transformer—an attentive deep-learning model for simultaneous earthquake detection and phase picking. Nat Commun 11(1):3952
Novoselov A, Balazs P, Bokelmann G (2022) Separating and denoising seismic signals with dual-path recurrent neural network architecture. J Geophys Res Solid Earth 127:e2021JB023183
Paszke A, Gross S, Massa F, Lerer A, Bradbury J et al (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems, vol 32
Perol T, Gharbi M, Denolle M (2018) Convolutional neural network for earthquake detection and location. Sci Adv 4(2):e1700578
Shapiro NM, Campillo M, Stehly L, Ritzwoller MH (2005) High-resolution surface-wave tomography from ambient seismic noise. Science 307(5715):1615–1618
Shen Y, Ren Y, Gao H, Savage B (2012) An improved method to extract very-broadband empirical Green’s functions from ambient seismic noise. Bull Seismol Soc Am 102(4):1872–1877
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Song W, Feng X, Wu G, Zhang G, Liu Y, Chen X (2021) Convolutional neural network, res-unet++, -based dispersion curve picking from noise cross-correlations. J Geophys Res Solid Earth 126(11):e2021JB022027
Stehly L, Campillo M, Shapiro NM (2006) A study of the seismic noise from its long-range correlation properties. J Geophys Res Solid Earth 111(B10):B10306
Stehly L, Campillo M, Shapiro NM (2007) Traveltime measurements from noise correlation: stability and detection of instrumental time-shifts. Geophys J Int 171(1):223–230
Tsai VC (2009) On establishing the accuracy of noise tomography travel-time measurements in a realistic medium. Geophys J Int 178(3):1555–1564
Vaswani A, Shazeer N, Parmar N, Uszkoreit J et al (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30
Viens L, Van Houtte C (2020) Denoising ambient seismic field correlation functions with convolutional autoencoders. Geophys J Int 220(3):1521–1535
Wegler U, Sens-Schönfelder C (2007) Fault zone monitoring with passive image interferometry. Geophys J Int 168(3):1029–1033
Xie J, Chu R, Ni S (2020) Relocation of the 17 June 2017 Nuugaatsiaq (Greenland) landslide based on Green’s functions from ambient seismic noises. J Geophys Res Solid Earth 125(5):e2019JB018947
Yang X, Bryan J, Okubo K, Jiang C, Clements T, Denolle MA (2022) Optimal stacking of noise cross-correlation functions. Geophys J Int 232(3):1600–1618
Ye F, Lin J, Shi Z, Lyu S (2018) Monitoring temporal variations in instrument responses in regional broadband seismic network using ambient seismic noise. Geophys Prospect 66(5):1019–1036
Zhang Y, Li H, Huang Y, Liu M, Guan Y, Su J, Wang T (2020a) Shallow structure of the Longmen Shan fault zone from a high-density, short-period seismic array. Bull Seismol Soc Am 110(1):38–48
Zhang X, Jia Z, Ross ZE, Clayton RW (2020b) Extracting dispersion curves from ambient noise correlations using deep learning. IEEE Trans Geosci Remote Sens 58(12):8932–8939
Zhou Y, Yue H, Kong Q, Zhou S (2019) Hybrid event detection and phase-picking algorithm using convolutional and recurrent neural networks. Seismol Res Lett 90(3):1079–1087
Zhu W, Beroza GC (2019) PhaseNet: a deep-neural-network-based seismic arrival-time picking method. Geophys J Int 216(1):261–273
Zhu W, Mousavi SM, Beroza GC (2019) Seismic signal denoising and decomposition using deep neural networks. IEEE Trans Geosci Remote Sens 57(11):9476–9488
Acknowledgements
We thank IRIS for providing the dataset which we used to train and test the model. This research is financially supported by Zhejiang Provincial Natural Science Foundation of China under Grant No. LQ20D040002.
Funding
This work was supported by Zhejiang Provincial Natural Science Foundation of China under Grant No. LQ20D040002.
Ethics declarations
Conflict of interest
The authors acknowledge there are no conflicts of interest recorded.
Additional information
Edited by Dr. Qamar Yasin (ASSOCIATE EDITOR) / Prof. Gabriela Fernández Viejo (CO-EDITOR-IN-CHIEF).
About this article
Cite this article
Jin, C., Ye, F., Zhang, H. et al. Travel time picking of ambient noise cross-correlation using a deep neural network combining convolutional neural networks and Transformer. Acta Geophys. 72, 97–114 (2024). https://doi.org/10.1007/s11600-023-01088-3
DOI: https://doi.org/10.1007/s11600-023-01088-3