1 Introduction

Considerable improvement in the monitoring capability of seismic stations has enabled seismometers to record numerous seismic events. On-duty officers can identify three types of seismic events from seismic waveforms: natural earthquakes (NEs), blasting events, and collapse events. Rapid economic development has increased the number of non-NE events, such as small-equivalent blasting events and collapses, which poses new challenges to emergency decision-making by local governments and to seismological research. First, the inclusion of non-NE events in the NE catalog (Kortstrom et al. 2016; Wei et al. 2019) results in inaccurate earthquake risk assessment and biases the basic data used for earthquake prediction. Second, compared with NEs, non-NE events have a relatively shallow focus and are often felt more intensely at the surface, so they are likely to cause higher casualties and greater social impact and could even trigger a large public outcry (Qian 2014; Xu et al. 2020). Therefore, accurately capturing the waveform characteristics of the various types of seismic events and quickly determining event types are very important for decision-making by governmental emergency rescue departments and for seismological research.

Since the 1950s, researchers have conducted extensive and in-depth research on the characteristics of non-NE waveforms, including the first-motion polarity of the P-wave, the amplitude ratio of the P-wave to the S-wave, the maximum amplitude over the duration of the waveform, surface wave clarity, and cepstrum analysis (Liu et al. 2012; Wang et al. 2013, 2017; Zhou et al. 2021a, b). NEs, blasting events, and collapse events differ clearly in their waveform characteristics (Fig. 1). Blasting events have strong P-waves but relatively weak S-waves, whereas NEs have strong S-waves. Therefore, the ratio of the maximum amplitude of the vertical-component P-wave to the maximum amplitude of the S-wave can reflect the differences between blasting events and NEs. Because of their shallow focal depth, blasting events and collapses excite Rayleigh waves that propagate along the ground surface. At relatively short epicentral distances, both blasting events and collapses show obvious surface waves, whereas NEs recorded within 200 km of the epicenter generally do not.
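
As a rough illustration of the P-to-S amplitude ratio mentioned above, the following Python sketch takes the peak absolute amplitude in a window after each phase arrival and divides one by the other. The function name, the fixed-length windows, and the choice of which trace carries the S-wave are assumptions made for this example and are not taken from the cited studies.

```python
def ps_amplitude_ratio(p_trace, s_trace, p_idx, s_idx, win):
    """Peak |P| amplitude divided by peak |S| amplitude.

    p_trace: samples of the vertical component (used for the P-wave)
    s_trace: samples of the trace used for the S-wave (an assumption)
    p_idx, s_idx: sample indices of the P and S arrivals
    win: window length in samples searched after each arrival
    """
    p_max = max(abs(x) for x in p_trace[p_idx:p_idx + win])
    s_max = max(abs(x) for x in s_trace[s_idx:s_idx + win])
    if s_max == 0:
        raise ValueError("S-wave window contains only zeros")
    return p_max / s_max


# Synthetic traces: weak P-wave, strong S-wave (NE-like behaviour).
vertical = [0.0, 0.1, 2.0, -1.5, 0.2, 0.0, 0.0, 0.0]
horizontal = [0.0, 0.0, 0.0, 0.0, 0.5, -4.0, 3.0, 0.1]
ratio = ps_amplitude_ratio(vertical, horizontal, p_idx=1, s_idx=4, win=3)
```

A ratio well below 1 would point toward an NE-like record and a ratio above 1 toward a blasting-like record, although any decision threshold would have to be calibrated on real data.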

Fig. 1 Comparison of waveforms between an NE (a), a blasting event (b), and a collapse (c)

The development of seismic big data and artificial intelligence has enabled the application of machine learning to non-NE identification. The model parameters are trained with the characteristics of seismic data, and the classification model is obtained and used for event type discrimination (Bian 2002; Huang et al. 2010; Bi et al. 2011; Zhao et al. 2017; Fan et al. 2019; Liu et al. 2020; Cai et al. 2020). The main difficulty with traditional machine learning is that waveform features must be manually selected and then extracted using a specific technique, which may result in the loss of other valuable features in the data (Zhang et al. 2021; Zhou et al. 2021a, b).

In recent years, deep learning technology has developed rapidly and has been widely used in image recognition, speech recognition, and autonomous driving. Seismology researchers have applied convolutional neural networks (CNNs) to non-NE identification and other fields (Chen et al. 2018; Ross et al. 2018; Zhou et al. 2021a, b; Duan 2021; Li et al. 2021). Deep learning differs from traditional machine learning in two respects. First, feature extraction requires less manual intervention: features are extracted automatically from the waveform data, so more of the valuable information is retained. Second, most of the models used are deep neural networks, which results in a relatively high model complexity (Zhang et al. 2021).

This study was performed on seismic events recorded at five or more stations in Shandong during 2017–2022, including 1615 NEs, 1578 blasting events, and 193 collapse events. A CNN model was established and trained on images of three-component seismic waveforms to obtain a waveform image classifier that identifies NEs, blasting events, and collapses in Shandong. Because collapse and blasting waveforms are similar in some characteristics, some collapse events are misidentified as blasting events. To address this, the time–frequency spectrum (T–FS) of the waveform data was generated using a generalized S-transform and used as the CNN input, and the trained T–FS image classifier was used to accurately distinguish between collapse and blasting events. This approach is referred to as a two-step CNN: a waveform image classifier is used in step 1 to identify an event, and a T–FS image classifier is used in step 2 to further identify collapse events that were misidentified as blasting events in step 1, thus improving the identification accuracy.

2 CNN

The CNN was first proposed by Professor LeCun at the University of Toronto, Canada (Li et al. 2017). A CNN is a multilayer neural network that first uses convolutional layers to extract features and then uses fully connected layers as classifiers. CNNs are mainly used for image recognition (Lecun and Bottou 1998; Sermanet et al. 2012). The CNN structure consists mainly of an input layer, hidden layers, and an output layer. The hidden layers usually include convolutional layers, pooling layers, and fully connected layers.

  1. The convolutional layer mainly extracts local regional features. Different convolution kernels can be considered as different feature extractors. The sparse connection and shared weight mechanism of this layer reduces the number of required parameters, decreases the complexity and computational load of the network, and improves the training speed. The data from the previous layer are convolved with the convolution kernel, and a nonlinear activation function is then applied to the result before the data are passed to the next layer. Nonlinear activation functions, such as ReLU, enhance the nonlinear characteristics of the network.

  2. The pooling layer is typically used to reduce the size of the convolutional layer output, decrease the computational load of the model, and improve the generalization ability of the model. The data from the previous layer are downsampled according to a given local window size. Typical downsampling modes include maximum pooling and average pooling.

  3. The fully connected layer is equivalent to a classifier, in which each neuron is connected to all neurons of the previous layer, and its output is passed to the output layer after the activation function has been applied. Hinton et al. (2012) proposed a dropout strategy for the fully connected layer that randomly inhibits some neurons in the training phase, so that only some neurons are updated in each training step, which effectively alleviates overfitting of the fully connected layer (Li et al. 2021).
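
The three building blocks above can be sketched in a few lines of plain Python. This is a minimal illustration on a tiny single-channel array and not the implementation used in this study: the function names are hypothetical, the convolution uses no padding, and the dropout shown is the commonly used "inverted" variant that rescales the surviving values.

```python
import random


def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(feature_map):
    """Keep positive values and set negative values to zero."""
    return [[max(v, 0.0) for v in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=2):
    """Downsample by taking the maximum of each size x size window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [[max(feature_map[i * stride + a][j * stride + b]
                 for a in range(size) for b in range(size))
             for j in range(out_w)]
            for i in range(out_h)]


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


image = [[1, 2, 0, 1],
         [0, 1, 3, 1],
         [2, 1, 0, 0],
         [1, 0, 1, 2]]
kernel = [[1, 0],
          [0, -1]]          # responds to diagonal differences

features = relu(conv2d_valid(image, kernel))
pooled = max_pool(features, size=2, stride=1)
dropped = dropout([1.0] * 10, rate=0.5, rng=random.Random(0))
```

Here the same 2 × 2 kernel is reused at every image position, which is the weight sharing referred to in item 1, and the pooling step shrinks the 3 × 3 feature map to 2 × 2.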

3 Generalized S-transform

Stockwell, an American geophysicist, proposed the S-transform based on the short-time Fourier transform and wavelet transform (Stockwell et al. 1996). The S-transform overcomes the fixed time window of the short-time Fourier transform and has the multiresolution characteristics of the wavelet transform. However, the way its time window varies with frequency is governed by a fixed relation that cannot be adjusted, which limits the flexibility of this transform for practical applications. Therefore, many scholars have improved the S-transform in different ways. In this study, the T–FS of a seismic record (Wei et al. 2022) was obtained using the improved generalized S-transform of the signal \(u(t)\) given below:

$$S(\tau ,f)={\int }_{-\infty }^{\infty }u(t)w(\tau -t,f){e}^{-i2\pi ft}dt$$
(1)

where \(S(\tau ,f)\) is the T–FS of the signal \(u(t)\) and \(w(t,f)\) is the Gaussian window function given below:

$$w(t,f)=\frac{1}{\sigma \sqrt{2\pi }}{e}^{\frac{-{t}^{2}}{2{\sigma }^{2}}}$$
(2)

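where \(\sigma =\frac{1}{\lambda {\left|f\right|}^{p}}\) is a scale factor that controls the width of the Gaussian window and \(p=p(f)=a+bf\) is a frequency-dependent exponent. With \(\lambda =1\) and \(p=1\), the window reduces to that of the standard S-transform.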
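
To make Eqs. (1) and (2) concrete, the sketch below evaluates the integral as a direct discrete sum for a list of frequencies. It is a minimal, unoptimized illustration: the function name and the default values `lam=1.0`, `a=1.0`, `b=0.0` (which give the standard S-transform window) are assumptions, and the parameter values used in this study are not stated here. The frequencies must be nonzero.

```python
import cmath
import math


def generalized_s_transform(u, dt, freqs, lam=1.0, a=1.0, b=0.0):
    """Discrete form of Eq. (1) with the Gaussian window of Eq. (2).

    Returns S[k][j] for frequency freqs[k] and time shift tau_j = j * dt.
    """
    n = len(u)
    spectrum = []
    for f in freqs:
        p = a + b * f                          # p(f) = a + b f
        sigma = 1.0 / (lam * abs(f) ** p)      # window width, f must be nonzero
        norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
        # u(t) * exp(-i 2 pi f t) * dt, evaluated once per frequency
        mod = [u[m] * cmath.exp(-2j * math.pi * f * m * dt) * dt
               for m in range(n)]
        row = []
        for j in range(n):
            total = 0j
            for m in range(n):
                t_rel = (j - m) * dt           # tau - t
                total += mod[m] * norm * math.exp(-t_rel ** 2 / (2.0 * sigma ** 2))
            row.append(total)
        spectrum.append(row)
    return spectrum


# A 5 Hz cosine sampled at 100 Hz for 2 s.
dt = 0.01
signal = [math.cos(2.0 * math.pi * 5.0 * m * dt) for m in range(200)]
freqs = [2.0, 5.0, 10.0]
tfs = generalized_s_transform(signal, dt, freqs)
mid_column = [abs(row[100]) for row in tfs]    # |S| at tau = 1.0 s
```

For this test signal the magnitude at the middle of the record is largest in the 5 Hz row, as expected. A practical implementation would normally use an FFT-based formulation instead of the double loop.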

4 CNN design and training

4.1 Network model

In this study, a 7-layer network model was designed with four convolutional layers and three pooling layers. All convolution kernels are \(3\times 3\) with a step size of \(1\times 1\). The first and second convolutional layers each have 64 output filters, and each is followed by a maximum pooling layer with a window size of \(2\times 2\) and a step size of \(1\times 1\). The third and fourth convolutional layers each have 32 filters and are followed by a maximum pooling layer with a window size of \(2\times 2\) and a step size of \(1\times 1\). These are followed by three fully connected layers: the first and second have 128 and 32 neurons, respectively, and the third has three neurons, one per output type, and uses the softmax activation function.
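
The layer sequence described above can be traced with simple shape arithmetic. The sketch below is only an illustration under several assumptions that the text does not confirm: convolutions without zero padding, a three-channel input image of height 106 and width 160, and one pooling layer after each of the first two convolutions plus one after the fourth. All function names are hypothetical.

```python
def conv_out(shape, filters, kernel=3, stride=1):
    """Output shape (height, width, channels) of an unpadded convolution."""
    h, w, _ = shape
    return ((h - kernel) // stride + 1, (w - kernel) // stride + 1, filters)


def pool_out(shape, window=2, stride=1):
    """Output shape of a pooling layer; the channel count is unchanged."""
    h, w, c = shape
    return ((h - window) // stride + 1, (w - window) // stride + 1, c)


def conv_params(in_channels, filters, kernel=3):
    """Weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def trace_network(input_shape):
    """Return a per-layer shape log and the total number of trainable parameters."""
    layers = [("conv", 64), ("pool", None), ("conv", 64), ("pool", None),
              ("conv", 32), ("conv", 32), ("pool", None)]   # assumed ordering
    shape, n_params, log = input_shape, 0, []
    for kind, filters in layers:
        if kind == "conv":
            n_params += conv_params(shape[2], filters)
            shape = conv_out(shape, filters)
        else:
            shape = pool_out(shape)
        log.append((kind, shape))
    features = shape[0] * shape[1] * shape[2]               # flatten
    for units in (128, 32, 3):                              # fully connected layers
        n_params += (features + 1) * units
        features = units
        log.append(("dense", (units,)))
    return log, n_params


log, n_params = trace_network((106, 160, 3))
for kind, shape in log:
    print(kind, shape)
```

Under these assumptions each \(3\times 3\) convolution removes two rows and two columns and each stride-1 pooling layer removes one, so the feature map entering the first fully connected layer is still close to the input size.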

The ReLU function was used as the nonlinear activation function for model training. The ReLU function simply transmits positive values and sets negative values to zero and is expressed below:

$$\mathrm{ReLU}(x)=\mathrm{max}(x,0)$$
(3)

The Adam stochastic optimization algorithm was used for model training (Kingma and Ba 2014). The cross-entropy loss function is typically employed in CNNs to measure the difference between the predicted class probabilities and the true labels in classification tasks. Therefore, the cross-entropy loss function was used in the training process in this study.
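
For reference, the softmax output and the cross-entropy loss for a single sample can be written as follows. This is a generic textbook sketch rather than the training code of this study; the class order (NE, blasting, collapse) and the example scores are assumptions.

```python
import math


def softmax(logits):
    """Convert raw output scores into probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_index])


# Assumed class order: 0 = NE, 1 = blasting, 2 = collapse.
probs = softmax([2.0, 1.0, 0.1])
loss_if_ne = cross_entropy(probs, 0)          # small: the model favours class 0
loss_if_collapse = cross_entropy(probs, 2)    # large: class 2 got a low probability
```

The loss is small when the network assigns a high probability to the labeled class and grows as that probability approaches zero, which is what the optimizer minimizes during training.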

4.2 Data analysis and training

A total of 165 observation stations in the Shandong seismic network were selected, and an analysis was performed on 3386 seismic events recorded by at least five stations between August 2017 and January 2022, including 1615 NEs, 1578 blasting events, and 193 collapses. To reduce the effect of human intervention on CNN feature extraction, the data were not screened before training, so all the seismic records of each event were included in the training. In the test, the identification results of all the single records of an event were counted, and the event was assigned a type when the proportion of records identified as that type was ≥ 0.5 and was the largest among the three event types. For each type of seismic event, approximately 80% of the data were used for the training set and 20% for the test set. The entire identification process comprised two steps. In step 1, the waveform image classifier was used for preliminary determination. In step 2, a T–FS image classifier was used to re-examine events that were identified as blasting events in step 1 although the records of some stations were identified as collapses. The training process of the waveform-based CNN in step 1 consisted of three main substeps (Fig. 2).

  1. Creation of a training sample library. The duration of the original waveform data depends on the epicentral distance and magnitude. To preserve the image ratio, all original waveforms were cut to the same length: the three-component data from 5 s before to 35 s after the P-wave arrival were extracted, and the waveforms of NEs, blasting events, and collapses were manually labeled and compiled.

  2. Image ratio selection. When images are used as the training set, the waveform characteristics that are captured depend on the image scale, and the pixel size affects how much waveform detail is preserved. To determine an appropriate image size, a pretest was conducted in which the same network model was trained on the same computer with different image sizes, and the accuracy for single records was calculated (Table 1). The No. 2 combination gave the highest average accuracy, so three-component waveform images generated at a pixel size of 160 × 106 were selected for training the waveform classifier.

  3. Determination of the event type. The waveform image classifier was used to identify seismic events. For a given event, if the same type was identified at more than 50% of the stations, that type was used as the classification result. If the seismic waveform characteristics are not obvious and two event types are identified at the same number of stations, both accounting for the largest proportion, the event is defined as an uncertain event.
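
The event-level decision in substep 3 amounts to a vote over the single-station results. The following sketch is one possible reading of that rule: the function name and the string labels are hypothetical, the 0.5 threshold follows the "≥ 0.5" wording used earlier in this section, and returning "uncertain" when no type reaches the threshold is an assumption, because the text only defines the tie case.

```python
from collections import Counter


def classify_event(station_labels, min_share=0.5):
    """Combine per-station labels into one event label by majority vote."""
    if not station_labels:
        raise ValueError("at least one station record is required")
    ranked = Counter(station_labels).most_common()
    top_label, top_count = ranked[0]
    # A tie for first place means the event type cannot be decided.
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if tied or top_count / len(station_labels) < min_share:
        return "uncertain"
    return top_label


clear_case = classify_event(["blasting"] * 5 + ["collapse"] * 2)
tie_case = classify_event(["NE", "NE", "blasting", "blasting"])
half_case = classify_event(["NE", "NE", "blasting", "collapse"])
```

In the first call five of seven records agree, so the event is labeled as blasting. The second call is an exact tie and is reported as uncertain, and the third shows that a share of exactly 0.5 is accepted when it is the unique maximum.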

Fig. 2 Flowchart of convolutional neural network training

Table 1 Accuracy and calculation time for test waveform images with different pixel sizes

Table 2 shows the identification accuracy of the waveform image classifier for the test set obtained using the process described above. The identification accuracy was greater than 95% for NEs and blasting events and was 73.68% for collapses. Among the nine misidentified collapse events, eight were misidentified as blasting events, and for these eight events, the waveform records from some stations were identified as collapses. Taking the collapse event shown in Fig. 3 as an example, the records in (b) and (c) were identified as collapses and the rest were identified as blasting; therefore, the event was identified as a blasting event. Such events need to be analyzed further to distinguish collapses from blasting events.

Table 2 Identification accuracy of the events between 2020 and 2022 using the waveform image classifier

Fig. 3 Vertical component of the ML1.7 collapse in Sishui County, Jining City, Shandong Province

The main frequency and bandwidth of the 1578 blasting events and 193 collapse events were calculated. Figure 4 shows the resulting statistics: the main frequency and bandwidth of most blasting events are larger than those of the collapse events. The T–FS of the blasting and collapse events were also calculated separately. Figure 5 shows obvious differences in the time–frequency domain between the two event types: a blasting event has a wider bandwidth and a higher main frequency than a collapse.

Fig. 4 Spectral features of blasting and collapse events (193 blasting events were randomly selected from the 1578 blasting events)

Fig. 5 Comparison of the time–frequency amplitude spectra of blasting events (a–c) and collapses (d–f) at different epicentral distances

The spectral feature describes the overall frequency distribution of the seismic waveform, whereas the T–FS shows how the frequency content varies over time and which frequency components are present at a given time. Identifying blasting events and collapses from the bandwidth and main frequency requires a threshold to be specified, and different thresholds lead to different classification results, which introduces a degree of subjectivity. The T–FS contains time–frequency domain information such as the bandwidth, the main frequency, and the energy distribution of the seismic waves. It therefore gives more accurate and more stable results and is easier to generalize to seismic event identification in other regions.
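
To illustrate why a threshold is needed when the main frequency and bandwidth are used directly, the sketch below computes both quantities from a plain discrete Fourier transform. How these quantities were computed in this study is not specified, so the definitions here are assumptions: the main frequency is taken as the frequency of the largest spectral amplitude, and the bandwidth as the span of frequencies whose amplitude is at least a fraction `level` of that peak.

```python
import cmath
import math


def amplitude_spectrum(signal, dt):
    """Naive DFT amplitude spectrum for the positive frequencies."""
    n = len(signal)
    freqs, amps = [], []
    for k in range(1, n // 2 + 1):             # skip the zero-frequency bin
        coeff = sum(signal[m] * cmath.exp(-2j * math.pi * k * m / n)
                    for m in range(n))
        freqs.append(k / (n * dt))
        amps.append(abs(coeff))
    return freqs, amps


def main_frequency_and_bandwidth(signal, dt, level=0.5):
    """Peak frequency and width of the band above `level` times the peak amplitude."""
    freqs, amps = amplitude_spectrum(signal, dt)
    peak = max(amps)
    main_freq = freqs[amps.index(peak)]
    above = [f for f, amp in zip(freqs, amps) if amp >= level * peak]
    return main_freq, max(above) - min(above)


# Synthetic record: a 5 Hz component plus a weaker 8 Hz component, 1 s at 100 Hz.
dt = 0.01
signal = [math.sin(2 * math.pi * 5 * m * dt) + 0.6 * math.sin(2 * math.pi * 8 * m * dt)
          for m in range(100)]
main_freq, bandwidth = main_frequency_and_bandwidth(signal, dt)
narrow = main_frequency_and_bandwidth(signal, dt, level=0.7)[1]
```

With `level=0.5` the weaker 8 Hz component is counted and the bandwidth is 3 Hz, whereas with `level=0.7` it is excluded and the bandwidth drops to zero. This sensitivity to the chosen level is the subjectivity described above.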

To exploit the difference between blasting and collapse in the frequency domain, the T–FS image was input into the network model presented in Sect. 4.1 to identify the two event types. Among the blasting events identified by the waveform image classifier, those that were identified as collapses in the records of some stations were reidentified by the T–FS image classifier.
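
The two-step logic can be summarized in a short sketch. The classifiers are passed in as plain callables that stand in for the trained waveform image and T–FS image classifiers; their names, the record format, and the use of the same station vote in step 2 are assumptions made for illustration.

```python
from collections import Counter


def majority(labels):
    """Unique most frequent label with a share of at least 0.5, else 'uncertain'."""
    ranked = Counter(labels).most_common()
    top, count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == count
    return "uncertain" if tied or count / len(labels) < 0.5 else top


def two_step_identify(records, waveform_clf, tfs_clf):
    """Step 1: vote on waveform images. Step 2: re-check suspicious blasting events."""
    step1 = [waveform_clf(r) for r in records]
    event = majority(step1)
    # Only events labeled as blasting with at least one 'collapse' record are re-examined.
    if event == "blasting" and "collapse" in step1:
        event = majority([tfs_clf(r) for r in records])
    return event


# Stand-in classifiers that read precomputed labels from each station record.
records = [
    {"wave": "blasting", "tfs": "collapse"},
    {"wave": "blasting", "tfs": "collapse"},
    {"wave": "collapse", "tfs": "collapse"},
    {"wave": "blasting", "tfs": "blasting"},
]
result = two_step_identify(records, lambda r: r["wave"], lambda r: r["tfs"])
```

In this example step 1 labels the event as blasting by three votes to one, the single collapse record triggers step 2, and the T–FS vote changes the final label to collapse.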

4.3 Results and analysis

The model was trained using 80% of the waveform images, and the remaining 20% of the dataset was used for testing. With the waveform image classifier alone, the identification accuracy rates of NEs, blasting events, and collapse events were 97.50%, 95.87%, and 73.68%, respectively. Table 3 shows the identification accuracy of the two-step CNN. After the second step with the T–FS image classifier, the accuracy rate was 95.87% for blasting events and 86.84% for collapses. The combined use of the two classifiers effectively improved the identification accuracy rate of collapses.

Table 3 Identification accuracy of the events between 2020 and 2022 based on two-step convolutional neural network

To illustrate the effectiveness of the two-step CNN method in identifying seismic event types, its identification accuracy was compared with that of manual classification by seismology experts, a support vector machine (SVM) (Fan et al. 2019), and a CNN using only waveform images as input, all on Shandong seismic network data. The comparison results are shown in Table 4.

Table 4 The average identification accuracy of seismic event classification in Shandong Province using different methods

For the seismic events recorded by the Shandong seismic network, the accuracy of manual classification of natural and non-natural seismic events was 92% (Zhou et al. 2021a, b). For the three types of seismic events in the Shandong seismic network, different CNN models have been used and their results compared (Zhou et al. 2021a, b), and the average accuracy of the best CNN model was 91.7%. In this paper, both the waveform images and the T–FS of the three types of events were used, and the average accuracy was 96.13%, a clear improvement in recognition accuracy.
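
The average accuracy quoted above can be checked against the per-class rates. The sketch below uses the test-set sizes given in Sect. 5 (320 NEs, 315 blasting events, and 38 collapses). The numbers of correctly identified events are not stated in the text and are inferred here by rounding, which is an assumption.

```python
test_counts = {"NE": 320, "blasting": 315, "collapse": 38}
reported = {"NE": 0.9750, "blasting": 0.9587, "collapse": 0.8684}

# Inferred number of correctly identified events per class (assumption: rounding).
correct = {k: round(reported[k] * n) for k, n in test_counts.items()}

# Overall accuracy weights each class by its number of test events.
overall = sum(correct.values()) / sum(test_counts.values())

# Unweighted mean of the three per-class rates, for comparison.
macro = sum(reported.values()) / len(reported)

print(correct, round(overall, 4), round(macro, 4))
```

The event-weighted value is about 96.1%, which agrees with the reported average, whereas the unweighted mean of the three class rates is about 93.4%. The reported figure therefore appears to be an overall accuracy dominated by the two large classes.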

The SVM method has also been used to identify NEs, blasting events, and collapses in the Shandong seismic network, with an average accuracy for the three types of events of 91.6–95%, depending on the parameters used.

5 Discussion and conclusions

NEs, blasting, and collapse events have different waveform characteristics. In this study, a 7-layer CNN was designed for non-NE identification. First, a waveform image was used as input to obtain a waveform image classifier to identify NE, blasting, and collapse events in Shandong. Second, to address the problems of misidentification of collapse as blasting, a T–FS image classifier was developed with T–FS images as the input, and a two-step CNN was used to achieve non-NE identification with good results. The following insights were gained.

  1. To highlight the differences in waveform characteristics, comparison tests were conducted to determine the image ratio and pixel size of the waveforms. Considering the identification accuracy and time cost, the images of three-component seismic waveforms generated at a pixel size of \(160\times 160\) were finally used.

  2. A test set of 673 seismic events was established consisting of 320 NEs, 315 blasting events, and 38 collapses. Two classifiers were used to identify event types, with an identification accuracy of 97.50% for NEs, 95.87% for blasting events, and 86.84% for collapses.

  3. The types of some ambiguous events cannot be identified by a CNN. For example, some manually tagged collapse events have a downward first motion of all records, resulting in very similar waveforms and T–FS characteristics to those of blasting events. Therefore, it is difficult to make a deterministic identification based on the waveform and T–FS characteristics alone, and more characteristics of similar events must be collated for use as constraints.

In summary, a two-step CNN can be used to identify non-NEs, where the waveform image classifier can achieve high identification accuracy in identifying NEs and blasting events, and the T–FS image classifier, as a complement to the waveform image classifier, can effectively improve the identification accuracy of collapse events. For the few events for which the type cannot be determined from the waveform and T–FS characteristics, specificity can be increased by the addition of constraints, such as the P-wave first-motion polarity.