Introduction

Sedimentology is a sub-discipline of geology that provides theoretical support for oil exploration. Mineral sources are analyzed through their chemical features to infer the evolutionary history of the Earth (Bruand et al. 2014). In general, minerals are susceptible to weathering, which alters their chemical elements and prevents accurate inference of sediment origin (Wang et al. 2016). Because zircon is widely distributed and stable, zircon dating is the standard method for determining sediment provenance (Watts et al. 2016).

Typically, zircon data are stored in tables where each column is a key-value pair. Key-value pairs from multiple tables in a study area are therefore integrated into one image based on the semantics of the keys, which visualizes the differences in the distribution of chemical features across formations. However, this variability is difficult to identify visually because a single formation contains hundreds of samples (Wilson et al. 2017). Quantitative analysis of zircon chemical features is therefore essential.

The chemical features of zircon are defined by sedimentologists, where one feature may contain multiple isotopes. Machine learning is applied to compute distances between zircon data and thus analyze the similarity of a single chemical feature (Bindeman and Melnik 2016). The similarities are then ranked, which ultimately improves the efficiency of zircon data processing. However, computing the similarity of a single chemical feature can lead to conflicting conclusions (Van Lankvelt et al. 2016).

Moreover, because a key may have several different natural language descriptions (e.g., U-pb and U-pb\(\Gamma\) are in semantic agreement), it is difficult to recognize the semantics of keys. As a result, most existing works are unable to recognize unknown keys that cannot be retrieved from the knowledge base (Chamiran et al. 2020).

NLP methods based on word embedding represent the semantics of a word by its contextual word frequencies within sentences, which greatly improves the accuracy of table key recognition (Eslahi et al. 2020). Millions of texts are used to train pre-trained models for accurate semantic representation (Yurin et al. 2021). However, pre-trained models such as Bidirectional Encoder Representations from Transformers (BERT) perform with low accuracy on zircon tables because the few labels available for table keys result in data sparsity (Yan et al. 2020). Furthermore, some keys in zircon tables are not real-world words (e.g., Age\(\gamma\)), so word embedding fails to represent their semantics accurately.

To solve these problems, a new framework named FM-zircon is proposed. First, character embedding extracts the semantics of each key to address data sparsity. Then, a Bi-directional Long Short-Term Memory (Bi-LSTM) network and a softmax layer classify the keys according to their semantics (Hochreiter and Schmidhuber 1997; Rao et al. 2019). After the key-value pairs are aggregated, a hybrid similarity over multiple chemical features is calculated to resolve conflicting conclusions. The main contributions of this article are as follows:

A hybrid chemical feature similarity calculation method is proposed to resolve the problem of conflicting conclusions.

Character embedding is used to extract the semantic features of the keys in the zircon table to improve recognition accuracy.

The rest of this paper is organized as follows. Related work is reviewed in Section 2. Section 3 introduces FM-zircon. Section 4 gives the experimental evaluation and comparative analysis. Conclusions are drawn in Section 5.

Related work

Due to the stability and wide distribution of zircons, zircon dating is the standard method for addressing key questions in sedimentology (Nemchin and Pidgeon 1997). The similarity of zircons measures the variability of chemical features across formations, and thousands of zircon tables must be aggregated to calculate it. Many existing studies combine zircon analysis with machine learning. Delaigle et al. (2008) introduced the probability density of the U-Pb age distribution to calculate similarity, which improves the accuracy of zircon data analysis. Saylor et al. (2009) proposed a method based on the Kolmogorov-Smirnov test (K-S test) for calculating age distribution variability with machine learning. However, these methods were unable to analyze zircon data quantitatively.

Vermeesch (2012) proposed reducing the dimension of K-S test results with Multidimensional Scaling (MDS), which improves the efficiency of calculating similarity. Sharman and Malkowski (2020) proposed a tool that analyzes all chemical characteristics, but the final conclusion must be drawn manually. Moreover, sedimentation introduces conflicts between different chemical features, which makes the similarity inaccurate (Yongvanich et al. 2019). In this paper, a hybrid similarity calculation strategy is proposed to relieve the conflicts between different chemical features by fusing multiple chemical features.

Moreover, zircon data are stored in tables where each column is a key-value pair. Thousands of tables are aggregated according to the semantics of their keys to calculate the similarity (Ahmed 2008). Because multiple natural language descriptions can link to one key, it is difficult to recognize table keys with a dictionary (Xu et al. 2010). Liu et al. (2005) built a knowledge base containing thousands of natural language descriptions grouped by semantics to recognize keys. Julthep et al. (Nandakwang and Chongstitvatana 2016) linked Wikipedia data to web tables to classify keys with the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm. These methods fail to recognize unknown words that are not included in the knowledge base.

Recently, deep learning has been widely applied to table key recognition because it extracts semantics accurately. Shaik et al. (2021) proposed a method based on a knowledge graph, where graph neural networks were trained on a large corpus to extract semantic links between keys and words. Luzuriaga et al. (2021) trained on Wikipedia to obtain links between words, which greatly improves the accuracy of table key recognition. Berant et al. (2019) classified keys by extracting semantic features of table contexts. These works are highly accurate on large corpora but suffer from data sparsity on small corpora. This paper uses character embedding to extract the semantic features of keys to address this problem.

Design of FM-zircon

In this section, FM-zircon, a novel framework for zircon similarity calculation, is introduced. Figure 1 illustrates the structure of FM-zircon. First, based on the character sequences of keys in the knowledge base, the NLP component classifies each input key, extracting dense semantic features so that unknown words can be detected. Then, columns with the same semantics are merged. Finally, a similarity computation method for hybrid zircon chemical features based on MDS is applied and the result is visualized.

Fig. 1 Structure of FM-zircon

Table key recognition

The overall process of table key recognition is shown in Fig. 2. First, following the character sequence of each key in the knowledge base, character embeddings are generated to model the semantic features of characters. Then, a Bi-LSTM extracts semantic features of the character sequence. Finally, a softmax layer classifies the input according to its key, so that dense semantic features of the input can be used to detect unknown words.

Fig. 2 The process of table key recognition, where age\(\gamma\) is labeled age

Character embedding of keys

Table key recognition is essentially a classification problem, for which embedding-based methods achieve good results. Word embedding is efficient for locating a word in semantic space but is unsuitable for sparse data. The number of keys in zircon tables is small (a few hundred), so word embedding faces sparse data and cannot represent the semantics accurately. Character embedding instead models a table key as a sequence of characters drawn from a small inventory (43 in total), which greatly relieves the sparsity of the data. In light of this, we use character embedding, a lightweight method, to extract the semantic features of characters. The details are as follows. Let \(Ch=\left\{ Ch_{1},Ch_{2},\ldots ,Ch_{M}\right\}\) represent the one-hot codes of the characters, \(KeyBase =\left\{ KeyBase_{1},KeyBase_{2},\ldots ,KeyBase_{N}\right\}\) represent the key set of the zircon tables, \(Chem = \left\{ Chem_{1},Chem_{2},\ldots ,Chem_{M}\right\}\) represent the character embeddings, and C represent the window size of the context; let \(Ch_{i}=\left\{ Ch_{i1},Ch_{i2},\ldots ,Ch_{iL}\right\}\) denote the character set of the \(i^{th}\) key \(KeyBase_{i}\).

First, one-hot codes initialize the weights of the model. Then, a fully connected layer extracts the semantics of the characters. Finally, a softmax layer normalizes the result, and the semantic features of all the characters in \(KeyBase_{i}\) are given by

$$\begin{aligned} p(ch_{io}\mid ch_{ij})=\frac{e^{Chem_{io}\cdot Chem_{ij}}}{\sum _{l=1}^{L}e^{Chem_{il}\cdot Chem_{ij}}}, \end{aligned}$$
(1)

where L denotes the length of \(KeyBase_{i}\).

In the training stage, an entropy loss function measures the difference between the semantic distributions. The formula is given by

$$\begin{aligned} Loss_{em}=-\sum \limits _{j}p(ch_{io}\mid ch_{ij})\log _{2}p(ch_{io}\mid ch_{ij}). \end{aligned}$$
(2)
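To make the encoding step above concrete, the following is a minimal PyTorch sketch of a skip-gram-style character embedding trained with a softmax over contexts; the character inventory, embedding dimension, and class name are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of the character-embedding step, assuming PyTorch.
# CHARS, EMB_DIM, and CharEmbedding are illustrative names, not from the paper.
import torch
import torch.nn as nn

CHARS = list("abcdefghijklmnopqrstuvwxyz0123456789-/()% ")  # small inventory, roughly the 43 symbols
EMB_DIM = 16

class CharEmbedding(nn.Module):
    """One-hot characters -> dense vectors, trained with a softmax over contexts (Eq. 1)."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.inp = nn.Embedding(vocab_size, dim)  # rows play the role of Chem
        self.out = nn.Linear(dim, vocab_size)     # scores over all characters

    def forward(self, context_ids: torch.Tensor) -> torch.Tensor:
        # log p(ch_io | ch_ij): normalized over the character vocabulary
        return torch.log_softmax(self.out(self.inp(context_ids)), dim=-1)

model = CharEmbedding(len(CHARS), EMB_DIM)
loss_fn = nn.NLLLoss()  # entropy-style loss over target characters (Eq. 2)
```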

Bi-LSTM for extracting semantics

After the characters are encoded, it is crucial to extract the potential link between character embeddings and keys. In recent years, Recurrent Neural Networks (RNN) have been widely applied to NLP tasks due to their ability to extract correlations between sequence elements. However, some table key sequences are longer than 20 characters, which leads to vanishing and exploding gradients. Compared with RNN, LSTM relieves the vanishing-gradient problem by filtering out useless information from the preceding text. Nonetheless, the correlation between language sequences is bidirectional, while an LSTM only extracts features of unidirectional text (Istiake Sunny et al. 2020). Bi-LSTM is composed of two LSTM blocks with opposite directions, which extracts semantic features in both directions. Thus, in this paper a Bi-LSTM extracts the features of the character sequence of a key. For a character embedding, the output of the forward LSTM is defined by

$$\begin{aligned} outf_{i}= LSTM(Chem_{i},outf_{i-1}), \end{aligned}$$
(3)

where \(outf_{i}\) encodes the semantic features of the preceding text. To extract the semantic features of the following text, a backward LSTM is used. The formula is defined by

$$\begin{aligned} outb_{i}= LSTM(Chem_{i},outb_{i+1}). \end{aligned}$$
(4)

The final semantic features \(out_{i}\) are defined by

$$\begin{aligned} out_{i}= outb_{i}+outf_{i}. \end{aligned}$$
(5)
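A hedged PyTorch sketch of the Bi-LSTM step (Eqs. 3-5) is given below; the sequence length and hidden size are illustrative assumptions. PyTorch concatenates the forward and backward outputs along the last dimension, so the two halves are split and summed as in Eq. 5.

```python
# Sketch of the bidirectional feature extraction (Eqs. 3-5), assuming PyTorch.
import torch
import torch.nn as nn

EMB_DIM, HID_DIM = 16, 32
bilstm = nn.LSTM(input_size=EMB_DIM, hidden_size=HID_DIM,
                 bidirectional=True, batch_first=True)

chems = torch.randn(1, 10, EMB_DIM)            # one key as 10 character embeddings
out, _ = bilstm(chems)                         # shape (1, 10, 2 * HID_DIM)
outf, outb = out[..., :HID_DIM], out[..., HID_DIM:]
out_i = outf + outb                            # Eq. 5: sum of both directions
```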

Softmax for key classification

Three fully connected layers normalize the output of the Bi-LSTM. Finally, softmax yields the probability that the input corresponds to each key. The formula of softmax is given by

$$\begin{aligned} P=\frac{e^{out_{i}}}{\sum _{j} e^{out_{j}}}, \end{aligned}$$
(6)

where P is the list of predicted probabilities.
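As a sketch, the classification head could look as follows; the hidden sizes are assumptions, while the 16 output classes match the 16 label kinds reported in the dataset section.

```python
# Sketch of the classification head: three fully connected layers + softmax (Eq. 6).
import torch.nn as nn

NUM_KEYS = 16  # 16 kinds of labels (see the Datasets section)
head = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),   # 32 = Bi-LSTM feature size assumed above
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, NUM_KEYS),
    nn.Softmax(dim=-1),             # P: probability of each candidate key
)
```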

Table aggregation

The process of table aggregation is shown in Algorithm 1, where l represents the number of tables, table[i][0] is the first row of table[i], and r is the number of columns of table[i][0]. First, for each table, the first row is scanned to obtain a key list. Then, FM-zircon checks whether the semantics of each table key can be retrieved from the knowledge base. If so, the key is matched directly; otherwise, the network classifies the key only when its predicted probability exceeds the threshold, which is set to 0.9. Finally, all key-value pairs are grouped by keys with the same semantics.

Algorithm 1 Table aggregation
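A hedged Python sketch of Algorithm 1, following the description above, is shown below. The 0.9 threshold mirrors the text; `knowledge_base` (a mapping from raw descriptions to canonical keys) and `classify` (the trained network, returning a key and its probability) are assumed interfaces.

```python
# Sketch of Algorithm 1 (table aggregation) under the description above.
THRESHOLD = 0.9

def aggregate(tables, knowledge_base, classify):
    groups = {}                                   # canonical key -> value columns
    for table in tables:                          # l tables in total
        header = table[0]                         # table[i][0]: first row holds keys
        for col, raw_key in enumerate(header):    # r columns
            if raw_key in knowledge_base:         # known key: match directly
                key = knowledge_base[raw_key]
            else:                                 # unknown key: ask the network
                key, prob = classify(raw_key)
                if prob <= THRESHOLD:
                    continue                      # too uncertain, skip the column
            values = [row[col] for row in table[1:]]
            groups.setdefault(key, []).append(values)
    return groups
```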

Hybrid similarity calculation

Due to weathering, a single chemical feature is not adequate for dating sediment. To obtain accurate chemical features, a hybrid similarity calculation is proposed, as shown in Fig. 3. First, a cumulative distribution is used to encode the isotope features. Then, all the feature vectors are summed in a 1:1 proportion. Finally, MDS calculates and visualizes the difference between two formations.

Fig. 3 The structure of hybrid similarity calculation

Isotope encoding

First, all isotope data are normalized to the range 0 to 100. Let \(Group=\left\{ group_{1},group_{2},\ldots ,group_{M}\right\}\) represent the set of formations and \(groupf_{k,b}\) represent the two-dimensional feature matrix of \(group_{i}\), where k indexes the isotope type and b indexes the normalized value; then \(groupf_{k,b}\) is defined by

$$\begin{aligned} groupf_{k,b}=\frac{Content}{L_{num}}, \end{aligned}$$
(7)

where Content is the number of zircon samples whose value is less than b and \(L_{num}\) is the total number of zircon samples.

Afterwards, groupf is fused along the k dimension into a final feature vector. The formula is given by

$$\begin{aligned} V(group_{i}) = \sum \limits _{k=1}^Kgroupf_{k}, \end{aligned}$$
(8)

where K is the number of isotope types.
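A minimal NumPy sketch of Eqs. (7) and (8) follows, under the assumption that each isotope series has already been normalized to [0, 100]; the bin layout and function name are illustrative.

```python
# Sketch of the isotope encoder (Eqs. 7-8), assuming NumPy.
import numpy as np

def encode_formation(isotopes):
    """isotopes: list of 1-D arrays, one per isotope type, scaled to [0, 100]."""
    bins = np.arange(101)                         # b = 0, 1, ..., 100
    feats = []
    for values in isotopes:
        values = np.asarray(values)
        # Eq. 7: fraction of zircons below each bin (an empirical CDF)
        cdf = np.array([(values < b).sum() for b in bins]) / len(values)
        feats.append(cdf)
    return np.sum(feats, axis=0)                  # Eq. 8: 1:1 sum over K isotopes
```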

MDS for calculating zircon similarity

The similarity calculation is intrinsically a dimensionality reduction problem, for which a method based on MDS achieves good results. According to (8), the difference between \(group_{i}\) and \(group_{j}\) is given by

$$\begin{aligned} dense_{i,j}=\Vert V(group_{i})-V(group_{j}) \Vert . \end{aligned}$$
(9)

After normalization, (9) can be further expressed as

$$\begin{aligned} p_{i,j}=1-\frac{dense_{i,j}}{dense_{longest}}, \end{aligned}$$
(10)

where \(p_{i,j}\) is the similarity between \(group_{i}\) and \(group_{j}\), and \(dense_{longest}\) is the longest distance between any two formations in Group.
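The following sketch pairs Eqs. (9) and (10) with scikit-learn's MDS for the 2-D visualization used in Fig. 5; the function name and the use of `sklearn.manifold.MDS` are assumptions about the implementation.

```python
# Sketch of the similarity step (Eqs. 9-10) with MDS for visualization.
import numpy as np
from sklearn.manifold import MDS

def similarity_and_layout(vectors):
    """vectors: one fused feature vector per formation, from Eq. 8."""
    vectors = np.asarray(vectors)
    diff = vectors[:, None, :] - vectors[None, :, :]
    dense = np.linalg.norm(diff, axis=-1)         # Eq. 9: pairwise distances
    p = 1.0 - dense / dense.max()                 # Eq. 10: similarity in [0, 1]
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(dense)
    return p, coords                              # coords feed the scatter plot
```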

Results

In this section, experiments are performed to demonstrate the effectiveness of FM-zircon. First, the dataset is described, and then the performance of table key recognition is studied. Next, a method to compare the performance of the hybrid similarity and single-feature similarities is proposed. Finally, zircon similarity and spatial information are visualized.

Datasets

We manually extracted 100 zircon tables from the sedimentological literature. The total number of keys is more than 300, with 16 kinds of labels. We chose 270 keys as the training set and 30 as the test set. The labels are shown in Table 1.

Table 1 Labels of table key recognition

Furthermore, we manually extracted three common chemical features in sedimentology from 50 papers: the U-pb feature, the Hf feature, and the Clastic Composition (CC) feature. Examples from the dataset are listed in Table 2.

Table 2 Labels of similarity calculation

In Table 2, ID represents the universal unique identifier of the record, paperid marks the extracted sedimentological literature, and formation1 and formation2 denote the two groups being compared in the literature. Hf, U-pb and CC represent the three chemical characteristics of zircon. Result denotes the final conclusion of the paper: true indicates that the two groups are similar in that chemical characteristic dimension, and false the opposite. For instance, the record with id 0 means that in paper 0, group Duba and group Dingqinghu are similar in the U-pb dimension.

Experiment for table key recognition

A comparison test of table key recognition was performed for different models, and the results are listed in Table 3. We compare our Character embedding+Bi-LSTM+softmax method with the following four table key recognition methods: a knowledge-base method, Word embedding+Bi-LSTM+softmax, BERT+softmax, and BERT+Bi-LSTM+softmax.

Table 3 Result for table key recognition

From Table 3, it can be seen that FM-zircon greatly improves the recall rate compared to the knowledge-base method. The reason is that character embedding extracts accurate semantic features to recognize unknown keys. Furthermore, the recall and accuracy of the pre-trained models are low because of data sparsity in small samples. Moreover, compared with word embedding, character embedding is more accurate because it alleviates data sparsity. This indicates that FM-zircon mitigates data sparsity.

Experiment for similarity calculation

A comparison test of similarity calculation was performed for different chemical features. The similarity of chemical features between two groups was calculated, and the groups were considered similar if it was greater than 0.75. Finally, the results were compared with the descriptions in the sedimentological literature, and the accuracy was calculated. The results are listed in Table 4.
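As a sketch, the evaluation protocol described above (a pair counts as similar when p > 0.75, and predictions are scored against the literature conclusions) can be written as follows; the data layout is an illustrative assumption.

```python
# Sketch of the evaluation: threshold similarities at 0.75 and score
# predictions against the literature labels (accuracy as in Table 4).
def evaluate(similarities, literature_labels, threshold=0.75):
    preds = [p > threshold for p in similarities]
    correct = sum(pred == label for pred, label in zip(preds, literature_labels))
    return correct / len(literature_labels)
```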

Table 4 Result for similarity calculation

As shown in Table 4, hybrid similarity improves accuracy compared to single chemical feature similarity. For instance, group 6 shows a 5\(\%\) increase over group 1. In addition, group 7, which merges all three chemical features, has the highest accuracy. Because weathering removes part of the chemical information of zircons, calculations based on a single chemical feature are inaccurate and different chemical features conflict with each other. Hybrid similarity fuses multiple chemical features to resolve these conflicts and improves the accuracy of zircon similarity.

Visualization

As illustrated in Fig. 4, all zircon data were aggregated and mapped onto BaiduMap based on formation locations, where each mark is a formation.

Fig. 4 Zircon data visualization

The similarity of zircons is shown in Fig. 5. The distance between two points represents their similarity, with a closer distance indicating higher similarity.

Fig. 5 Zircon similarity visualization

Conclusion

FM-zircon was designed to rapidly extract zircon age features. NLP extracts semantic features to classify table keys, and the hybrid similarity is calculated to alleviate conflicts between different isotopes, which is an important component of FM-zircon. In the future, we will work on extracting table structure from table contextual features with few-shot learning.