1 Introduction

The number of resources on online learning platforms is growing rapidly with the development of the Internet. At the same time, demand for more intelligent retrieval of course information on these platforms is increasing [1]. With the explosive spread of information, there is an urgent need for effective screening methods that not only find relevant information quickly and accurately but also uncover hidden, higher-value information in the data. As a result, various intelligent search technologies, especially those based on artificial intelligence, have been vigorously developed and widely used [2].

Natural language processing in intelligent search is concerned with enabling machines to understand and generate human language. The ultimate goal is for computers to understand language as well as humans do [3]. Natural language processing has two research branches: one based on grammatical rules and the other based on probability. With the popularity of finite state machines and empirical methods, coupled with improvements in computer storage capacity and computing speed, probability-based methods have helped natural language processing expand from a few early application fields, such as machine translation, to many more, such as information extraction and information retrieval. Processing techniques based on different rules have also been integrated and used by researchers. The continuous creation of corpora based on statistics, examples and rules has injected considerable vitality into natural language processing research [4]. Although China's research on natural language processing started late, the gap with international standards has been narrowing. Chinese corpora and knowledge bases are constantly being built, and advanced research results on semantic segmentation and syntactic analysis keep emerging [5].

Compared with English text classification, research on Chinese text classification started late but has developed extremely quickly. Over the past decades, domestic scholars have proposed many excellent classification algorithms, laying a solid foundation for the research and development of Chinese text classification. For example, Li Xiaoli and Liu Jimin of the Institute of Computing Technology of the Chinese Academy of Sciences applied a conceptual reasoning network to text classification [8], achieving a recall rate of 94.2% and an accuracy rate close to 99.4%. Reference [9] proposed a hypertext coordination classifier based on a study of KNN, Bayesian and document-similarity methods, with an accuracy rate close to 80%. Reference [10] studied language-independent text classification, using the mutual information between vocabulary and category as the scoring function and considering both single-class and multi-class classification, and reached a recall rate of 88.87%. Reference [11] combined word weights with classification algorithms; in closed-test experiments based on the vector space model (VSM), the classification accuracy reached 97%.

The release of the BERT model [6] in 2018 is considered the beginning of a new era in natural language processing (NLP). It broke the records on many natural language tasks and performed strongly across all the major NLP tasks. In recent years, research on Chinese text classification based on the BERT model has received extensive attention from scholars. Hu Chuntao [12] used a transfer learning strategy to apply the model to public opinion text classification; Yao Liang [13] used BERT and a domain-specific corpus to classify TCM clinical records; Zhang XH [14] used BERT-based entity extraction to extract breast cancer concepts and their attributes; Jwa H [15] proposed exBAKE, which uses the BERT model to analyze the relationship between news headlines and content in order to detect fake news. Some studies have also begun to build pre-trained models on Chinese literature. For example, Wang Yingjie et al. [16] followed the BERT pre-training process, using mainly Chinese encyclopedia text, to construct a pre-trained language representation model for scientific and technological text analysis, and used it in classification experiments.

When the BERT model calculates semantic similarity, both sentences must be fed into the model at the same time for information exchange, which causes considerable computational overhead. This paper therefore uses the pre-trained Sentence-BERT language model, combined with a concise and effective Siamese network, to perform keyword matching on the generated sentence vector features. Experimental simulations are carried out on MOOC online learning resources and State Grid online learning resources. The experimental results show that the Sentence-BERT model is better than the BERT model in both matching speed and matching accuracy.

2 The Sentence-BERT Model

The Sentence-BERT model [7] is an improvement on the BERT (Bidirectional Encoder Representations from Transformers) model. Although BERT and its enhanced variants have achieved good results on various sentence classification tasks and on sentence-pair regression tasks such as semantic textual similarity, their excessive computational overhead restricts their practical application. In addition, although BERT can map a sentence directly to a vector space and, with some further processing, produce a vector representing the semantics of the sentence, the results in practice are not ideal, and its structure makes it poorly suited to unsupervised tasks such as semantic similarity search and clustering. Reference [7] improves the BERT network and proposes the Sentence-BERT (SBERT) network structure, which uses a twin (Siamese) or triplet network structure to complement BERT, so that the generated sentence embedding vectors better represent semantic features and can be applied to large-scale semantic similarity comparison, clustering and semantic information retrieval.

2.1 Sentence Vector Generation Strategy

The Sentence-BERT model defines three strategies for obtaining sentence vectors: mean pooling, maximum pooling and the CLS vector. 1) Mean pooling: all word vectors in the sentence are averaged, and the mean vector is used as the sentence vector; 2) Maximum pooling: the maximum over all word vectors in the sentence is taken in each dimension, and the resulting vector is used as the sentence vector; 3) CLS vector: the output of the "CLS" token in BERT is used directly as the vector representation of the sentence. Reference [7] gives experimental results for the three sentence vector generation strategies, as shown in Table 1. The mean strategy gives the best results on the different data sets. Therefore, this article uses the mean strategy to obtain the feature vector of each courseware name.
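The three strategies can be illustrated with a minimal sketch in plain Python. It assumes the token vectors of one sentence are already available as a list of equal-length lists, with the first row standing in for the "CLS" position; the function names and toy values are illustrative only, and padding masks, which real implementations need, are ignored.

```python
from typing import List

Vector = List[float]


def mean_pooling(token_vectors: List[Vector]) -> Vector:
    """Average the token vectors dimension by dimension."""
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]


def max_pooling(token_vectors: List[Vector]) -> Vector:
    """Take the maximum over all tokens in every dimension."""
    return [max(dim) for dim in zip(*token_vectors)]


def cls_pooling(token_vectors: List[Vector]) -> Vector:
    """Use the first token's vector, assumed here to be the CLS position."""
    return list(token_vectors[0])


# Toy input: 3 tokens with 2-dimensional vectors (made-up values).
tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
print(mean_pooling(tokens))  # [2.0, 2.0]
print(max_pooling(tokens))   # [3.0, 4.0]
print(cls_pooling(tokens))   # [1.0, 4.0]
```

All three functions return a vector whose length equals the token vector dimension, so sentences of different lengths map to embeddings of the same fixed size.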

Table 1. Experimental comparison of three pooling strategies

2.2 Model Objective Function

Unlike the BERT model, the Sentence-BERT model uses a twin network or a triplet network to update the initial weight parameters of the model, so that the generated sentence embedding vectors carry semantic information. Different objective functions are set for different tasks.

Classification Objective Function

As can be seen from Fig. 1, the classification objective function takes the sentence embedding vectors \({\mathbf{u}}\) and \({\mathbf{v}}\), obtained by passing the two sentences through the BERT model and the pooling layer, together with their element-wise difference \(\left| {{\mathbf{u}} - {\mathbf{v}}} \right|\), concatenates them into a single vector, and multiplies the result by a trainable weight parameter \(W_{t} \in R^{3n \times k}\), where n is the dimension of the sentence vector and k is the number of categories. The cross-entropy loss function is used for training optimization. The classification objective function is defined as:

$$ o = \mathrm{softmax}\left( W_{t} \left( {\mathbf{u}}, {\mathbf{v}}, \left| {{\mathbf{u}} - {\mathbf{v}}} \right| \right) \right) $$
(1)
Fig. 1. Flow chart of classification objective function
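A minimal plain-Python sketch of Eq. (1) follows. It assumes the two sentence vectors are lists of length n and that the weight is stored as a list of 3n rows with k columns, applied so that the shapes agree (features times weight); the function names and toy values are illustrative and are not taken from the original implementation.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_output(u: List[float], v: List[float],
                          w_t: List[List[float]]) -> List[float]:
    """Concatenate (u, v, |u - v|), project with w_t (3n x k), apply softmax."""
    features = u + v + [abs(a - b) for a, b in zip(u, v)]  # length 3n
    k = len(w_t[0])
    logits = [sum(f * row[j] for f, row in zip(features, w_t)) for j in range(k)]
    return softmax(logits)


def cross_entropy(probs: List[float], label: int) -> float:
    """Cross-entropy loss for a single example with an integer class label."""
    return -math.log(probs[label])


# Toy example: n = 2, k = 3, all-zero weights (made-up values).
u, v = [1.0, 0.0], [0.0, 1.0]
w_zero = [[0.0, 0.0, 0.0] for _ in range(6)]
probs = classification_output(u, v, w_zero)
print(probs, cross_entropy(probs, label=0))
```

With all-zero weights every class receives the same probability, so the loss equals log k; training would adjust the weight and the underlying encoder to lower it.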

Regression Objective Function

The cosine similarity between the embedding vectors \({\mathbf{u}}\) and \({\mathbf{v}}\) of the two sentences is calculated; the calculation structure is shown in Fig. 2. The mean squared error loss function is used for training optimization.

Fig. 2. Flow chart of regression objective function
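The regression objective can be sketched in a few lines of plain Python: the cosine similarity of each sentence pair is the prediction, and the mean squared error against gold similarity scores is the loss. The function names and toy numbers are assumptions made for illustration.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mse_loss(predictions: List[float], targets: List[float]) -> float:
    """Mean squared error between predicted and gold similarity scores."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)


# Two toy sentence pairs with made-up embeddings and gold scores.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([1.0, 2.0], [2.0, 4.0])]
gold = [0.0, 1.0]
predicted = [cosine_similarity(u, v) for u, v in pairs]
print(predicted, mse_loss(predicted, gold))
```

Because the two sentences are encoded independently and compared only through this cheap vector operation, corpus embeddings can be computed once and reused for every query.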

2.3 Pre-training and Fine-Tuning of the Model

During pre-training, the SBERT model uses the joint data set AllNLI, which combines two data sets, SNLI and MultiNLI. SNLI has 570,000 manually labeled sentence pairs, with three labels: contradiction, entailment and neutral. MultiNLI is an upgraded version of SNLI with 430,000 sentence pairs, covering mainly spoken and written texts. The format and labels of the two data sets are the same.

In the experiment, SBERT is fine-tuned in each iteration with a 3-way softmax classification objective function. The batch size is set to 16, and the Adam optimizer with a learning rate of \(2 \times 10^{-5}\) is used for optimization. The sentence vector generation strategy defaults to the mean strategy.

3 Method of Generating Sentence Embedding Based on SBERT

First, an existing pre-trained model such as BERT is instantiated and fine-tuned on a natural language inference data set, and the tokens of each sentence are mapped to the output embeddings of BERT. The output is then fed to a pooling layer, using the mean strategy or a combination of several pooling strategies, to produce the sentence embedding vector. In practice, the sentence transformer is built from two modules, word_embedding_model and pooling_model. Each sentence first passes through the word_embedding_model module, and the pooling_model module then outputs a sentence embedding vector of fixed length.

Next, a training data loader is specified: the NLIDataReader module reads the AllNLI data set and produces a data loader suitable for training the sentence transformer model. The training loss is calculated with a softmax classifier. The generated sentence embedding vectors can be applied to a variety of downstream tasks such as text clustering and semantic similarity analysis.

A validation set can also be specified to evaluate the sentence embedding model on unseen data. In this experiment, the validation set of the STS benchmark data set is used to evaluate the model.

4 The Experimental Results and Analysis

In this paper, data are collected from the training platform of State Grid and the MOOC network of China, yielding training course data and MOOC online course data. In the State Grid training course data, each course has 13 fields, such as courseware number, courseware name, production time and cumulative number of learners, and there are 1214 course names in total. The MOOC online course data cover part of the catalogue of online MOOCs of Chinese universities from 2016 to 2018, with 1352 course names in total. For the State Grid training course files, only the courseware names are needed, so a script first extracts them and exports them to a new text file named wenben1.txt. The MOOC online course catalog file can be imported directly into a new text corpus named wenben2.txt. Part of the course catalog of each data set is shown in Figs. 3 and 4, respectively.
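The extraction script itself is not given. A minimal sketch of such a step, assuming the course table has been exported as CSV with a header row, might look as follows; the column name courseware_name, the sample rows and the helper names are hypothetical.

```python
import csv
import io
from typing import IO, List


def extract_column(csv_file: IO[str], column: str) -> List[str]:
    """Return the non-empty values of one column from a CSV file with a header."""
    names = []
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        if value:
            names.append(value)
    return names


def write_lines(lines: List[str], path: str) -> None:
    """Write one course name per line, e.g. to wenben1.txt."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


# Made-up sample standing in for the exported course table.
sample = io.StringIO(
    "courseware_number,courseware_name\n"
    "001,Power Grid Safety\n"
    "002,\n"
    "003, Relay Protection \n"
)
course_names = extract_column(sample, "courseware_name")
print(course_names)  # ['Power Grid Safety', 'Relay Protection']
```

Calling write_lines(course_names, "wenben1.txt") would then produce the one-name-per-line corpus file used in the later experiments.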

Fig. 3. National grid training course statistics table

Fig. 4. Statistics of MOOC online courses

The classification of course names in this article is a short-text analysis task. Two analysis methods, K-means clustering and semantic search, are applied to the State Grid courseware name text data set and the MOOC online course name text data set to further illustrate the effectiveness of the Sentence-BERT model. The experimental environment is built on the Google Colab online computing platform. The software uses transformers 2.8.0 or later, together with tqdm and torch 1.0.1 or later.

4.1 K-means Text Clustering Experiment Results and Analysis

First, the sentence embedding vector is calculated for each course name in the State Grid courseware text database (wenben1.txt) and the MOOC online course text database (wenben2.txt). Then, the third-party Python 3 machine learning library sklearn is used to perform K-means clustering (K is the number of clusters, which can be set to different values). The sentence embedding vectors are generated with a trained Sentence Transformer model, which produces a NumPy array containing a 768-dimensional embedding for each sentence, as shown in Fig. 5.
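The clustering step can be illustrated with a small plain-Python version of the standard K-means (Lloyd) iteration. This is a sketch for exposition only and is not the sklearn implementation used in the experiment; the 2-dimensional toy vectors stand in for the 768-dimensional sentence embeddings, and the function names are assumptions.

```python
import random
from typing import List


def squared_distance(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: List[List[float]], k: int,
           iterations: int = 20, seed: int = 0) -> List[int]:
    """Return a cluster label in range(k) for every input vector."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy embeddings forming two well-separated groups (made-up values).
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
labels = kmeans(embeddings, k=2)
print(labels)
```

The labels can then be used to group the original course names by cluster, which is how the clustering figures in this section are organised.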

For the State Grid courseware text database and the MOOC online course text database, K was set to 10, 20 and 50 for cluster analysis and comparison experiments. Some of the experimental results are shown in Figs. 6, 7 and 8. The results show that the larger the value of K, the finer the division into clusters, but too large a value tends to make the grouping confused. This is because clustering depends entirely on the sentence embedding vectors generated above. If the amount of earlier training data is insufficient, the information is not fully reflected in the embeddings, which ultimately results in insufficient discrimination between clusters.

Fig. 5. The generated sentence embedding vector array

Fig. 6. 10 clustering results of the National Grid courseware text library

Fig. 7. 20 clustering results of MOOC online course text library

Fig. 8. 50 clustering results of MOOC online course text library

4.2 Sentence-BERT Semantic Search Experimental Results and Analysis

Semantic search is the task of finding sentences that are similar to a given sentence. As with cluster analysis, sentence embeddings are first computed for all sentences in the corpus, and the input query sentence is then embedded in the same way. The SciPy toolkit in Python is used to search the NumPy array of corpus embeddings for the entries most similar to the query sentence, and the top 8 results are displayed.
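The retrieval step can be sketched in plain Python as a ranking by cosine similarity. This stands in for the SciPy-based search described above rather than reproducing it; the corpus texts, vectors and function names are made up for illustration.

```python
import math
from typing import List, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_search(query_vec: List[float],
                    corpus_vecs: List[List[float]],
                    corpus_texts: List[str],
                    top_k: int = 8) -> List[Tuple[float, str]]:
    """Return the top_k (score, text) pairs, most similar first."""
    scored = [
        (cosine_similarity(query_vec, vec), text)
        for vec, text in zip(corpus_vecs, corpus_texts)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy corpus with made-up 2-dimensional embeddings.
texts = ["course A", "course B", "course C"]
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = semantic_search([1.0, 0.1], vecs, texts, top_k=2)
for score, text in results:
    print(f"{text}\t(score: {score:.4f})")
```

Setting top_k to 8 matches the number of results shown in the experiment; each query costs only one similarity computation per corpus entry because the corpus embeddings are computed once in advance.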

In the same way, semantic search experiments are carried out on the State Grid courseware text library and the MOOC online course text library, and the results are shown in Figs. 9 and 10. For each query sentence, whether it is a single text or a combination of several texts, the embedding vector is calculated by treating the query as a single line of text. For each query sentence, a score is calculated from the semantic similarity. The higher the score, the more semantically similar the retrieved text is to the query sentence, and the 8 closest retrieved texts are listed in order of score.

Comparing the two sets of semantic search results shows that semantic matching can indeed be performed on the sentence semantic vector rather than on the character-level overlap of phrases in the sentence. However, the experimental results also show that the quality of the semantic search results varies. There are two possible reasons: first, the training corpus has a low degree of correlation with the text tested in the experiment, so the model does not learn enough about the semantics of these sentences; second, the number of sentences in the query and retrieval texts is small and unevenly distributed, which leads to errors in semantic matching.

Fig. 9. Semantic search test results of the State Grid courseware text library

Fig. 10. Semantic search test results of MOOC online course text library

5 Conclusion

This paper presents an intelligent search method based on natural language processing technology. The pre-trained Sentence-BERT language model is used to improve the learning and reasoning ability of the machine. Based on the sentence vector features and embedding vectors generated by the twin network, clustering and semantic search are carried out respectively. Experimental analysis of the two tasks on the Google Colab platform shows that the method meets the requirements of intelligent short-text search.