1 Introduction

Named Entity Recognition (NER) is a key task in natural language processing [19]. Its goal is to identify entities such as the names of people, places, and organizations in text. NER has a wide range of applications, including information extraction [9], machine translation [3], and question answering [18].

In recent years, NER has made great progress on high-resource languages, and many deep learning methods have achieved high accuracy [12, 22, 28]. However, training these methods relies heavily on large-scale datasets [21]. Consequently, the most significant advances in NER have been achieved in resource-rich languages such as English [20], French [25], German [4] and Chinese [8]. In contrast, NER performance on low-resource languages is still poor, which limits the understanding and processing of these languages to some extent. The biggest obstacle to high-quality NER is usually the lack of language resources, such as manually annotated datasets and pre-trained models.

The Uzbek language studied in this paper is one such low-resource language. It has about 30 million speakers, most of whom live in Uzbekistan, with the rest scattered across other Central Asian countries and Xinjiang, China. Nevertheless, relatively little research has been done on natural language processing for Uzbek. The difficulty of Uzbek NER lies in the limited scale of academic datasets for the language and the lack of a large-scale annotated corpus. To solve this problem and promote research on Uzbek NER, we constructed a large-scale human-annotated Uzbek named entity corpus. To address the issue of entity sparsity, we reviewed the corpus and kept only sentences that contain three or more entities. The resulting corpus contains 11,366 sentences and covers three entity types: person name, place name, and organization name.

NER can be solved by various methods, such as sequence labeling [24], span enumeration [22], hypergraph [16], sequence-to-sequence [28], and grid tagging [12]. Because the main goal of this paper is to build an Uzbek NER dataset and set up a strong baseline, we select one of the state-of-the-art (SoTA) NER models based on grid tagging as our baseline. Building on this model, we consider the characteristics of Uzbek and extend it by incorporating the language's distinctive affix features and by expanding the training corpus through translating Cyrillic text into Latin.

Moreover, we use BERT [6] and BiLSTM [10] to provide contextualized word representations, combine them with affix feature representations to form a 2D grid of word pairs, and use multi-grained 2D convolutions to refine the word-pair representations. Finally, we employ a co-predictor that combines a biaffine classifier and a multi-layer perceptron classifier to generate all possible entity mentions. Our results show significant performance improvements in Uzbek NER.

In comparison with four baseline models, our proposed model outperforms them, improving the F1 score by 0.34%. The grid-tagging-based method performs better because it attends to both entity boundaries and the information inside entities. Adding affix features and augmenting the corpus with translated data improve our model's performance by 0.46% and 0.58%, respectively.

Our contributions are as follows: 1) We constructed the first high-quality Uzbek NER corpus; 2) We introduced affix features and adopted data augmentation to improve NER performance in Uzbek; 3) Our model outperforms existing methods, achieves state-of-the-art performance, and sets a new benchmark for the Uzbek NER task.

Our work shows that data augmentation and feature engineering are two effective directions for improving NER in low-resource languages. Additional data and linguistic knowledge help the model learn a more general language representation and overcome the limitations of data scarcity, thereby improving its performance. This provides a useful reference for further advancing low-resource language processing.

2 Related Work

In low-resource scenarios, named entity recognition faces several challenges, such as the lack of large-scale annotated data, the uneven quality of annotated data, and inconsistent annotation standards. To address these problems, research on low-resource named entity recognition has emerged in recent years.

In past studies, many researchers have explored cross-language transfer for low-resource named entity recognition. These studies show that training a model on annotated data from a high-resource language and then transferring it to a low-resource language can effectively improve named entity recognition in the low-resource language. For example, [11] applied transfer learning to named entity recognition in Indonesian, and the results showed that the method performed better than the baseline model on low-resource languages. Similarly, Sun et al. (2018) [23] transferred an English entity recognition model to Hungarian and Italian and achieved good results.

In addition to cross-language transfer, some researchers have explored low-resource named entity recognition based on human annotation. This approach trains high-quality named entity recognition models on a small amount of data labeled by expert annotators. For example, Al-Thubaity et al. (2022) [2] used human-annotated Arabic datasets to train named entity recognition models and achieved good results. Similarly, Truong et al. (2021) [26] used a manually annotated Vietnamese dataset to train a named entity recognition model and achieved higher performance than the baseline on the test set. Other researchers have combined cross-language transfer with human annotation. This approach uses cross-language transfer to leverage knowledge from high-resource languages and then uses a small amount of human-labeled data to tune the model for better results. For example, Adelani et al. (2021) [1] used cross-lingual transfer and human-annotated data for named entity recognition in African languages and achieved higher performance than the baseline.

Uzbek belongs to the Altaic language family and has its own grammar and rich morphological structure. Uzbek NER therefore poses some special problems and challenges. Although there is some research on Uzbek natural language processing, none of it specifically addresses named entity recognition. To fill this gap, we have done three things. First, we constructed a news-based Uzbek named entity recognition corpus. Second, we used entity placeholders when converting the corpus from Uzbek Cyrillic to Uzbek Latin, in order to increase the diversity of the dataset, reduce the risk of overfitting, and improve cross-language performance. Third, we conducted various experiments on the corpus and incorporated affix features to provide morphological-level knowledge, with the aim of enhancing the accuracy and robustness of our entity recognition system.

3 Dataset Construction

3.1 Data Collection and Preprocessing

First, we selected the most visited Uzbek-language news website from a list of the most browsed websites in Uzbekistan. We then analyzed the website to determine the information that we wanted to crawl, which included the title, text, time, and author of news articles. We collected 1,000 news articles with a web crawler. The data was then cleaned to remove HTML tags, useless characters, and other extraneous information, so that it was in a consistent format and ready for subsequent analysis. The cleaned data was stored in a database or a text file to facilitate subsequent analysis and processing. Finally, we obtained the original corpus consisting of 49,019 sentences. The flowchart is shown in Fig. 1.

Fig. 1. Our overall workflow for dataset construction.

Fig. 2. Annotation schema and platform.

3.2 Data Annotation and Postprocessing

Our corpus was annotated by four annotators, two men and two women. They are graduate students majoring in linguistics who are non-native speakers but proficient in Uzbek. The annotation was carried out on the doccano platform and took nearly three months. The tool supports visualization and is easy to use. A typical example is shown in Fig. 2.

The specific steps are as follows. First, we trained the annotators: we informed them of the purpose of the annotation and the specific task content, showed them annotation cases to study and discuss, answered their questions, and finally explained the operating rules of the annotation system and the precautions to take.

Secondly, we divided the data into four parts and split the annotators into two groups, each with one man and one woman. Each annotator labeled one part. After finishing, the two members of each group exchanged their data for cross-labeling. Inter-annotator agreement was 0.89, as measured by span-level Cohen’s Kappa [5]. All four annotators then discussed the inconsistencies until a consensus was reached.

Finally, we went through all the data and corrected a few remaining errors. In terms of schedule, annotator training took about a week. After training, each annotator was given one batch to label over a period of one month. They then exchanged data with their group partner and spent another month on a new round of annotation; annotators paused whenever they were tired, bored, or sick, which we allowed in order to protect annotation quality. After these two months, the annotations were checked for consistency, inconsistencies were resolved jointly, and a further review was conducted. Thus, after nearly three months, we obtained a gold corpus of 49,019 labeled sentences consisting of 879,822 tokens, including 24,724 person names, 35,743 place names, and 25,697 institution names.
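
The agreement score mentioned above can be illustrated with a minimal sketch of Cohen's Kappa. The paper reports a span-level score but does not describe how spans were aligned, so this sketch simply assumes that both annotators have assigned one categorical label to each item in a shared list of candidate spans; the function name `cohen_kappa` and the label strings are illustrative only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labelled the same items.

    Assumption: labels_a[k] and labels_b[k] refer to the same candidate span.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    annotator_1 = ["PER", "PER", "LOC", "LOC"]
    annotator_2 = ["PER", "LOC", "LOC", "LOC"]
    print(cohen_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over a plain percentage of matching labels.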

Fig. 3. Our model architecture. \(\boldsymbol{H}^{x}\) and \(\boldsymbol{H}^{c}\) denote the original text embedding and the affix sequence embedding, respectively. \(\bigoplus \) and \(\bigotimes \) denote element-wise addition and concatenation, respectively. Both the convolution layer and the co-predictor layer come from the SoTA model [12].

Because entities are sparse in news sentences, we retained only sentences containing three or more entities. After filtering, we obtained a corpus of 11,366 sentences consisting of 402,707 tokens. The longest sentence has 56 words, the shortest has only 5 words, and the average length is 35.4 tokens. The corpus contains a total of 78,385 named entities. Among them, the longest person name consists of 5 tokens, the longest place name of 4 tokens, and the longest organization name of 14 tokens. We randomly split the gold corpus into training, validation, and test sets at a ratio of 6/2/2; the corpus statistics are shown in Table 1.
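
The filtering and splitting step described above could be sketched as follows. This is only an illustration under stated assumptions: each sentence is represented as a dictionary with an `entities` list, and the function name `filter_and_split`, the fixed random seed, and the integer rounding of the split sizes are hypothetical choices that the paper does not specify.

```python
import random


def filter_and_split(sentences, min_entities=3, ratios=(6, 2, 2), seed=0):
    """Keep entity-dense sentences and split them into train/dev/test sets.

    Assumption: every sentence is a dict with an "entities" list.
    """
    # Keep only sentences that contain at least `min_entities` entities.
    kept = [s for s in sentences if len(s["entities"]) >= min_entities]
    # Shuffle with a private generator so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(kept)
    # Integer arithmetic avoids floating-point rounding of the split sizes.
    total = sum(ratios)
    n_train = len(kept) * ratios[0] // total
    n_dev = len(kept) * ratios[1] // total
    train = kept[:n_train]
    dev = kept[n_train:n_train + n_dev]
    held_out = kept[n_train + n_dev:]
    return train, dev, held_out


if __name__ == "__main__":
    toy = [{"id": k, "entities": ["e"] * (3 if k % 2 == 0 else 1)} for k in range(20)]
    train, dev, held_out = filter_and_split(toy)
    print(len(train), len(dev), len(held_out))
```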

Table 1. Dataset statistics

4 Method

Our model improves on \(\textrm{W}^{2}\textrm{NER}\) [12] and consists of an embedding layer, a convolution layer, and a co-predictor layer. The difference is that affix features are integrated into the model. Our model architecture is shown in Fig. 3.

4.1 Features

Before introducing the model, we briefly describe how the affix sequence is constructed. Uzbek has many affixes that are helpful for identifying entities, so we collected the relevant affixes by entity type. During the labeling process, we found that Uzbek personal names usually end with “ov”, “ova”, and “lar”; place names usually end with “stan”, “shahr”, “ko’li”, etc.; and institution names often end with “markaz”, “kompaniya”, “tashkilot”, and similar endings. These affixes include 64 place name prefixes, 23 personal name prefixes, 24 personal name suffixes, and 105 organizational name suffixes. Based on our statistics, we use four special tags to represent affix features, namely [PRE-PER] (PER class prefix), [POST-PER] (PER class postfix), [POST-LOC] (LOC class postfix) and [POST-ORG] (ORG class postfix). If a token contains no affix, its position is filled with [PAD]. In this way, we can construct the affix sequence corresponding to the original text; the bottom part of Fig. 3 shows an original text and its corresponding affix sequence.
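
A minimal sketch of how such an affix sequence could be built is given below. It uses only the example endings quoted above rather than the full affix lists, and it assumes case-insensitive prefix and suffix matching with the first matching rule winning; the names `SUFFIX_RULES`, `PREFIX_RULES`, `affix_tag`, and `affix_sequence` are hypothetical, and the paper does not state how ambiguous matches are resolved.

```python
PAD = "[PAD]"

# Illustrative subsets taken from the examples in the text; the full affix
# lists are not reproduced here. Rule order and matching are assumptions.
SUFFIX_RULES = [
    ("[POST-ORG]", ("markaz", "kompaniya", "tashkilot")),
    ("[POST-LOC]", ("stan", "shahr", "ko'li")),
    ("[POST-PER]", ("ova", "ov", "lar")),
]
# The text gives no example of a personal-name prefix, so this stays empty.
PREFIX_RULES = [("[PRE-PER]", ())]


def affix_tag(token):
    """Return the affix tag for one token, or [PAD] if no affix matches."""
    word = token.lower()
    for tag, prefixes in PREFIX_RULES:
        if prefixes and word.startswith(prefixes):
            return tag
    for tag, suffixes in SUFFIX_RULES:
        if word.endswith(suffixes):
            return tag
    return PAD


def affix_sequence(tokens):
    """Build the affix sequence C aligned one-to-one with the tokens X."""
    return [affix_tag(token) for token in tokens]


if __name__ == "__main__":
    print(affix_sequence(["Karimova", "shahr", "va", "markaz"]))
```

Because the affix sequence has exactly one tag per token, it can be embedded and added to the token embeddings position by position.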

Fig. 4. Flowchart for data translation. The input is Cyrillic text, which is finally translated into Latin.

4.2 Data Translation

Since Uzbek is written in both Latin and Cyrillic scripts, and inspired by Liu et al. [14], we translate Cyrillic text into Latin in the data preprocessing stage to augment the training corpus. The translation process has three steps: first, the entities in the Cyrillic sentence are replaced with special tags and the sentence is translated into Latin; second, the Cyrillic entities are translated into Latin one by one; finally, the Latin entities are filled back into the translated Latin sentence. All translation is performed with the Google translation model, and the overall flowchart is shown in Fig. 4.
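
The three steps above can be sketched as follows. The translation system itself is not called here: `translate` is a placeholder for any callable that maps a Cyrillic string to a Latin string, and the sketch assumes that it leaves the placeholder tags untouched. The tag format `[ENT0]`, `[ENT1]`, ... and the function name `augment_sentence` are hypothetical.

```python
def augment_sentence(tokens, entity_spans, translate):
    """Translate a sentence while handling its entities separately.

    tokens:       list of source-script tokens
    entity_spans: non-overlapping (start, end) token indices, end exclusive
    translate:    callable str -> str (assumed to preserve placeholder tags)
    """
    masked, entities = [], []
    position = 0
    # Step 1a: replace every entity with a numbered placeholder tag.
    for k, (start, end) in enumerate(sorted(entity_spans)):
        masked.extend(tokens[position:start])
        masked.append(f"[ENT{k}]")
        entities.append(" ".join(tokens[start:end]))
        position = end
    masked.extend(tokens[position:])
    # Step 1b: translate the sentence with the placeholders in place.
    translated = translate(" ".join(masked))
    # Steps 2 and 3: translate each entity on its own and fill it back in.
    for k, entity in enumerate(entities):
        translated = translated.replace(f"[ENT{k}]", translate(entity), 1)
    return translated


if __name__ == "__main__":
    # Toy stand-in for a real translation system: upper-cases ordinary words.
    def fake_translate(text):
        return " ".join(w if w.startswith("[ENT") else w.upper() for w in text.split())

    print(augment_sentence(["a", "b", "c", "d"], [(1, 3)], fake_translate))
```

Translating entities separately keeps their boundaries known after translation, so the existing annotations can be carried over to the new sentence.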

4.3 NER Model

Embedding Layer. The embedding layer is the same as in \(\textrm{W}^{2}\textrm{NER}\) and consists of BERT [6] and BiLSTM [10]. The input, however, contains not only the original text \(X=\{x_{1},x_{2},\ldots ,x_{n} \}\in \mathbb {R}^{n}\) of length n but also the affix sequence \(C=\{c_{1},c_{2},\ldots ,c_{n}\}\in \mathbb {R}^{n}\). After the embedding layer, the original text embedding \(\boldsymbol{H}^{x}\) and the affix sequence embedding \(\boldsymbol{H}^{c}\) are obtained:

$$\begin{aligned} \begin{aligned} \boldsymbol{H}^{x}=\{\boldsymbol{h}^{x}_{1},\boldsymbol{h}^{x}_{2},\ldots ,\boldsymbol{h}^{x}_{n}\}\in \mathbb {R}^{n \times d_{h}}, \\ \boldsymbol{H}^{c}=\{\boldsymbol{h}^{c}_{1},\boldsymbol{h}^{c}_{2},\ldots ,\boldsymbol{h}^{c}_{n}\}\in \mathbb {R}^{n\times d_{h}},\\ \end{aligned} \end{aligned}$$
(1)

where \(\boldsymbol{h}^{x}_{i}\), \(\boldsymbol{h}^{c}_{i}\in \mathbb {R}^{d_{h}}\) are the representations of the i-th token, and \( d_{h} \) represents the dimension of a token representation.

After that, we sum \(\boldsymbol{H}^{x}\) and \(\boldsymbol{H}^{c}\) at the element level to get the text embedding \(\boldsymbol{H}^{s}=\{\boldsymbol{h}^{s}_{1},\boldsymbol{h}^{s}_{2},\ldots ,\boldsymbol{h}^{s}_{n}\}\in \mathbb {R}^{n\times d_{h}}\) that incorporates affix features. The subsequent process is the same as \(\textrm{W}^{2}\textrm{NER}\), so we will only briefly introduce it.

Convolution Layer. After obtaining the text embedding \(\boldsymbol{H}^{s}\) that incorporates the affix feature, the Conditional Layer Normalization (CLN) mechanism is used to generate a 2D grid \(\boldsymbol{V}\), where each item \(\boldsymbol{V}_{ij}\) in \(\boldsymbol{V}\) is a representation of a word pair \((x_{i}, x_{j})\), so:

$$\begin{aligned} \begin{aligned} \boldsymbol{V}_{ij} = \text {CLN}(\boldsymbol{h}_{i}^{s},\boldsymbol{h}_{j}^{s})=\gamma _{ij}\odot (\frac{\boldsymbol{h}_{j}^{s}-\mu }{\sigma })+\lambda _{ij},\\ \end{aligned} \end{aligned}$$
(2)

where \(\boldsymbol{h}_{i}^{s}\) is the condition used to generate the gain parameter \(\gamma _{ij}=\boldsymbol{W}_{\alpha }\boldsymbol{h}_{i}^{s}+\boldsymbol{b}_{\alpha }\) and the bias \(\lambda _{ij}=\boldsymbol{W}_{\beta }\boldsymbol{h}_{i}^{s}+\boldsymbol{b}_{\beta }\) of layer normalization. \(\boldsymbol{W}_{\alpha }\), \(\boldsymbol{W}_{\beta }\in \mathbb {R}^{d_{h}\times d_{h}}\) and \(\boldsymbol{b}_{\alpha }\), \(\boldsymbol{b}_{\beta }\in \mathbb {R}^{d_{h}}\) are trainable weights and biases, respectively. \(\mu \) and \(\sigma \) are the mean and standard deviation across the elements of \(\boldsymbol{h}_{j}^{s}\).
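
To make Eq. (2) concrete, the following pure-Python sketch computes one grid cell and then the whole grid. It is an illustration rather than the actual implementation: vectors are plain lists, the weights are passed in explicitly instead of being learned, and the small constant `eps` added to the standard deviation is an assumption for numerical stability that does not appear in the equation.

```python
import math


def linear(W, b, x):
    """Affine map W x + b for a list-of-rows matrix W."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def cln(h_i, h_j, W_alpha, b_alpha, W_beta, b_beta, eps=1e-12):
    """Conditional layer normalization of h_j, conditioned on h_i (Eq. 2)."""
    gamma = linear(W_alpha, b_alpha, h_i)  # gain generated from the condition
    lam = linear(W_beta, b_beta, h_i)      # bias generated from the condition
    mu = sum(h_j) / len(h_j)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in h_j) / len(h_j))
    return [g * (v - mu) / (sigma + eps) + l for g, v, l in zip(gamma, h_j, lam)]


def build_grid(H, W_alpha, b_alpha, W_beta, b_beta):
    """Word-pair grid V with V[i][j] = CLN(h_i, h_j); shape n x n x d_h."""
    return [[cln(h_i, h_j, W_alpha, b_alpha, W_beta, b_beta) for h_j in H] for h_i in H]


if __name__ == "__main__":
    zeros = [[0.0, 0.0], [0.0, 0.0]]
    H = [[1.0, 3.0], [2.0, 6.0], [0.0, 4.0]]
    V = build_grid(H, zeros, [1.0, 1.0], zeros, [0.0, 0.0])
    print(len(V), len(V[0]), len(V[0][0]))
```

With zero weight matrices, a gain bias of ones, and a zero offset, the function reduces to ordinary layer normalization of \(\boldsymbol{h}_{j}^{s}\), which is a convenient sanity check.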

Then the word, position, and region information on the grid is modeled, where \(\boldsymbol{V}\in \mathbb {R}^{n\times n\times d_{h}}\) represents word information, \(\boldsymbol{V}^{p}\in \mathbb {R}^{n\times n\times d_{h_{p}}}\) represents the relative position information between each pair of words, and \(\boldsymbol{V}^{r}\in \mathbb {R}^{n\times n\times d_{h_{r}}}\) represents the region information for distinguishing the lower and upper triangle regions in the grid. They are concatenated to obtain the position-region-aware representation of the grid:

$$\begin{aligned} \begin{aligned} \boldsymbol{Z}=\text {MLP}_{1}([\boldsymbol{V};\boldsymbol{V}^{p};\boldsymbol{V}^{r}])\in \mathbb {R}^{n\times n\times d_{h_{z}}},\\ \end{aligned} \end{aligned}$$
(3)

Finally, multiple 2D dilated convolutions (DConv) with different dilation rates are used to capture the interactions between words at different distances, formulated as:

$$\begin{aligned} \begin{aligned} \boldsymbol{Q} = \text {GeLU}(\text {DConv}(\boldsymbol{Z})),\\ \end{aligned} \end{aligned}$$
(4)

where \( \boldsymbol{Q}\in \mathbb {R}^{n\times n\times d_{q}} \) is the output and \(\text {GeLU}\) is an activation function.

Co-Predictor Module. Finally, the word-pair relations are predicted by the co-predictor, which consists of an MLP predictor and a biaffine predictor. We use these two predictors to compute two independent relation distributions for each word pair \( (x_{i},x_{j}) \) at the same time and combine them into the final prediction. For the MLP predictor, the relation score of each word pair \( (x_{i},x_{j}) \) is calculated as:

$$\begin{aligned} \begin{aligned} \boldsymbol{y}^{'}_{ij}=\text {MLP}_{2}(\boldsymbol{Q}_{ij}),\\ \end{aligned} \end{aligned}$$
(5)

The input of the biaffine predictor is the input \( \boldsymbol{H}^{s} \) of the CLN, which can be considered as a residual connection. Two MLPs are used to calculate the representation of each word in the word pair \( (x_{i},x_{j}) \). Then, the relationship score between word pairs \( (x_{i},x_{j}) \) is calculated using a biaffine classifier:

$$\begin{aligned} \boldsymbol{y}^{''}_{ij}=\boldsymbol{s}_{i}^{\top }\boldsymbol{U}\boldsymbol{o}_{j}+\boldsymbol{W}[\boldsymbol{s}_{i};\boldsymbol{o}_{j}]+\boldsymbol{b}, \end{aligned}$$
(6)

where \( \boldsymbol{U} \), \( \boldsymbol{W} \) and \( \boldsymbol{b} \) are trainable parameters, and \( \boldsymbol{s}_{i}=\text {MLP}_{3}(\boldsymbol{h}_{i}^s) \) and \( \boldsymbol{o}_{j}=\text {MLP}_{4}(\boldsymbol{h}_{j}^s) \) represent the subject and object representations, respectively. Finally, we combine the scores from the MLP and biaffine predictors to get the final score:

$$\begin{aligned} \boldsymbol{y}_{ij}=\text {Softmax}(\boldsymbol{y}_{ij}^{'}+\boldsymbol{y}_{ij}^{''}). \end{aligned}$$
(7)
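
Eqs. (6) and (7) can be illustrated with a small pure-Python sketch. It assumes one bilinear matrix `U[c]`, one weight vector `W[c]`, and one bias `b[c]` per relation class, which is one common way to realize a biaffine classifier; the MLP scores of Eq. (5) and the subject and object vectors are passed in directly rather than computed. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biaffine_scores(s_i, o_j, U, W, b):
    """Eq. (6): one score per class c, s_i^T U[c] o_j + W[c] . [s_i; o_j] + b[c]."""
    concat = s_i + o_j  # list concatenation stands in for [s_i; o_j]
    scores = []
    for U_c, W_c, b_c in zip(U, W, b):
        bilinear = sum(
            s_i[p] * U_c[p][q] * o_j[q]
            for p in range(len(s_i))
            for q in range(len(o_j))
        )
        linear_term = sum(w * x for w, x in zip(W_c, concat))
        scores.append(bilinear + linear_term + b_c)
    return scores


def co_predict(mlp_scores, biaffine):
    """Eq. (7): softmax over the sum of the two predictors' scores."""
    return softmax([a + c for a, c in zip(mlp_scores, biaffine)])


if __name__ == "__main__":
    U = [[[0.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
    W = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
    b = [0.0, 0.5]
    y2 = biaffine_scores([1.0, 0.0], [0.0, 1.0], U, W, b)
    print(co_predict([0.5, 0.0], y2))
```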

Fig. 5. An example showing the process of identifying entities.

Decoding Algorithm. We decode entities based on two designed word-pair relations: (1) Next-Neighboring-Word (NNW) indicates that the word pair \((x_{i}, x_{j})\) belongs to an entity and that \(x_{j}\) is the next word after \(x_{i}\) in that entity; (2) Tail-Head-Word-* (THW-*) indicates that the word in the row of the grid is the tail of an entity and the word in the column of the grid is its head, where * denotes the entity type.

We also provide an example in Fig. 5 to explain the process of identifying different types of entities. For the PER entity “oydin”, the THW-PER relation shows that “oydin” is both the head and the tail of an entity, so it is an entity of length 1 whose category is PER. For the ORG entity “ozbekiston respublikasi oliy majlis”, the NNW relation with the subject “ozbekiston” and the object “respublikasi” shows that “ozbekiston respublikasi” is part of an entity. Similarly, “respublikasi oliy” and “oliy majlis” are recognized in the same way. Then, the THW-ORG relation shows that “ozbekiston” and “majlis” are the head and tail of the entity, so “ozbekiston respublikasi oliy majlis” is recognized completely and its category is ORG.
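
The decoding procedure can be sketched as a path search over the predicted relations. The sketch below assumes that the NNW predictions are given as a set of index pairs `(i, j)` with `i < j` and the THW-* predictions as a dictionary from `(tail, head)` to an entity type; the function name `decode_entities` and the toy token list, which is not the actual sentence in Fig. 5, are illustrative only.

```python
def decode_entities(nnw, thw):
    """Recover entities from NNW and THW-* word-pair relations.

    nnw: set of (i, j) with i < j, meaning x_j follows x_i inside an entity
    thw: dict mapping (tail, head) -> entity type
    Returns a set of (token_index_tuple, entity_type).
    """
    successors = {}
    for i, j in nnw:
        if i < j:  # forward edges only, so every path is strictly increasing
            successors.setdefault(i, []).append(j)
    entities = set()
    for (tail, head), entity_type in thw.items():
        # Depth-first search over NNW edges from the head to the tail.
        stack = [(head,)]
        while stack:
            path = stack.pop()
            last = path[-1]
            if last == tail:
                entities.add((path, entity_type))
                continue
            for nxt in successors.get(last, []):
                if nxt <= tail:
                    stack.append(path + (nxt,))
    return entities


if __name__ == "__main__":
    tokens = ["oydin", "ozbekiston", "respublikasi", "oliy", "majlis"]
    nnw = {(1, 2), (2, 3), (3, 4)}
    thw = {(0, 0): "PER", (4, 1): "ORG"}
    for indices, entity_type in sorted(decode_entities(nnw, thw)):
        print(entity_type, " ".join(tokens[k] for k in indices))
```

A single-word entity needs no NNW edge, because its head and tail coincide and the search stops immediately; following NNW edges rather than taking the whole head-to-tail span also lets the same procedure recover discontinuous entities.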

5 Experiments

5.1 Experimental Setting

We conduct experiments on our UzNER (Latin) dataset to evaluate the effectiveness of our proposed model. If the token sequence and type of a predicted entity are exactly the same as those of a gold entity, the predicted entity is regarded as true-positive. We run each experiment three times and report their average value.
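
The exact-match criterion above can be illustrated with a short scoring sketch. It assumes that each entity is represented as a hashable tuple such as (sentence id, token indices, type) and that scores are micro-averaged over all entities; the paper does not state the averaging scheme, so that choice and the function name `precision_recall_f1` are assumptions.

```python
def precision_recall_f1(gold, predicted):
    """Strict-match precision, recall, and F1 over two collections of entities.

    An entity counts as a true positive only if its tokens and its type
    both match a gold entity exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {(0, (0,), "PER"), (0, (1, 2), "ORG")}
    predicted = {(0, (0,), "PER"), (0, (1,), "ORG")}  # second span is too short
    print(precision_recall_f1(gold, predicted))
```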

Our model uses bert-base-multilingual-cased [6] as the backbone network. We apply a dropout of 0.5 to the output representations of both BERT and the convolution module, and a dropout of 0.33 to the output representations of the co-predictor module. The learning rates for BERT and for the other modules are 1e-5 and 1e-3, respectively, the batch size is 12, and \(d_{q}\) is chosen from 64, 80, 96, and 128. The hyper-parameters are tuned on the development set.

Table 2. Comparison with baseline models and ablation experiments. The best and second-best results in each column are marked; ablation results are excluded from this ranking.

5.2 Baselines

We use several existing models based on different methods as our baselines. All baseline models are trained on the corpus expanded with translated data. In addition, since our corpus is in Uzbek, the backbone networks are multilingual pre-trained models.

BiLSTM+CRF [10] is the most basic sequence labeling model. Due to the presence of discontinuous entities in the dataset, we use BIOHD [24] tags to decode the entities. BartNER [28] is based on the Seq2Seq method, and they use pre-trained language models to solve NER tasks. We use mbart-large-cc25 [15] as the backbone network. \(\textbf{W}^{2}\textbf{NER}\) [12] is based on a grid labeling method, which identifies all possible entities through word pair relationships. We use bert-base-multilingual-cased [6] as the backbone network. UIE [17] is a unified text-to-structure generation framework. UIE is not pre-trained in our experiments. We use mt5-base [27] as the backbone network.

5.3 Comparison with Baselines

The comparison results with the baseline models are shown in Table 2. We have the following findings: 1) Our model outperforms all four baseline models. Compared with the method of Li et al. (2022) [12], our method improves F1 by 0.34; 2) The grid-tagging-based method outperforms the other methods, because it attends not only to the boundaries of an entity but also to the information inside it; 3) BiLSTM+CRF performs worst, which is expected, because its structure is much simpler than that of the other models and it can therefore learn less.

5.4 Ablation Studies

The results of the ablation experiments are shown in Table 2. We mainly analyze the two improvements that we proposed. First, when we remove the affix features, the F1 of our model drops by 0.46, indicating that fusing affix features is effective and improves the performance of the model. Second, when we train the model without the translated data, its F1 drops by 0.58, which is expected because more data allows the model to learn more features. In addition, we also train the baseline model without the translated data, and its performance drops as well.

Table 3. Performance comparison among different entity classes. The best and second-best results in each column are marked; ablation results are excluded from this ranking.
Table 4. Error analysis experiment. EBE and ETE represent Entity Boundary Error and Entity Type Error, respectively.

5.5 Performance Analysis on Different Entity Types

We also explored the effectiveness of our method and the baseline methods on the three entity classes. The comparison results are shown in Table 3. Our method outperforms all baseline models on LOC and ORG, with F1 gains of 0.31 and 0.66 over the second-best results, respectively. On PER, our F1 is the second best among all models, only 0.02 lower than the best result.

In the lower part of Table 3, we also analyze the ablation results on different entity classes. First, with the removal of affix features, the performance of our model on all three types of entities degrades. Then, without training the model with translated data, the performance of our model drops on all three types of entities. Finally, the performance of the baseline model on all three types of entities also decreases when trained without translation data.

Fig. 6. The confusion matrix for error analysis. None represents non-entity. Numbers represent percentages. Rows and columns represent the gold and predicted results, respectively.

5.6 Error Analysis

We also performed an error analysis to learn more about our model. The results are shown in Table 4. Most of the errors are boundary errors, which account for 99.65% of all errors, because entity boundaries are difficult to identify; this is a well-known problem in previous work [7, 13]. In addition, we analyzed the proportion of errors for each entity type. For both kinds of error, the PER class accounts for the largest proportion. This is because PER entities have higher textual diversity, which makes them harder for the model to predict. Finally, Fig. 6 shows a heat map of the confusion matrix for the error analysis. The diagonal represents the proportion of correct predictions and, as expected, has the highest values. The first row and the first column have the next highest values, which is reasonable because these two parts correspond to boundary errors, consistent with the results in Table 4.

6 Conclusion

Our study proposes a novel approach that enhances a state-of-the-art model for Uzbek NER by incorporating the language's distinctive affix features and by expanding the training corpus through translating Cyrillic text into Latin. Our proposed model outperforms four baseline models with an F1 score improvement of 0.34%, demonstrating the effectiveness of our approach. The grid-tagging-based method is found to be superior to other methods because it attends to both entity boundaries and the information inside entities. Our findings highlight the importance of incorporating language-specific features and using advanced neural network architectures for NER tasks. In the future, exploring other language-specific features and integrating cross-lingual transfer learning could further improve the performance of NER models for low-resource languages like Uzbek.