
1 Introduction

Word segmentation (WS) is a natural language processing (NLP) task that consists of dividing a string into constituent parts to serve a given purpose. The task is similar to word tokenization, but differs from it, as discussed below. Depending on the linguistic context or the application domain, the task varies in taxonomy. In the present article, we perform a systematic literature review (SLR) of WS applied to texts written in the Latin alphabet.

The motivation for this work originated in our experience processing legal texts in Portuguese. Due to errors in converting PDF files to plain text, long spurious strings emerged, such as ‘decisãoanteriorjáservecomomandadodeprisãopreventivaeofício’, which should be corrected to ‘decisão anterior já serve como mandado de prisão preventiva e ofício’ (previous decision already serves as a warrant for preventive custody and as an official letter). Looking for solutions to the problem, we found the nltk.tokenize library, which includes the sub-module nltk.tokenize.stanford_segmenter (Footnote 1), but it only supports Chinese and Arabic. In prior exploratory research, we found some word segmentation tools for English with technical analyses, but without scientific benchmarks (Footnote 2). These initial experiences motivated us to conduct a systematic literature review.

Word segmentation (WS) and word tokenization (WT) can be confused with each other, as both produce sub-strings as a result. The difference lies in the input strings and in whether or not word delimiters (WDs), such as spaces or punctuation, are present. In languages such as Portuguese or English, the WT input string is normally made up of several words separated by WDs; when it is not, WS can be used to separate the tokens. In languages like Chinese, there are no WDs, so WS is most commonly employed. In this way, WS can be used as a WT subtask whenever there is a string that needs to be segmented. In the following, we focus on a formal description of the word segmentation problem (WSP), which can be defined as an optimization problem. A general formulation is: given a string s, consisting only of non-delimiting characters, find a split of s into a list of words \(W = \langle w_1, w_2, \ldots, w_n \rangle\), with \(w_1 \cdot w_2 \cdots w_n = s\) and \(|w_i| \ge 1\), such that an objective function f(W) is optimized and a set of constraints is satisfied. There is a considerable number of different WSP definitions in the literature, each with a particular aim and set of constraints. A common and simple specialization of the general formulation is to ignore f (or make it constant) and to require that every \(w_i\) belongs to a given dictionary. Another specialization is to find a segmentation W with the minimum number of splits. This can be formalized as: minimize \(f(W) = |W|\), constrained to have every \(w_i\) belonging to a dictionary. It is also possible to deal with imprecision or errors in s. In that case, f(W) could measure how much the terms in W deviate from their most similar words in a dictionary. A usual constraint for that case is to enforce that every word in W differs by at most k characters from its closest valid word in the dictionary. Different WSP formulations, in general, demand distinct algorithmic approaches for providing a good solution.
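
To make the minimum-split specialization concrete, the following minimal sketch (ours, for illustration only, not taken from any reviewed study) solves it with dynamic programming; the dictionary and the example string are assumptions chosen for the example.

```python
# Minimal sketch of the minimum-split WSP specialization: find a segmentation
# W of s with every word in a given dictionary and |W| minimal (None if no
# valid segmentation exists). Dictionary and example are illustrative only.
def segment_min_splits(s, dictionary):
    n = len(s)
    best = [None] * (n + 1)   # best[i] = minimal-split segmentation of s[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            word = s[j:i]
            if best[j] is not None and word in dictionary:
                candidate = best[j] + [word]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]

if __name__ == "__main__":
    dictionary = {"print", "file", "printfile", "homes", "and", "gardens"}
    print(segment_min_splits("homesandgardens", dictionary))  # ['homes', 'and', 'gardens']
```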

Fig. 1. Word segmentation application domains

Word segmentation tasks also vary in method and taxonomy according to the application domain, as seen in Fig. 1. Word alignment (WA) is a machine translation (MT) task used for translating texts from one language to another. Languages with many compounds, such as German, make this task more difficult, because compounds have to be split in order to find the corresponding words in the target language; for example, translating the German compound ‘aktionsplan’ to the English words ‘plan for action’. In this context, the WS task is called compound splitting. Program comprehension (PC): in software engineering, WS is used to analyze source code by dividing identifiers, such as variable names, into acronyms or understandable parts; for example, ‘printfile’ to ‘print file’. In this context, WS is called identifier splitting (see the sketch after this paragraph). Social analytics (SA): in order to gain a better understanding of the Web, WS can be used to analyze hashtags and domain names (URLs); for example, ‘homesandgardens’ to ‘homes and gardens’. In this context, WS is called hashtag splitting or domain name splitting, respectively. Morphological analysis (MA): a word can be decomposed into morphemes in order to understand its formation, for example, sleep-ing or dis-member-ed; WS is also used in this context [3]. Natural language processing (NLP): this is the most general case, in which the input text has been affected by noise [2] such as typos, OCR errors, character-encoding conversion errors, speech-to-text conversion errors, etc.
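
As a hedged illustration of identifier splitting (not the method of any reviewed study), the sketch below first breaks an identifier on underscores and camelCase boundaries and then splits remaining same-case runs with a longest-match pass over a small dictionary; the dictionary and helper names are assumptions for the example.

```python
import re

# Minimal identifier-splitting sketch: split on underscores/digits/camelCase,
# then split remaining same-case runs with a longest-match dictionary pass.
# The dictionary below is illustrative only.
DICTIONARY = {"print", "file", "get", "user", "name"}

def split_same_case(token):
    parts, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):          # longest match first
            if token[i:j].lower() in DICTIONARY:
                parts.append(token[i:j])
                i = j
                break
        else:                                        # no dictionary word found
            parts.append(token[i:])
            break
    return parts

def split_identifier(identifier):
    # Break on underscores and lower-to-upper camelCase boundaries first.
    tokens = re.split(r"_+", identifier)
    tokens = [t for tok in tokens
              for t in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", tok)]
    return [part for tok in tokens for part in split_same_case(tok)]

print(split_identifier("printfile"))      # ['print', 'file']
print(split_identifier("getUser_name"))   # ['get', 'User', 'name']
```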

The methodological framework applied for the development of this work follows the recommendations of Kitchenham [4], which establish a sequence of steps for producing consistent, auditable, and reproducible systematic reviews. The methodology suggested by the authors involves three stages: creating a review protocol, conducting the review, and presenting the results. The following sections reflect this methodology.

2 Review Protocol

We now present the planning stage of the SLR methodology. This section is divided into four subsections. Section 2.1 establishes the research questions; Sect. 2.2 presents the keywords and search strategies; Sect. 2.3 defines the inclusion and exclusion criteria; and Sect. 2.4 defines the quality assessment.

2.1 Research Questions

The main objective of the SLR was to answer the following question: “What is the state of the art in WS methods?”. More specific questions that unfold the main one were also formulated: (RQ.1) What are the differences in WS methods in specific contexts? (RQ.2) Which technique performed best in specific contexts? (RQ.3) What is the state of the art in WS in the Portuguese language context?

2.2 Search Strategy

Conducting the searches takes into account three primary factors: study sources, search keys, and scope delimitation. Nine study sources (Footnote 3) were chosen considering previous SLRs and informal conversations with literature review experts: ACM Digital Library (AC), arXiv (AX), Google Search (GO), Google Scholar (GS), IEEE Xplore (IX), Scopus (SC), SpringerLink (SL), Science Direct (SD) and Web of Science (WS). In order to formulate search criteria, we separated the search into three types of aspects: search elements (SE, Table 1), search restrictions (SR, Table 2) and search filters (SF, Table 3). A search string (SS) consists of a combination of SEs, with SRs and SFs then applied to limit the results, as shown in Table 4.

Table 1. Search elements
Table 2. Search restrictions

It is necessary to apply search restrictions in order to limit the number of search results to a manageable number of works to read. For example, in Table 2, we use search restrictions to eliminate results outside the desired domains (education, philosophy, etc.) and languages (Chinese, Thai, etc.).

Table 3. SF - Search filters

For each search engine, one or two searches were performed. This was necessary due to the large number of results of some specific searches and to limitations on search string length. To facilitate the documentation of the searches, a database in JSON format was edited (Footnote 4) and a bash script was created (Footnote 5). These components allow generating the desired search strings. For example, to repeat search IX2, the second search in the IEEE Xplore Digital Library, we can execute the command ‘gen-search-string’ as described in Fig. 2.

Fig. 2. Generating a search string with a bash script

Note that only SE and SR items were combined, since the SF value for the example above is ‘None’. For searches that have filters, it is necessary to apply them in the web interface of the digital library. For example, the GS2 search has filters F1 and F2, so it is necessary to select the options ‘publish from 2014 to 2019’ and ‘search content in English’ (see Table 3). With this approach, it was possible to experiment with and apply different search strings in an efficient way.
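
The actual generator is the bash script of Footnote 5, which we do not reproduce here; as a purely hypothetical illustration of the idea, the Python sketch below combines search elements and restrictions drawn from an in-memory structure mirroring the described JSON database. All field names and values are assumptions, not the real database schema.

```python
# Hypothetical sketch of search-string generation (the real tool is the bash
# script of Footnote 5). Field names and values below are illustrative only.
database = {
    "elements": {
        "SE1": '"word segmentation"',
        "SE2": '"compound splitting"',
    },
    "restrictions": {
        "SR1": 'NOT "chinese"',
        "SR2": 'NOT "education"',
    },
    "searches": {
        "IX2": {"elements": ["SE1", "SE2"], "restrictions": ["SR1"], "filters": []},
    },
}

def gen_search_string(search_id, db=database):
    search = db["searches"][search_id]
    elements = " OR ".join(db["elements"][e] for e in search["elements"])
    restrictions = " ".join(db["restrictions"][r] for r in search["restrictions"])
    return f"({elements}) {restrictions}".strip()

print(gen_search_string("IX2"))  # ("word segmentation" OR "compound splitting") NOT "chinese"
```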

Table 4. SS - Search strings

2.3 Inclusion/Exclusion Criteria

Inclusion and exclusion criteria were defined to guide the selection of relevant studies. For a study to be selected, all inclusion criteria had to be met and no exclusion criterion could apply.

In this sense, we chose the following inclusion criteria: (IC1) having the full text available; (IC2) having an abstract; (IC3) being written in English or in Portuguese; (IC4) being a scientific study or grey literature. As scientific studies, we considered papers, technical reports, surveys, master’s dissertations and doctoral theses. As grey literature [4], we included technical reports, preprints, work in progress, software repositories with source code, and documentation in web portals. For the latter, we accepted web portals with a relevant publication volume and a good evaluation from their users, or simply based on an ad-hoc assessment. The exclusion criteria were: (EC1) not addressing WS; (EC2) addressing studies specific to African languages; (EC3) addressing studies specific to Asian languages.

2.4 Quality Assessment (QA)

The following quality assessment questions were devised: (QA1) Is the research context described in the study? (QA2) Is the research methodology clearly explained in the study? (QA3) Are the data and performance analyses clearly explained in the study?

For each question, three possible answers were established: Yes, Partially, and No, assigned scores of 1.0, 0.5 and 0.0, respectively. Thus, each study could reach a maximum of 3.0 points and a minimum of 0.0 points. All studies scoring below 2.0 points were disqualified (excluded).
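
As a minimal sketch of this scoring scheme (study names and answer lists are illustrative only), the snippet below maps answers to scores and keeps only studies at or above the cutoff:

```python
# Minimal sketch of the quality-assessment scoring: map answers to scores,
# sum them, and disqualify studies below the 2.0-point cutoff.
SCORES = {"Yes": 1.0, "Partially": 0.5, "No": 0.0}
CUTOFF = 2.0

def qa_score(answers):           # answers: one entry per question QA1..QA3
    return sum(SCORES[a] for a in answers)

# Illustrative studies only.
studies = {"study A": ["Yes", "Partially", "Yes"], "study B": ["No", "Partially", "Yes"]}
qualified = {name: qa_score(ans) for name, ans in studies.items() if qa_score(ans) >= CUTOFF}
print(qualified)  # {'study A': 2.5}
```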

3 Conducting the Review

Using the search strings, we found and downloaded the resulting references in BibTeX format (Footnote 6). All digital libraries exported to this format except SpringerLink (SL), whose references were available only in CSV and had to be converted to BibTeX using the csv2bib tool (Footnote 7). The SLR was managed using Parsifal (Footnote 8), which, in addition to importing the BibTeX items, also supported duplicate detection, selection, classification and data extraction.

In the selection stage, 771 studies were obtained as candidates. By reading the title and the abstract (when available) of each study, 604 papers were rejected, 89 were detected as duplicates, and 78 were approved. In the classification stage, of the 78 selected papers, 9 were eliminated with a score equal to or below 2.0 points and 69 were classified for data extraction. The data extraction step used a form created according to Table 5. At this stage, it was necessary to download the full text of all classified studies for a complete reading. The Zotero software (Footnote 9) was employed for managing and sharing these texts.

Table 5. Data extraction form

The data extracted from the studies in the last step are shown in Table 6, with references available in a BibTeX file (Footnote 10).

Table 6. Data Extracted from the studies

4 SLR Results

In this section, we answer the research questions, based on the extracted data.

4.1 RQ.1: What Are the Differences in WS Methods in Specific Contexts?

According to the data survey, when considering the use of the term ‘compound splitting’ as a specific context of the WS task for segmenting compound words, we obtained 21 studies: 7, 9, 13, 14, 16, 18, 19, 21, 22, 26, 27, 28, 30, 35, 37, 47, 49, 51, 52, 53, 56, 58, 69 (see Table 6). This represents 34.7% of the total number of papers. There is no occurrence of deep learning techniques in these studies. The most used methods are based on statistical techniques (ST), morphological analysis (MA) and lexical analysis, appearing in 7, 5 and 4 studies, respectively. In this context, the German language (G) had the highest number of occurrences, as did the machine translation (MT) application.

In the context of identifier splitting, 14 studies were found (20% of the total): 2, 4, 5, 8, 11, 24, 29, 38, 39, 50, 54, 57 and 59. The word dictionary (WD) and expand abbreviations (EA) techniques appeared in 4 and 2 studies, respectively. Deep learning (DL) was used in two works (29 and 54), and the most used language was English, in all occurrences.

In the more general context, which uses the term ‘word segmentation’, the largest number of studies was found, 34 (49.3%) in total. In this context, DL techniques were more frequent, appearing in about 11 studies (32% of this context). When DL is employed, RNN and LSTM techniques prevail, with 7 and 3 occurrences, respectively. Otherwise, statistical (ST), POS tagging (PT) and N-gram (N) techniques are the most frequent ones, with 12, 5 and 5 occurrences, respectively.

Figure 3(a) shows the number of selected scientific publications from 1998 to 2019 in each specific word segmentation context (WS, CS, IS). On average, since 1998, there has been an increase in the number of studies in the three segmentation contexts. CS and WS received more publications in the period 2016–2019.

4.2 RQ.2: Which Technique Performed Best in Specific Contexts?

To obtain the state of the art of WS techniques reliably, it is necessary to apply benchmarking on standardized corpora. Common corpora were found for the IS context, but there was no standardization for CS and WS.

In Fig. 3(b), we analyze the occurrence of DL techniques from 1998 to 2019. We note that, since 2010, there has been an increase in DL and a decrease in the use of other approaches, denoting a certain interest of the scientific community in this technique. Thus, we can say that the use of DL is a trend in recent years.

In the IS context, study 29 (see Table 6) presents a new state-of-the-art technique based on deep learning, called CNN-BiLSTM-CRF, which outperformed other techniques such as LINSEN, LIDS and DTW.
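
For intuition only, the sketch below defines a simplified character-level BiLSTM tagger in PyTorch that predicts a begin/inside label per character of an identifier; it omits the CNN and CRF components and is not the implementation of study 29. The character vocabulary, layer sizes and the untrained forward pass are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CharBiLSTMTagger(nn.Module):
    """Simplified sketch: character embeddings -> BiLSTM -> per-character B/I logits."""
    def __init__(self, vocab_size, emb_dim=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 2)   # 2 labels: B(egin) / I(nside)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.out(h)

chars = "abcdefghijklmnopqrstuvwxyz"
char2id = {c: i + 1 for i, c in enumerate(chars)}   # 0 reserved for padding/unknown

def encode(identifier):
    return torch.tensor([[char2id.get(c, 0) for c in identifier.lower()]])

model = CharBiLSTMTagger(vocab_size=len(chars) + 1)
logits = model(encode("printfile"))       # shape (1, 9, 2); untrained scores
tags = logits.argmax(-1).squeeze(0)       # predicted B/I label per character
```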

In the context of CS, there is no standardized corpus either. In general, metrics are based on the performance of CS when applied to machine translation, where BLEU was the most used metric. However, Escartín [1] suggested a way to measure CS performance using the precision, recall and F-measure metrics.
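
As one plausible way to compute these metrics for compound splits (our assumption, not necessarily the exact formulation in [1]), the sketch below compares predicted split positions against gold split positions:

```python
# Hedged sketch: precision/recall/F1 over split positions of a compound,
# one plausible formulation (not necessarily the exact metric of [1]).
def split_positions(words):
    positions, offset = set(), 0
    for w in words[:-1]:          # a split occurs after every word but the last
        offset += len(w)
        positions.add(offset)
    return positions

def prf(predicted_words, gold_words):
    pred, gold = split_positions(predicted_words), split_positions(gold_words)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(gold) if gold else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative gold standard: 'aktionsplan' split into 'aktions' + 'plan'.
print(prf(["aktion", "splan"], ["aktions", "plan"]))   # (0.0, 0.0, 0.0)
print(prf(["aktions", "plan"], ["aktions", "plan"]))   # (1.0, 1.0, 1.0)
```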

In the context of WS, there is no standardized corpus. However, studies 12 and 41 attempt to establish comparative metrics, reporting precision of 0.906 and 0.813, respectively. The most commonly cited technique, in studies 44, 20 and 23, was based on dynamic programming. Study 65 proposes techniques for generating a standardized corpus using Wikipedia. The ‘Google Web Trillion Word Corpus’ in English was cited in study 44. Other studies present situations with specific corpora: hashtag splitting (45, 32 and 46) and domain splitting (33, 46 and 10).
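
A common dynamic-programming formulation of WS scores candidate words with unigram frequencies and picks the segmentation that maximizes the sum of log-probabilities. The sketch below is our own illustration of this idea, not the method of studies 44, 20 or 23; the tiny frequency table and length limit are assumptions for the example.

```python
import math

# Illustrative frequency-scored WS sketch: maximize the sum of log unigram
# probabilities over a dynamic-programming table. Frequencies are toy values.
FREQ = {"homes": 50, "and": 1000, "gardens": 40, "home": 80, "sand": 30}
TOTAL = sum(FREQ.values())
MAX_WORD_LEN = 10

def log_prob(word):
    # Unknown words receive a heavy, length-dependent penalty.
    return math.log(FREQ[word] / TOTAL) if word in FREQ else -20.0 * len(word)

def segment(s):
    best = [(0.0, [])] + [(-math.inf, None)] * len(s)   # best[i]: (score, words) for s[:i]
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            score = best[j][0] + log_prob(s[j:i])
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [s[j:i]])
    return best[len(s)][1]

print(segment("homesandgardens"))  # ['homes', 'and', 'gardens']
```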

Fig. 3. (a) Use of WS, CS and IS from 1998 to 2019; (b) use of DL techniques from 1998 to 2019

4.3 RQ.3: What Is the State of the Art in WS in the Portuguese Language Context (PL)?

The authors in [5] developed a way of extracting English compounds from WordNet (Footnote 11). The same approach could be used in the Portuguese scenario, but we could not find any corpus annotations of compound words in the most recent WordNets (Footnote 12). In order to know how many compound words exist in PL, we extracted 1804 words from a website (Footnote 13). Most of these words were open compounds (929), in which a delimiter character separates the two parts of the word. According to the formal definition (Sect. 1), a problem consisting of a word with a delimiter character does not characterize a WSP. In addition, compared with English and German, the number of closed compounds (without a delimiter character) in PL is much lower.

In this SLR, only 9 studies (1, 25, 37, 41, 45, 60, 62, 64 and 67) are considered universal, and only 2 of them (45 and 60) make direct reference to PL. Paper 45 refers to a specific application to hashtags and paper 60 is considered universal. Among the studies, no software with direct support for the Portuguese language was found; all of them would need integration with specific training corpora in PL. Therefore, objective data for performance benchmarking are lacking. Considering this information, we can state that, compared to other languages, specific studies of WS for PL are lacking.

5 Discussion and Conclusions

In this SLR, we formally defined the problem of segmenting words written in the Latin alphabet, which is present in many application domains and under different denominations. Several contexts were found and enumerated. The most relevant are word segmentation (WS), identifier splitting (IS) and compound splitting (CS), in the natural language processing, software engineering and machine translation domains, respectively. We conducted a survey of the techniques employed in each context, as well as a historical analysis of the use of deep learning techniques in recent years. Through data extraction and analysis, we conclude that, for each context, some specific techniques are used more often than others. The most mature context in establishing a state of the art with standardized corpora is IS. In the other contexts (CS and WS), there is no standard corpus.