Using FLOSS for Storing, Processing and Linking Corpus Data

Mukhamedshin, Damir; Nevzorova, Olga; Kirillovich, Alexander

doi:10.1007/978-3-030-47240-5_17

Part of the book series: IFIP Advances in Information and Communication Technology ((IFIPAICT,volume 582))

Included in the following conference series:

IFIP International Conference on Open Source Systems

1635 Accesses
2 Citations

Abstract

Corpus data is widely used to solve different linguistic, educational and applied problems. The Tatar corpus management system (http://tugantel.tatar) is specifically developed for Turkic languages. The functionality of our corpus management system includes a search of lexical units, morphological and lexical search, a search of syntactic units, a search of N-grams and others. The search is performed using open source tools (database management system MariaDB, Redis data store). This article describes the process of choosing FLOSS for the main components of our system and also processing a search query and building a linked open dataset based on corpus data.

You have full access to this open access chapter, Download conference paper PDF

Choosing the Right Storage Solution for the Corpus Management System (Analytical Overview and Experiments)

cqp4rdf: Towards a Suite for RDF-Based Corpus Linguistics

An Automatic Approach to Generate Corpus in Spanish

Keywords

1 Introduction

In this paper, we discuss the development of the corpus management system for the Tatar National Corpus “Tugan Tel” [1]. The corpus is organized as a collection of texts covering different genres, such as fiction, news, science, official, etc. A text consists in sentences (called contexts). The words from a sentence are provided with morphological annotation. The annotation of a word includes the lemma, POS and the sequence of grammatical features expressed by the affixes.

Access to the corpus is provided by the corpus management system. The system allows users to search for word tokens with a specified full form or lemma and grammatical features. The search results are represented as a list of the sentences containing the found word tokens. Figure 1 shows the search result for the word kuman ‘book’ in the plural number and the genitive or the directive case. The tooltip under one of the found tokens contains its morphological annotation.

The general architecture of the corpus management system is represented at Fig. 2. This model includes three main components: a web interface, a search engine, and a database. The system imports the annotated texts from the corpus annotation tool, and exports them to a triplestore of the LLOD publishing platform.

Development of the corpus management system cannot begin without a clear understanding of which tools will be used to store and process corpus data. Therefore, one of the important factors when choosing and using tools for data storing and processing is for us their open source code, which is fundamental to ensure transparency and flexibility of the tool.

When choosing tools for processing and storing corpus data, we were faced the task of finding a set of FLOSS to ensure high speed of performing search queries (no more than 0.1 seconds for direct search queries, no more than 1 second for reverse search queries), wide search capabilities (at least direct and reverse search, search in parts of word forms and lemmas, mixed search, phrase search), and the possibility of further growth of system performance and functionality.

In Sect. 2 we describe using of FLOSS in the corpus management system, and in Sect. 3 we describe using of FLOSS the LLOD publishing platform.

2 Using FLOSS in Corpus Management System

To choose FLOSS for storing and processing of corpus data, we formed a list of the most important selection criteria:

1.
Performance. The solution should produce an average of at least 10 search operations per second.
2.
Functionality. The solution should provide a possibility of direct and reverse search by wordforms and lemmas, search by parts of wordforms and lemmas, and phrase search.
3.
Compatibility with other software. FLOSS for data storage should work with FLOSS for data processing and vice versa.
4.
Completeness of documentation and presence of a large community. Everyone should be able to continue its development, focusing on the documentation and the accumulated experience of the community.
5.
Development prospects. The solution should develop and maintain modern technologies.

We analyzed information in public sources [2] and chose FLOSS according to criteria 2–5. The resulting FLOSS set is presented in Table 1.

Table 1. Chosen FLOSS for storing and processing corpus data.

Full size table

To verify compliance with the first criterion, we conducted a series of experiments on writing and searching based on the generated data [3]. The best results were shown by the MySQL + Redis suite, which was chosen by us to store data in the corpus management system.

To build the system based on the selected components, additional elements must be included. So, to bind the PHP interpreter and the Redis data warehouse, we use the PhpRedis extension. In order for the PHP interpreter and MySQL DBMS to work in tandem, we use the php-mysql package, namely the mysqlnd driver, which allows working with the DBMS using its own low-level protocol.

Thus, the scripts executed by the PHP interpreter allow operations with data both from the Redis data storage and from the MySQL DBMS. Performing operations in the order necessary to solve the tasks assigned to the corpus data management system, PHP scripts are the link between indexes and cached data stored in Redis storage, index tables and text data stored in MySQL DBMS.

Let us consider how the processing of a simple direct search query alma _tat/apple _en is executed. The process execution of such query is shown in Fig. 3.

First of all, the script that performs this task searches the identifier of the word-form alma in the Redis database, by making a request using the PhpRedis extension. Already at this point, the script may report an error in the query, if the identifier of the desired wordform is not found in the corpus data.

In the second step, the script makes a request to the MySQL database using mysqlnd. In this case, the script needs to get a list of sentences containing the desired wordform. This data is stored in the index table of corpus data. In response, the script receives a list of context (sentence) identifiers, according to which in the third step the required number of sentence texts is requested. The received data is displayed to an user in the form of an HTML document or a JSON structure.

3 Building a Linked Open Dataset Based on Corpus Data

The corpus has been published on the Linguistic Linked Open Data cloud [4]. The dataset is represented in terms of NIF [5], OntoLex-Lemon [6], LexInfo, OLiA [7] and MMoOn [8] ontologies.

LLOD publishing platform consists in the following open-source components (Fig. 4). The RDF is stored in the OpenLink Virtuoso triplestore. The dataset is available via deferrable URI’s and SPARQL endpoint. Deferrable URI’s are accessible throws the LodView RDF browser, based on the Apache Tomacat webserver. The query interface is powered by YASQE [9].

Publication of the corpus on the LLOD cloud makes possible its interlinking with the external linguistic resources for Tatar, including Russian-Tatar Socio-Political Thesaurus [10], TatWordNet and TatVerbBank.

4 Conclusion

The solutions presented in this article are applied in the developed corpus management system. Using FLOSS significantly reduced the development time and ensured the transparency of processes, flexibility, and possibility of in-depth analysis, as well as opportunities for further development.

The corpus data management system is used to work with Tatar texts. The total volume of the collection of Tatar texts is about 200 million wordforms. The average execution time of the direct search query does not exceed 0.05 s in 98% of cases, and the reverse search is performed by the system within 0.1 s in 82% of cases, which exceeded the expected system performance. In many ways, such performance is provided by FLOSS, used for data storage and processing. Also, thanks to FLOSS, search capabilities were expanded in the system. The search using category lists, the complex search using logical expressions, the search for named entities and other functions were added to the advanced version of the system.

Currently, the corpus management system is in open beta testing and available online at http://tugantel.tatar. After that, we are going to release it under an open license.

References

Suleymanov, D., Nevzorova, O., Gatiatullin, A., Gilmullin, R., Khakimov, B.: National corpus of the Tatar language “Tugan Tel”: grammatical annotation and implementation. In: Vargas-Sierra, C., (ed.). Selected Papers from the 5th International Conference on Corpus Linguistics, (CILC2013) [Special issue]. Proc. Soc. Behav. Sci. 95, 68–74 (2013). https://doi.org/10.1016/j.sbspro.2013.10.623
Katkar, M., Kutchhii, S., Kutchhii, A.: Performance analysis for NoSQL and SQL. Int. J. Innov. Emerg. Res. Eng. 2(3), 12–17 (2015)
Google Scholar
Mukhamedshin, D., Suleymanov, D., Nevzorova, O.: Choosing the right storage solution for the corpus management system (analytical overview and experiments). In: Bouhlel, M.S., Rovetta, S. (eds.) SETIT 2018. SIST, vol. 146, pp. 105–114. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-21005-2_10
Chapter Google Scholar
Cimiano, P., Chiarcos, C., McCrae, J.P., Gracia, J.: Linguistic linked open data cloud. In: Cimiano, P., et al. (eds.) Linguistic Linked Data: Representation, Generation and Applications, pp. 29–41. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-30225-2_3
Chapter Google Scholar
Hellmann, S., Lehmann, J., Auer, S., Brümmer, M.: Integrating NLP using linked data. In: Alani, H., Kagal, L., Fokoue, A., Groth, P., Biemann, C., Parreira, J.X. (eds.) ISWC 2013. LNCS, vol. 8219, pp. 98–113. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41338-4_7
Chapter Google Scholar
McCrae, J.P., Bosque-Gil, J., Gracia, J., Buitelaar, P., and Cimiano, P.: The OntoLex-lemon model: development and applications. In: Kosem I., et al. (eds.) Proceedings of the 5th biennial conference on Electronic Lexicography (eLex 2017), pp. 587–597. Lexical Computing CZ (2017)
Google Scholar
Chiarcos, C.: OLiA – ontologies of linguistic annotation. Seman. Web 6(4), 379–386 (2015). https://doi.org/10.3233/SW-140167
Article Google Scholar
Klimek, B., Arndt, N., Krause, S., Arndt, T.: Creating Linked data morphological language resources with MMoOn - the hebrew morpheme inventory. In: Calzolari N., et al. (eds.) Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp. 892–899. ELRA (2016)
Google Scholar
Rietveld, L., Hoekstra, R.: The YASGUI Family of SPARQL Clients. Seman. Web 8(3), 373–383 (2017). https://doi.org/10.3233/SW-150197
Article Google Scholar
Galieva, A., Kirillovich, A., Khakimov, B., Loukachevitch, N., Nevzorova, O., Suleymanov, D.: Toward domain-specific Russian-Tatar thesaurus construction. In: Proceedings of the International Conference IMS-2017, pp. 120–124. ACM (2017). https://doi.org/10.1145/3143699.3143716

Download references

Acknowledgements

The work was funded by Russian Science Foundation according to the research project no. 19-71-10056.

Author information

Authors and Affiliations

Kazan Federal University, Kazan, Russia
Damir Mukhamedshin, Olga Nevzorova & Alexander Kirillovich

Authors

Damir Mukhamedshin
View author publications
You can also search for this author in PubMed Google Scholar
Olga Nevzorova
View author publications
You can also search for this author in PubMed Google Scholar
Alexander Kirillovich
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Olga Nevzorova .

Editor information

Editors and Affiliations

Innopolis University, Innopolis, Russia
Vladimir Ivanov
Innopolis University, Innopolis, Russia
Artem Kruglov
Innopolis University, Innopolis, Russia
Sergey Masyagin
Innopolis University, Innopolis, Russia
Alberto Sillitti
Innopolis University, Innopolis, Russia
Giancarlo Succi

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Mukhamedshin, D., Nevzorova, O., Kirillovich, A. (2020). Using FLOSS for Storing, Processing and Linking Corpus Data. In: Ivanov, V., Kruglov, A., Masyagin, S., Sillitti, A., Succi, G. (eds) Open Source Systems. OSS 2020. IFIP Advances in Information and Communication Technology, vol 582. Springer, Cham. https://doi.org/10.1007/978-3-030-47240-5_17

Download citation

DOI: https://doi.org/10.1007/978-3-030-47240-5_17
Published: 05 May 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-47239-9
Online ISBN: 978-3-030-47240-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The International Federation for Information Processing (opens in a new tab)

Using FLOSS for Storing, Processing and Linking Corpus Data

Abstract

Similar content being viewed by others

Choosing the Right Storage Solution for the Corpus Management System (Analytical Overview and Experiments)

cqp4rdf: Towards a Suite for RDF-Based Corpus Linguistics

An Automatic Approach to Generate Corpus in Spanish

Keywords

1 Introduction

2 Using FLOSS in Corpus Management System

3 Building a Linked Open Dataset Based on Corpus Data

4 Conclusion

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Societies and partnerships

Navigation

Using FLOSS for Storing, Processing and Linking Corpus Data

Abstract

Similar content being viewed by others

Choosing the Right Storage Solution for the Corpus Management System (Analytical Overview and Experiments)

cqp4rdf: Towards a Suite for RDF-Based Corpus Linguistics

An Automatic Approach to Generate Corpus in Spanish

Keywords

1 Introduction

2 Using FLOSS in Corpus Management System

3 Building a Linked Open Dataset Based on Corpus Data

4 Conclusion

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Societies and partnerships

Search

Navigation